
I have two series of tuples

$$ A = \{(x_0,t_{01},t_{02}),(x_1,t_{11},t_{12}),\ldots,(x_n,t_{n1},t_{n2})\} \\ B = \{(y_0,tt_{01},tt_{02}),(y_1,tt_{11},tt_{12}),\ldots,(y_m,tt_{m1},tt_{m2})\} $$

where $m \neq n$ and $t$ and $tt$ are both time observations. $t_{i1}$ and $tt_{j1}$ are both the start of some epoch for arbitrary $i$ and $j$. $t_{i2}$ and $tt_{j2}$ are both the end of some epoch for arbitrary $i$ and $j$. $x_i$ and $y_j$ are the observations of a sensor during those epochs.

In my case, $t_{i1}$ and $tt_{j1}$ are not guaranteed to match, nor are $t_{i2}$ and $tt_{j2}$. However, the epochs they delimit do overlap. For instance, if the epoch from $t_{11}$ to $t_{12}$ stretches from 2015-01-02 to 2015-01-03, then the epoch from $tt_{11}$ to $tt_{12}$ might go from 2015-01-01 to 2015-01-04.

What I'm hoping to do is align the observations $x_i$ and $y_j$ temporally into a single sequence that might look like

$$ C = \{(x_0,y_0,ttt_{01},ttt_{02}),(x_1,y_1,ttt_{11},ttt_{12}),\ldots,(x_p,y_p,ttt_{p1},ttt_{p2})\}, $$

where $p$ is the last observation in the sequence and the time observations $ttt_{k1}$ and $ttt_{k2}$ are the start and end of some epoch non-overlapping with later observations of $ttt_{\mathit{whatever}}$.

I've already looked into some of the papers available on the subject, and all of the work I've seen deals with image analysis, audio overlap, kernels, and other topics that seem a bit removed from what I'm looking for. Is there more basic theory available that deals specifically with the situation I'm describing here? Or is my case just a special case of something covered by the papers I've been reading?

Yuval Filmus
Greg

1 Answer


OK. So the problem is as follows (with different notation):

Inputs: disjoint intervals $I_1,\dots,I_k$; disjoint intervals $J_1,\dots,J_m$
Output: disjoint intervals $K_1,\dots,K_n$ that are a "refinement" of the $I,J$-intervals

The $I$-intervals are not necessarily disjoint from the $J$-intervals. We want the $K$-intervals to have the following property: if $x,y$ are in the same $K$-interval, then $x,y$ are in the same $I$-interval and $x,y$ are in the same $J$-interval.

This can be solved. Basically, you use a 'merge' procedure vaguely akin to the merge operation from Mergesort, but adjusted to deal with intervals. I'll assume the $I$-intervals are in sorted order, and the $J$-intervals are in sorted order (so $I_1$ is the leftmost/smallest interval of the $I$-intervals, etc.).

The algorithm considers how $I_1$ compares to $J_1$:

  • If $I_1$ is wholly to the left of $J_1$ (they don't overlap, and $I_1$ is smaller than $J_1$), then we output $I_1$, delete $I_1$ from the $I$-list, and continue.

  • If $I_1$ is wholly to the right of $J_1$ (they don't overlap, and $I_1$ is larger than $J_1$), then we output $J_1$, delete $J_1$ from the $J$-list, and continue.

  • If $I_1,J_1$ overlap and $I_1$ is contained in $J_1$, then do the following: if $I_1$ is the interval $[a,b]$ and $J_1$ is the interval $[c,d]$ (so $c<a<b<d$), output the interval $[c,a]$, output the interval $[a,b]$, delete $I_1$ from the $I$-list, delete $J_1$ from the $J$-list, and insert $[b,d]$ at the front of the $J$-list.

  • If $I_1,J_1$ overlap and $J_1$ is contained in $I_1$, do the symmetrical equivalent of the previous case.

  • If $I_1,J_1$ overlap but neither is wholly contained in the other, and the left endpoint of $I_1$ is to the left of the left endpoint of $J_1$, do the following: if $I_1$ is the interval $[a,b]$ and $J_1$ is the interval $[c,d]$ (so that $a<c<b<d$), output the interval $[a,c]$, output the interval $[c,b]$, delete $I_1$ from the $I$-list, delete $J_1$ from the $J$-list, and insert $[b,d]$ at the front of the $J$-list.

  • If $I_1,J_1$ overlap but neither is wholly contained in the other, and the left endpoint of $J_1$ is to the left of the left endpoint of $I_1$, do the symmetrical equivalent of the previous case.
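The case analysis above can be sketched in Python roughly as follows. This is only an illustration under some assumptions of mine: intervals are `(start, end)` pairs with `start < end`, each list is already sorted and internally disjoint, and intervals that merely touch at an endpoint count as non-overlapping. The function name `refine` is made up for this sketch, and the four overlap cases are folded into one branch, with index pointers standing in for "delete from the list".

```python
def refine(I, J):
    """Merge two sorted lists of disjoint (start, end) intervals
    into one sorted list of disjoint intervals refining both."""
    I, J = list(I), list(J)   # work on copies; leftovers are written back
    i = j = 0
    out = []
    while i < len(I) and j < len(J):
        a, b = I[i]
        c, d = J[j]
        if b <= c:
            # I-interval wholly to the left: output it and move on
            out.append((a, b))
            i += 1
        elif d <= a:
            # J-interval wholly to the left: output it and move on
            out.append((c, d))
            j += 1
        else:
            # The intervals overlap (covers containment and partial overlap)
            left = min(a, c)
            start = max(a, c)
            end = min(b, d)
            if left < start:
                out.append((left, start))   # part covered by only one list
            out.append((start, end))        # part covered by both lists
            # "Delete" both, then put the leftover back at the front
            i += 1
            j += 1
            if b > end:
                i -= 1
                I[i] = (end, b)
            elif d > end:
                j -= 1
                J[j] = (end, d)
    # One list is exhausted; the rest of the other passes through unchanged
    out.extend(I[i:])
    out.extend(J[j:])
    return out


result = refine([(0, 5), (6, 8)], [(3, 7)])
print(result)
```

Only comparisons on the endpoints are used, so the same sketch should work with `datetime` values in place of numbers.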

As you can see, once you've sorted the intervals, this can be handled by a linear scan using a case analysis. Each step outputs at most two intervals and shrinks the combined length of the two lists by at least one, so the total number of intervals in the output will be at most twice the total number of intervals in the input (i.e., $n \le 2(k+m)$).
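If you also want to carry the sensor readings along, as in the sequence $C$ from the question, here is one possible sketch. Note that it uses a slightly different technique from the case analysis: it cuts the time line at every endpoint and then looks up which epoch from each list covers each piece. The function name `align`, the `(value, start, end)` tuple layout, and the use of `None` for "no reading from this sensor" are all assumptions of mine, not something given in the question.

```python
def align(A, B):
    """A, B: lists of (value, start, end), each sorted by start and with
    disjoint epochs. Returns a list of (x, y, start, end) pieces in which
    x or y is None where that sensor has no epoch covering the piece."""
    A, B = list(A), list(B)
    # Every epoch boundary becomes a cut point on the time line
    points = sorted({t for _, s, e in A + B for t in (s, e)})
    out = []
    i = j = 0
    for lo, hi in zip(points, points[1:]):
        # Skip epochs that end at or before the start of this piece
        while i < len(A) and A[i][2] <= lo:
            i += 1
        while j < len(B) and B[j][2] <= lo:
            j += 1
        # An epoch covers the piece exactly when it has already started
        in_a = i < len(A) and A[i][1] <= lo
        in_b = j < len(B) and B[j][1] <= lo
        if in_a or in_b:
            x = A[i][0] if in_a else None
            y = B[j][0] if in_b else None
            out.append((x, y, lo, hi))
    return out


# Mirrors the example in the question: one epoch nested inside the other
aligned = align([("x1", 2, 3)], [("y1", 1, 4)])
print(aligned)
```

After sorting the cut points this is again a single linear pass. Pieces covered by neither list are dropped, which is a design choice you may want to change.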

D.W.