
Alice, a student, has a lot of homework over the next few weeks. Each item of homework takes her exactly one day. Each item also has a deadline and a negative impact on her grades (assume a real number; bonus points for only assuming comparability) if she misses the deadline.

Write a function that, given a list of (deadline, grade impact) pairs, figures out a schedule of which homework to do on which day that minimizes the total negative impact on her grades.

All homework has to be done eventually, but if she misses a deadline for an item, it doesn't matter how late she turns it in.

In an alternative formulation:

ACME corp wants to supply water to customers. They all live along one uphill street. ACME has several wells distributed along the street. Each well bears enough water for one customer. Customers bid different amounts of money to be supplied. The water only flows downhill. Maximize the revenue by choosing which customers to supply.

We can sort the deadlines using bucket sort (or just assume we have already sorted by deadline).

We can solve the problem easily with a greedy algorithm, if we sort by descending grade impact first. That solution will be no better than O(n log n).
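For concreteness, here is a minimal sketch of that greedy (illustration only; the function name and types are mine). It handles items in order of descending impact and gives each one the latest still-free day on or before its deadline; items that find no free day miss theirs. Sorting dominates, so this runs in O(n log n) and only ever compares impacts.

import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.IntSet as IntSet

-- Greedy: assign each item, in order of descending impact, to the latest
-- still-free day on or before its deadline; otherwise it misses its deadline.
-- Returns (scheduled items tagged with their day, missed items).
greedySchedule :: Ord impact => [(Int, impact)] -> ([(Int, (Int, impact))], [(Int, impact)])
greedySchedule tasks = go (sortBy (comparing (Down . snd)) tasks) (IntSet.fromList [1 .. n]) [] []
  where
    n = length tasks
    go [] _ done missed = (reverse done, reverse missed)
    go (task@(deadline, _) : rest) free done missed =
      case IntSet.lookupLE (min n deadline) free of
        Just day -> go rest (IntSet.delete day free) ((day, task) : done) missed
        Nothing  -> go rest free done (task : missed)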

Inspired by the Median of Medians and randomized linear minimum spanning tree algorithms, I suspect that we can solve my simple scheduling / flow problem in (randomized?) linear time as well.

I am looking for:

  • a (potentially randomized) linear time algorithm
  • or alternatively an argument that linear time is not possible

As a stepping stone:

  • I have already proven that just knowing which items can be done before their deadlines is enough to reconstruct the complete schedule in linear time. (That insight underlies the second formulation, where I only ask about the certificate.)
  • A simple (integral!) linear program can model this problem (a sketch is given below).
  • Using duality of this program, one can check a proposed candidate solution for optimality in linear time, if one is also given a solution to the dual program. (Both solutions can be represented in a linear number of bits.)
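For reference, here is one way such a linear program can be written (notation illustrative, not from the original post): with items $i$, deadlines $d_i$, impacts $w_i$, and $x_i$ indicating that item $i$ is done on time, maximising the impact of on-time items is the same as minimising the missed impact.

$$\begin{aligned} \max \quad & \sum_i w_i x_i \\ \text{s.t.} \quad & \sum_{i \,:\, d_i \le t} x_i \le t && \text{for all } t \in \{1, \dots, n\}, \\ & 0 \le x_i \le 1 && \text{for all } i. \end{aligned}$$

Once the items are sorted by deadline, the constraint matrix has consecutive ones in each column, so it is totally unimodular; that is one way to see that the LP has integral optimal solutions.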

Ideally, I want to solve this problem in a model that only uses comparisons between grade impacts, and does not assume the impacts are numbers.

I have two approaches to this problem: one based on treaps keyed by deadline and impact, the other QuickSelect-like, choosing random pivot elements and partitioning the items by impact. Both have worst cases that force O(n log n) or worse performance, but I haven't been able to construct a simple special case that degrades the performance of both.

Matthias

3 Answers


A few things I found out so far.

We can reduce ourselves to solving the following related problem:

{-# LANGUAGE DeriveFunctor #-}
import Data.List (partition)
import Data.Maybe (listToMaybe)

newtype Slot = Slot Int
newtype Schedule a = Schedule [(Slot, [a])] deriving Functor

findSchedule :: Ord a => Schedule a -> Schedule (a, Bool)

I.e. the input data is already sorted and grouped by deadline, but an arbitrary non-negative number of tasks (the Slot count) may be done in each slot. The output just marks each element according to whether it can be scheduled in time or not.

The following function checks whether a schedule given in this format is feasible, i.e. whether all items still in the schedule can be scheduled before their deadlines:

leftOverItems :: Schedule a -> [Int]
leftOverItems (Schedule sch) = scanr op 0 sch where
  op (Slot s, items) itemsCarried = max 0 (length items - s + itemsCarried)

feasible schedule = head (leftOverItems schedule) == 0
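For example (the concrete items and capacities here are made up): two slots with one day of capacity each can hold two items, but not three.

-- >>> feasible (Schedule [(Slot 1, "a"), (Slot 1, "b")])
-- True
-- >>> feasible (Schedule [(Slot 1, "a"), (Slot 1, "bc")])
-- False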

If we have a proposed candidate solution, together with the set of left-out items, we can check in linear time whether the candidate is optimal, or whether any items in the left-out set would improve the solution. We call these light items, in analogy with the terminology of minimum spanning tree algorithms.

carry1 :: Ord a => Schedule a -> [Bound a]
carry1 (Schedule sch) = map (maybe Top Val . listToMaybe) . scanr op [] $ sch where
  op (Slot s, items) acc = remNonMinN s (foldr insertMin acc items)

-- We only care about the number of items, and the minimum item.
-- insertMin inserts an item into a list, keeping the smallest item at the front.
insertMin :: Ord a => a -> [a] -> [a]
insertMin a [] = [a]
insertMin a (b:bs) = min a b : max a b : bs

-- remNonMin removes an item from the list,
-- only picking the minimum at the front, if it's the only element.
remNonMin :: [a] -> [a]
remNonMin [] = []
remNonMin [x] = []
remNonMin (x:y:xs) = x : xs

remNonMinN :: Int -> [a] -> [a]
remNonMinN n l = iterate remNonMin l !! n

data Bound a = Bot | Val a | Top
  deriving (Eq, Ord, Show, Functor)
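
The code above and below also uses a helper items that is not spelled out; its type is forced by its uses in runMin and heavyLight, so here is a minimal definition (assuming it simply drops the Slot counts):

-- Extract the per-slot lists of items, dropping the Slot counts.
items :: Schedule a -> [[a]]
items (Schedule sch) = map snd sch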

-- The curve of minimum reward needed for each deadline to make the cut:
curve :: Ord a => Schedule a -> [Bound a]
curve = zipWith min <$> runMin <*> carry1

-- Same curve extended to infinity (in case the Schedules have a different length)
curve' :: Ord a => Schedule a -> [Bound a]
curve' = ((++) <*> repeat . last) . curve

-- running minimum of items on left:
runMin :: Ord a => Schedule a -> [Bound a]
runMin = scanl1 min . map minWithBound . items . fmap Val

minWithBound :: Ord a => [Bound a] -> Bound a
minWithBound = minimum . (Top:)

-- The pay-off for our efforts, this function uses
-- the candidate solution to classify the left-out items
-- into whether they are definitely _not_ in
-- the optimal schedule (heavy items), or might be in it (light items).
heavyLight :: Ord a => Schedule a -> Schedule a -> ([[a]],[[a]])
heavyLight candidate leftOut =
    unzip . zipWith light1 (curve' candidate) . items $ leftOut
  where
    light1 pivot = partition (\item -> pivot < Val item)

heavyLight not only checks a proposed schedule for optimality, it also gives you a list of items that can improve a non-optimal schedule.

Matthias

Yes, this problem is solvable in linear time.

We will describe an algorithm and sketch a proof of its runtime.

First, let us reduce our concrete problem to a more abstract one:

If there are $n$ items of homework, Alice can use bucket-sort to order them by their deadlines in $O(n)$ time. That's because any item with a deadline further out than $n$ can be treated as having a deadline of $n$. W.l.o.g. that's also the assumption we make from here on.
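A minimal sketch of that bucketing step (the helper name and types are mine, not from the answer): cap each deadline at $n$ and drop every task into the bucket of its capped deadline.

import Data.Array (accumArray, elems)

-- Bucket tasks by deadline in O(n) time; bucket d holds the tasks due on day d.
-- Deadlines beyond n are capped at n; deadlines are assumed to be at least 1.
bucketByDeadline :: Int -> [(Int, w)] -> [[(Int, w)]]
bucketByDeadline n tasks =
  elems (accumArray (flip (:)) [] (1, n) [(min n d, task) | task@(d, _) <- tasks])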

Now let's do the actual reduction to a sequence of heap operations in $O(n)$ time:

  • Initialise an empty sequence of operations $s$.
  • Count our days as $d$ down from $n$ to $1$:
    • For each task $t$ with $\text{deadline}(t) = d$, append $\text{Insert}(t)$ to $s$.
    • Afterwards append $\text{DeleteMax}$ to $s$, regardless of whether the previous step added any items for this day.

We could now run the sequence of operations $s$ on a binary heap, and the items left over in the heap at the end would be the items that Alice won't do before their deadlines. The algorithm also implicitly matches each deleted item with a $\text{DeleteMax}$ operation, which a bit more book-keeping can turn into a schedule. In general, if you know which items are deleted or kept, you can reconstruct a schedule in linear time, even if you don't know the matching. (We assume that $\text{DeleteMax}$ is ignored if the heap is empty.)
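To make the reduction concrete, here is the straightforward $O(n \log n)$ simulation of that backward pass with an exact heap (Data.Set as a stand-in; avoiding exactly this cost is the point of the rest of the answer). It returns the items that do get scheduled; everything left over is missed. As later in the answer, distinct impacts are assumed.

import qualified Data.Set as Set

-- Backward pass over the deadline buckets (bucket d holds the tasks due on day d),
-- using an exact max-heap: one DeleteMax per day schedules the best available task.
scheduledItems :: Ord a => [[a]] -> [a]
scheduledItems buckets = go (reverse buckets) Set.empty []
  where
    go [] _ done = done
    go (bucket : earlier) heap done =
      let heap' = foldr Set.insert heap bucket
      in case Set.maxView heap' of
           Nothing           -> go earlier heap' done
           Just (best, rest) -> go earlier rest (best : done)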

Our hope is that we can forecast the result of these $n$ heap operations in $O(n)$ instead. This is less outlandish than it might seem at first, because executing them one by one is an online algorithm, but we actually know all the operations in advance and only need to learn the final state of the heap. So an offline algorithm might be faster.

Here is where I have to pull the first rabbit out of a hat: soft heaps.

Soft Heaps

Introduction

A soft heap is a data structure that allows us to execute $\text{Insert}$ and $\text{DeleteMax}$ operations in amortised $O(1)$ time. Alas, we don't get that speed for free: as part of the Faustian bargain, the soft heap is allowed to make a certain carefully bounded number of errors.

More specifically, the soft heap sometimes corrupts items. Corruption means an item travels through the heap with a smaller apparent priority (for a max heap) than it originally had; for a min heap it is the other way round. Either way, corruption only ever moves items away from the root. The soft heap respects the heap ordering only with respect to apparent priorities.

The guarantee we get is that the number of corrupted items is bounded by $O(\varepsilon n)$, where $\varepsilon$ is an error parameter of the soft heap and $n$ is the number of items ever inserted into the heap. Crucially, $n$ is not the number of items currently in the soft heap.

The lower $\varepsilon$, the longer $\text{DeleteMax}$ takes. Specifically, the constant factor hiding in $\text{DeleteMax}$'s $O(1)$ bound is $O(\log(1/\varepsilon))$, assuming $\varepsilon$ is picked independently of $n$. So the amortised runtime of $\text{DeleteMax}$ is $O(\log(1/\varepsilon))$, and the amortised runtime of $\text{Insert}$ is still $O(1)$.

Application

Assume our sequence $s$ of operations has $m$ inserts and $k$ max-deletes. W.l.o.g. we can assume that our deletes never occur when the heap is empty. (Otherwise, we can remove them in an $O(n)$ preprocessing pass that does not even look at priorities.) So $k \leq m$ and there are $r := m - k$ items left in the heap at the end (whether that's a soft heap or a normal one).

At this point, the soft heap will have corrupted up to $\varepsilon m$ items, leaving at least $r - \varepsilon m = m - k - \varepsilon m = (1-\varepsilon)m - k$ uncorrupted items in the heap. Because soft heaps only ever corrupt items so that they move away from the root, i.e. become 'better' at staying in the heap, one can show that the uncorrupted items left in the soft heap are a subset of the items that would be left in a normal heap.

We can't say anything about the items that got deleted by the soft heap nor about the items that got corrupted.

Overall, after running the soft heap, we can produce a new, smaller instance of the problem, with $m' := m - ((1-\varepsilon)m - k) = \varepsilon m + k$ inserts and $k' := k$ deletes (we only remove from consideration items that never get deleted).

The catch

So far so good, but even with a small $\varepsilon$, and repeated application of our soft heap pass, the size of the problem won't keep shrinking once the number of remaining inserts becomes close to the number of original deletes.

We seem to be stuck.

Back to basics

Going backwards through our list of days wasn't the only choice available to produce a sequence of operations. We could also have gone forwards through the list of days, and for each day $d$:

  • For each task $t$ with $\text{deadline}(t) = d$, append $\text{Insert}(t)$ to $s$.
  • Afterwards, append enough $\text{DeleteMin}$ operations to cut the size of our heap down to $d$ (if necessary).

Conceptually, the backward pass's heap keeps track of items that might still get scheduled. A $\text{DeleteMax}$ operation picks an item to schedule on that day. The forward pass's heap keeps track of items that might still need to be dropped. A $\text{DeleteMin}$ operation picks an item to drop from the schedule, because there were too many other more urgent items with short deadlines that needed to be scheduled first.
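Here is the matching sketch of the forward pass just described, again with Data.Set standing in for an exact min-heap (illustration only). It returns the items that have to be dropped.

import qualified Data.Set as Set

-- Forward pass over the deadline buckets, using an exact min-heap: after inserting
-- the tasks due on day d, the heap is trimmed down to d items; each DeleteMin drops
-- a task that can no longer be scheduled on time.
droppedItems :: Ord a => [[a]] -> [a]
droppedItems buckets = go (zip [1 ..] buckets) Set.empty []
  where
    go [] _ dropped = dropped
    go ((d, bucket) : later) heap dropped =
      let (heap', dropped') = trim d (foldr Set.insert heap bucket) dropped
      in go later heap' dropped'
    trim d heap dropped
      | Set.size heap <= d = (heap, dropped)
      | otherwise = case Set.minView heap of
          Just (worst, rest) -> trim d rest (worst : dropped)
          Nothing            -> (heap, dropped)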

In the end, both approaches are equivalent and produce complementary left-over sets in their heaps: the items that we have to drop (for the backward pass), and the items that we can schedule in time (for the forward pass). Each item ends up in exactly one of the two left-over sets. (W.l.o.g. we assume that the items have distinct weights, or that ties are broken in a consistent way.)

We described the forward and backward passes in the context of our concrete problem, but more generally for any sequence of $\text{Insert}$ and $\text{DeleteMax}$ (respectively $\text{Insert}$ and $\text{DeleteMin}$) operations, we can convert between the two representations in linear time (without comparing weights).

Salvation

When the backward pass has $m$ inserts and $k$ deletes in some order, the forward pass has the same $m$ inserts (but in reverse order) interspersed with some $m - k$ deletes.

The backward pass removes at least $(1-\varepsilon)m - k$ items from consideration: they will definitely have to be dropped. The forward pass removes at least $k - \varepsilon m$ different items from consideration: they can definitely be scheduled.

Overall $k$ nicely cancels out, and we are left with at most $m':=2\varepsilon m$ inserts after both passes (and around $k':= \varepsilon m$ deletes for both passes, but as long as $0 \leq k \leq m$ we can ignore the exact value.)

The algorithm

The algorithm is now simple:

  • Run both passes on the original problem instance in a combined $O(m)$ time. Take note of the fate of the items we already nailed down.
  • Reduce the problem instance to $m' \leq 2\varepsilon m$ inserts (and around $k' = \varepsilon m$ deletes).
  • Unless $m'$ went to zero, repeat the process with the new instance.

The problem size shrinks to a constant fraction $2\varepsilon$ of its previous size in each iteration, and the cost of each iteration is linear in the size of the remaining instance. Overall, the well-known sum of the geometric series gives a total runtime of $O(n)$ for any $\varepsilon < 1/2$.
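
Spelled out: if one round over an instance with $m$ remaining inserts costs at most $c \cdot m$, the total work over all rounds is bounded by

$$\sum_{i \ge 0} c\, m\, (2\varepsilon)^i = \frac{c\, m}{1 - 2\varepsilon} = O(m) = O(n) \qquad \text{for any fixed } \varepsilon < \tfrac{1}{2}.$$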

Appendix: Nerding out

The backward and forward passes may seem a bit arbitrary, especially their conversion into each other, which we only sketched in the roughest of terms.

They make more sense when viewed from the perspective of matroids and their duals.

Specifically, the following family of matroids, which I will call 'heap matroids' for reasons that will become clear later:

Fix a ground set $G$ of distinct elements, and a sequence of operations to non-deterministically build a set (starting from an empty set):

  • $\text{Insert}(x)$: add $x$ to the set.
  • $\text{DeleteAny}$: remove any element from the set.

All possible outcomes of this process together form the bases of a matroid over the ground set $G$.

Finding a maximum weight basis of this matroid is equivalent to always deleting the minimum element from our set, i.e. running a min-heap. Always deleting the maximum element yields a minimum weight basis.

The family of heap matroids is closed under taking the dual. The conversion between backward and forward passes is the duality transformation. Finding a minimum weight basis in a heap matroid is equivalent to finding a maximum weight basis in the dual matroid.

Soft heaps allow us to identify a subset of the optimum basis.

My 'heap matroids' are also known as 'nested matroids' or 'Schubert matroids'. Alas, as far as I can tell, there are conflicting definitions, and only some definitions of 'nested matroids' and 'Schubert matroids' are equivalent to our 'heap matroids'. That's why I coined a new name.

Matthias

No. This is not a special case of a flow problem solvable in linear time: the complexity is $O(n^2)$, and since sorting by itself already costs $O(n \log n)$, executing the remaining $n$ steps means the complexity definitely wouldn't remain linear.

Jonathan Prieto-Cubides
Sheetal U