Sorted list of counters in constant time

Question

Summary.

The data structure described below maintains a sorted list of counter values for a dynamic set of counters, in constant time per update. I am interested in references that use this structure, and in possible improvements.

Problem and motivation.

Consider a set of counters that may be increased (by 1), decreased (by 1), deleted (when at 0), or created (with value 0). We want to efficiently update the sorted list of counter values when these operations are performed.

This is helpful for many counting tasks in dynamic contexts where we want to have the maximal value or the median value at any time. Typical examples include counting item occurrences within a bounded time window in any stream (one counter per item), the degrees in dynamic graphs (one counter per vertex/node), etc.

Data structure.

values is the array of counter values sorted in decreasing order: it has one cell per counter, values[0] is the largest counter value, values[1] the second largest (that may be equal to the first one), and so on.

c2pos is the dictionary giving the index of counters in values: c2pos[c] gives the index of the value of c, therefore the value of c is values[c2pos[c]].

pos2c is the array that gives the counter corresponding to index i in values.

val2pos is the dictionary giving the smallest index val2pos[v] of a counter with value v in values.

distrib is the dictionary giving the number of counters of value v, for any v that is currently the value of at least one counter.

All structures above are initialized to empty.

All counters are created with initial value 0, and deleted only when their value is 0.
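To make the definitions concrete, here is a small sketch of a consistent state together with a function that checks the relations the five structures are supposed to satisfy. The function name check_invariants and the counter names a, b, c, d are only illustrative; the check itself takes linear time and is meant for testing, not for use inside the constant-time operations.

```python
def check_invariants(values, c2pos, pos2c, val2pos, distrib):
    """Return True if the five structures describe a consistent state."""
    n = len(values)
    if len(pos2c) != n or len(c2pos) != n:
        return False
    # values must be sorted in decreasing order
    if any(values[i] < values[i + 1] for i in range(n - 1)):
        return False
    # c2pos and pos2c must be inverses of each other
    if any(c2pos.get(pos2c[i]) != i for i in range(n)):
        return False
    # distrib and val2pos must match what values actually contains
    expected_distrib, expected_val2pos = {}, {}
    for i, v in enumerate(values):
        expected_distrib[v] = expected_distrib.get(v, 0) + 1
        expected_val2pos.setdefault(v, i)  # keeps the smallest index only
    return distrib == expected_distrib and val2pos == expected_val2pos


# Example state with four counters: a=3, b=1, c=1, d=0
values = [3, 1, 1, 0]
pos2c = ["a", "b", "c", "d"]
c2pos = {"a": 0, "b": 1, "c": 2, "d": 3}
val2pos = {3: 0, 1: 1, 0: 3}
distrib = {3: 1, 1: 2, 0: 1}

print(check_invariants(values, c2pos, pos2c, val2pos, distrib))  # True
```

With this layout, the maximal value is values[0] and the median is values[len(values) // 2], both readable in constant time.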

Algorithm.

Addition of counter c:

if distrib[0] does not exist then set it to 0
if val2pos[0] does not exist then set it to the length of values
increase distrib[0]
set c2pos[c] to the length of values
append c to pos2c
append 0 to values

Removal of counter c:

decrease distrib[0]
if it becomes 0 then remove entry 0 from distrib and val2pos
swap c and the last counter in values
remove entry c from c2pos
remove the last cell of pos2c
remove the last cell of values

Increase counter c:

let v be the value of c
swap c and the first counter in values with value v
if distrib[v+1] does not exist, then set it to 0
if val2pos[v+1] does not exist, then set it to val2pos[v]
increase distrib[v+1]
decrease distrib[v]
increase val2pos[v]
if distrib[v] equals 0 remove entry v from distrib and val2pos
increase values[c2pos[c]]

Decrease counter c:

let v be the value of c
swap c and the last counter in values with value v
if distrib[v-1] does not exist, then set it to 0
if val2pos[v-1] does not exist, then set it to val2pos[v]+distrib[v]
increase distrib[v-1]
decrease distrib[v]
decrease val2pos[v-1]
if distrib[v] equals 0 remove entry v from distrib and val2pos
decrease values[c2pos[c]]
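The four operations can be sketched in Python as follows. This is a minimal sketch, assuming Python lists for the arrays and dicts for the dictionaries; the class name SortedCounters and the helper _swap are hypothetical names introduced here. Note that decrease leaves val2pos[v] unchanged: c leaves the block of value v from its end, so the first index of that block does not move.

```python
class SortedCounters:
    """Sketch of the structure above; values are kept in decreasing order."""

    def __init__(self):
        self.values = []    # counter values, largest first
        self.pos2c = []     # pos2c[i] is the counter stored at index i
        self.c2pos = {}     # counter -> its index in values
        self.val2pos = {}   # value v -> smallest index holding v
        self.distrib = {}   # value v -> number of counters with value v

    def _swap(self, i, j):
        # Only called when values[i] == values[j], so values is untouched.
        ci, cj = self.pos2c[i], self.pos2c[j]
        self.pos2c[i], self.pos2c[j] = cj, ci
        self.c2pos[ci], self.c2pos[cj] = j, i

    def add(self, c):
        n = len(self.values)
        self.distrib[0] = self.distrib.get(0, 0) + 1
        self.val2pos.setdefault(0, n)
        self.c2pos[c] = n
        self.pos2c.append(c)
        self.values.append(0)

    def remove(self, c):
        assert self.values[self.c2pos[c]] == 0, "only zero counters are removed"
        self._swap(self.c2pos[c], len(self.values) - 1)  # c becomes the last cell
        self.distrib[0] -= 1
        if self.distrib[0] == 0:
            del self.distrib[0], self.val2pos[0]
        del self.c2pos[c]
        self.pos2c.pop()
        self.values.pop()

    def increase(self, c):
        v = self.values[self.c2pos[c]]
        first = self.val2pos[v]
        self._swap(self.c2pos[c], first)   # c becomes the first counter with value v
        self.distrib[v + 1] = self.distrib.get(v + 1, 0) + 1
        self.val2pos.setdefault(v + 1, first)
        self.distrib[v] -= 1
        self.val2pos[v] += 1
        if self.distrib[v] == 0:
            del self.distrib[v], self.val2pos[v]
        self.values[first] += 1

    def decrease(self, c):
        v = self.values[self.c2pos[c]]
        assert v > 0, "counters cannot go below 0"
        last = self.val2pos[v] + self.distrib[v] - 1
        self._swap(self.c2pos[c], last)    # c becomes the last counter with value v
        self.distrib[v - 1] = self.distrib.get(v - 1, 0) + 1
        self.val2pos[v - 1] = last         # c is now the first counter with value v-1
        self.distrib[v] -= 1
        if self.distrib[v] == 0:
            del self.distrib[v], self.val2pos[v]
        self.values[last] -= 1


s = SortedCounters()
for name in "abcd":
    s.add(name)
for name in "ccad":
    s.increase(name)
print(s.values)  # [2, 1, 1, 0]
```

Setting val2pos[v-1] directly to the position of c covers both cases of the pseudocode: whether or not the entry existed before, the new first counter with value v-1 is c.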

Complexity.

If dictionaries are implemented as hash tables, then the expected time of dictionary operations is $O(1)$. If counters are never deleted and if their values are bounded, then one may use arrays instead of hash tables, leading to an $O(1)$ worst-case cost for dictionary operations.

If we use dynamic arrays, then array operations are in $O(1)$ amortized time. If we know the total number of counters in advance (or a bound), then the worst-case complexity is $O(1)$.

Space complexity is linear in the number of counters, as the arrays and dictionaries contain one entry per counter, or one entry per counter value at most.

Questions.

Is this a well-known method (maybe folklore)? Where does it appear in the literature?

Is it possible to significantly improve it? For instance, are all the mentioned arrays and dictionaries mandatory? I wanted to avoid distrib, but something of this kind seems necessary to keep the memory cost from growing with the maximal counter value.

rici · Answer 1 · 2022-05-13T03:38:20.760

I've implemented this data structure, or something very similar to it, various times, without using hash tables (or associative mappings) at all. As far as I know, it's a folklore algorithm, although I suppose it must exist somewhere in the literature, and hash tables weren't the first choice in languages which don't have them as a primitive data structure.

(When I say that it doesn't use hash tables, I mean that it doesn't use them internally. It's often convenient to use a hash table to find a counter given some identifier for it, but that's not always necessary. Here's a StackOverflow answer from 2016 and an even older one where I present this data structure; these do assume the use of a hash table to find the counters.)

I usually call it a "comb table" because it consists of tines, all with the same count, hanging off of a backbone of count objects. Each count object holds a count and an indication of the first and last counter which currently has that count. The counters themselves are either stored in an array (in which case they move around as the algorithm progresses, which needs to be tracked in some way), or they're stored in a double-linked list. If they're stored in an array, they will be in descending order so that new counters are born at the end of the array. Each counter has a reference to the count object; they don't hold their current count directly. They might also contain other information, of course.

Since the count has a reference to the first and last counter, and each counter has a reference to its count object as well as a mechanism for finding its predecessor and successor counter, it's not necessary to link the count objects. From a count object you can reach the next larger count by getting the count object of the successor of the last counter in the count object's list, and similarly for the next smaller count. By the same token, the application can find the zero-count object by following the reference from the first counter in the linked list of counters (or the counter at the top of the counter array).

The increment and decrement count operations are slightly more complicated than in your arrangement, but are clearly O(1) and follow basically the same steps. There are two paths, depending on whether the count object is singular (has only a single counter) or multiple. Singular count objects are easy to recognise since their first and last counter references are the same.

To increment or decrement a counter whose count object is singular, we first increment or decrement the count object's count. We then check whether the new count is the same as that of the next count object (or the previous one, for decrement). If that's the case, the count object is deleted; the counter is repointed at the successor count object, whose first (or last) counter reference is updated.

For a counter whose count object is multiple, we first move the counter to the appropriate end of the chain of counters with the same count object; if we're using a linked list, that involves four link modifications; if we're using an array, it's a simple swap, although we might also have to update whatever data structure is being used to find the counter. We then remove it from the chain for its current count object by modifying the last (first) counter reference of that count object. Finally, we check whether the next (previous) counter's count object has the consecutive count. If so, we only need to modify the first (last) counter reference of that count object. Otherwise, we need to create a new count object and set both of its counter references to the counter whose value is being modified.
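As an illustration of the two paths, here is a rough Python sketch of the doubly-linked-list variant, showing only creation and increment (decrement is symmetric). It assumes the list is kept in descending order from the head; the names CombTable, Count, Counter and the helper methods are made up for this sketch and are not taken from any particular implementation.

```python
class Count:
    """One tine of the comb: a count shared by a chain of counters."""
    def __init__(self, value, first, last):
        self.value, self.first, self.last = value, first, last


class Counter:
    def __init__(self, name):
        self.name = name
        self.prev = self.next = None  # neighbours in the descending list
        self.count = None             # shared Count object, not a number


class CombTable:
    def __init__(self):
        self.head = self.tail = None  # head holds the largest count

    def add(self, name):
        """Create a counter with count 0 at the tail of the list."""
        x, t = Counter(name), self.tail
        if t is not None and t.count.value == 0:
            x.count = t.count         # join the existing zero chain
            x.count.last = x
        else:
            x.count = Count(0, x, x)
        x.prev = t
        if t is None:
            self.head = x
        else:
            t.next = x
        self.tail = x
        return x

    def _unlink(self, x):
        if x.prev is None:
            self.head = x.next
        else:
            x.prev.next = x.next
        if x.next is None:
            self.tail = x.prev
        else:
            x.next.prev = x.prev

    def _insert_before(self, x, y):
        x.prev, x.next = y.prev, y
        if y.prev is None:
            self.head = x
        else:
            y.prev.next = x
        y.prev = x

    def increment(self, x):
        k = x.count
        new_value = k.value + 1
        if k.first is k.last:
            k.value = new_value       # singular: the count object moves with x
        else:
            if x is k.first:
                k.first = x.next      # x simply leaves the front of the chain
            else:
                if x is k.last:
                    k.last = x.prev
                self._unlink(x)       # move x in front of its old chain
                self._insert_before(x, k.first)
            x.count = None            # x no longer belongs to k
        p = x.prev
        if p is not None and p.count.value == new_value:
            x.count = p.count         # join the chain of the next larger count
            p.count.last = x
        elif x.count is None:
            x.count = Count(new_value, x, x)

    def items(self):
        x = self.head
        while x is not None:
            yield x.name, x.count.value
            x = x.next


t = CombTable()
a, b, c = (t.add(n) for n in "abc")
for x in (c, b, a, c):
    t.increment(x)
print(list(t.items()))  # [('c', 2), ('b', 1), ('a', 1)]
```

In the singular case that merges with its predecessor, the old count object is just dropped here and left to the garbage collector; a version with manual allocation would return it to a free list instead.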

Count objects only exist for counts which have at least one counter, and there is only one count object per count value. So there are never more count objects than counters (and usually a lot fewer). It's possible to allocate counters and count objects together, even as a single memory allocation (although there's no relationship between the two parts of the allocation); since there are usually fewer count objects in use, that means we'll need to thread a free-list through the unused count objects.

None of this is really limited to the case where increment and decrement are always by one. It's possible to handle problem domains with bounded increment and decrement, by just tracking through the successive counts until you find the right one, or the right place to insert one. That's still O(1), although you'd probably want the bound to be small. I've done this for increment bounds of two or three, and it's still competitive.

I don't know if that answers any question you had, but it's maybe interesting.

score 1 · Answer 2 · answered Oct 02 '24 at 08:09

A similar data structure is used in Fast algorithms for determining (generalized) core groups in social networks by Batagelj and Zaveršnik, published in Advances in Data Analysis and Classification in 2010.

The method is also described in their 2003 (unpublished?) preprint An O(m) Algorithm for Cores Decomposition of Networks.
