
During an interview for a Java developer position, I was asked the following:

Write a function that takes two params:

  1. a String representing a text document and
  2. an integer providing the number of items to return.

Implement the function such that it returns a list of Strings ordered by word frequency, the most frequently occurring word first. Your solution should run in $O(n)$ time where $n$ is the number of characters in the document.

The following is what I answered (in pseudocode). It is not $O(n)$ but rather $O(n \log n)$ because of the sort. I cannot figure out how to do it in $O(n)$ time.

Map<String, Integer> wordFrequencyMap = new HashMap<>();
String[] words = inputString.split("\\s+");

for (String word : words) {
  Integer count = wordFrequencyMap.get(word);
  wordFrequencyMap.put(word, count == null ? 1 : count + 1);
}

// Sorting the map entries by value is the O(n log n) step.
List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordFrequencyMap.entrySet());
entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

List<String> result = new ArrayList<>();
for (Map.Entry<String, Integer> entry : entries) {
  result.add(entry.getKey());
}
return result;

Does someone know how, or can someone give me some hints?

D.W.
user2712937

5 Answers


I suggest a variation of distribution counting:

  1. Read the text and insert all the words encountered into a trie, maintaining in each node a count of how often the word represented by that node has occurred. Additionally, keep track of the highest word count, say maxWordCount. -- $O(n)$
  2. Initialize an array of size maxWordCount, whose entries are lists of strings. -- $O(n)$, since the count can't be higher.
  3. Traverse the trie and for each node add the corresponding string to the array entry indicated by the count. -- $O(n)$, since the total length of strings is bounded by $n$.
  4. Traverse the array in descending order and output the desired number of strings. -- $O(n)$, since that is a bound on both the size of and the amount of data in the array.

You can probably replace the trie with other data structures in the first phase; a sketch of the whole procedure follows.
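
In Java, a minimal sketch of the four steps could look as follows; it assumes whitespace-separated words, and for brevity the trie stores its children in a HashMap (an array indexed by character would preserve the constant-alphabet bound). Class and method names are only illustrative.

import java.util.*;

class WordFrequency {
    static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        int count = 0;  // how often the word ending at this node occurred
    }

    static List<String> topWords(String document, int k) {
        // Step 1: insert every word into the trie and track the highest count.
        TrieNode root = new TrieNode();
        int maxWordCount = 0;
        for (String word : document.split("\\s+")) {
            if (word.isEmpty()) continue;
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, x -> new TrieNode());
            }
            node.count++;
            maxWordCount = Math.max(maxWordCount, node.count);
        }

        // Step 2: one bucket per possible count value.
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= maxWordCount; i++) buckets.add(new ArrayList<>());

        // Step 3: traverse the trie and put each word into the bucket for its count.
        fillBuckets(root, new StringBuilder(), buckets);

        // Step 4: walk the buckets from the highest count down.
        List<String> result = new ArrayList<>();
        for (int c = maxWordCount; c >= 1 && result.size() < k; c--) {
            for (String w : buckets.get(c)) {
                if (result.size() == k) break;
                result.add(w);
            }
        }
        return result;
    }

    static void fillBuckets(TrieNode node, StringBuilder prefix, List<List<String>> buckets) {
        if (node.count > 0) buckets.get(node.count).add(prefix.toString());
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            fillBuckets(e.getValue(), prefix, buckets);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }
}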

Vasif
FrankW

Your algorithm does not even run in time $O(n \log n)$; inserting $\Theta(n)$ items into a hash table already costs $\Omega(n^2)$ time in the worst case.


What follows is wrong; I'm leaving it here for the time being for illustrative purposes.

The following algorithm runs in worst-case time $O(n)$ (assuming an alphabet $\Sigma$ of constant size), $n$ the number of characters in the text.

  1. Construct a suffix tree of the text, e.g. with Ukkonen's algorithm.

    If the construction does not already do this, add the number of reachable leaves to every (inner) node.

  2. Traverse the tree from the root and cut off all branches at the first (white)space.

  3. Traverse the tree and sort the list of children of every node by their leaf counts.

  4. The yield of the tree (leaves from left to right) is now a list of all words, sorted by frequency.

Regarding runtime:

  1. Ukkonen's algorithm (in its enhanced form) runs in time $O(n)$; maintaining leaf counts does not increase the $\Theta$-cost of the algorithm.
  2. We have to traverse one node per character of every word that occurs in the text. Since there are at most $n$ different word-character pairs, we visit at most $n$ nodes.
  3. We visit at most $n$ nodes (cf 2.) and spend time $O(|\Sigma| \cdot \log |\Sigma|) = O(1)$ per node.
  4. We can obtain the yield (which has of course size $O(n)$) by a simple traversal in time $O(n)$ (cf 2.).

More precise bounds can be obtained by parametrising the runtime with the number of distinct words; if there are few, the tree is small after step 2.

Raphael

Gathering the occurrence counts is $O(n)$, so the trick is really only finding the top $k$ occurrence counts.

A heap is a common way to aggregate the top k values, although other methods can be used (see https://en.wikipedia.org/wiki/Partial_sorting).

Assuming $k$ is the second parameter above, and that it's a constant in the problem statement (it appears to be):

  1. Build a trie of words with occurrence counts on each node.
  2. Initialize a heap of size k.
  3. Traverse the trie and min-probe/insert each (leaf, occurrence-count) pair in the top-k heap.
  4. Output the top k leaves and counts (this is actually kind of a pain because you need parent pointers to map each leaf back to a word).

Since the heap size is a constant, the heap operations are $O(1)$, so step 3 is $O(n)$.

The heap could also be maintained dynamically while building the trie.
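
A hedged Java sketch of this approach, using a HashMap of occurrence counts in place of the trie for brevity (which also sidesteps the leaf-to-word mapping mentioned in step 4); the names are illustrative:

import java.util.*;

class TopK {
    static List<String> topK(String document, int k) {
        // Occurrence counts (a HashMap here instead of the trie, for brevity).
        Map<String, Integer> counts = new HashMap<>();
        for (String w : document.split("\\s+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }

        // Min-heap holding at most k entries, ordered by count.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();   // evict the current minimum
        }

        // Drain the heap; reverse so the most frequent word comes first.
        LinkedList<String> result = new LinkedList<>();
        while (!heap.isEmpty()) result.addFirst(heap.poll().getKey());
        return result;
    }
}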

KWillets

Use a hash table (e.g., HashMap) to collect all words and their frequencies. Then use counting sort to sort the words in order of decreasing frequency. Since all frequencies are integers in the range $1..n$, the counting sort takes $O(n)$ time. The total expected running time is $O(n)$, which is most likely more than sufficient for all practical purposes (unless the interviewer mentioned something that was left out of your question). Make sure to mention that this is expected running time rather than worst-case running time.
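
A minimal Java sketch of this approach, assuming whitespace-separated words; the names are illustrative and input validation is omitted:

import java.util.*;

class FrequentWords {
    static List<String> topKWords(String document, int k) {
        // Expected O(n): count each word with a hash table.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : document.split("\\s+")) {
            if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
        }

        // Counting sort on the frequencies: bucket f holds the words occurring exactly f times.
        int maxFreq = 0;
        for (int f : freq.values()) maxFreq = Math.max(maxFreq, f);
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= maxFreq; i++) buckets.add(new ArrayList<>());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            buckets.get(e.getValue()).add(e.getKey());
        }

        // Read the buckets from the highest frequency down until k words are collected.
        List<String> result = new ArrayList<>();
        for (int f = maxFreq; f >= 1 && result.size() < k; f--) {
            for (String w : buckets.get(f)) {
                if (result.size() == k) break;
                result.add(w);
            }
        }
        return result;
    }
}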

This might not be the answer that a teacher would be looking for in an algorithms class, because it is expected $O(n)$ running time rather than $O(n)$ worst-case running time. If you want to score extra points in the interview, you can mention casually, in an off-hand manner, that of course this is expected running time, but it can also be done in $O(n)$ worst-case running time by replacing the hash table with a more sophisticated data structure -- and you'd be happy to elaborate on how you'd choose between algorithms in a situation like this.

Or, if you want to play it a bit safer, before giving an answer, first ask "do you care about the difference between expected $O(n)$ running time and worst-case $O(n)$ running time?". Then tailor your answer accordingly. Be prepared for the interviewer to ask you how you would choose, in practice. (If so, score! That's a question you should be able to hit out of the ballpark.)

D.W.

Hashtable-based solution

I'm not sure why a hashtable makes the complexity $\Omega(n^2)$ if $n$ is the number of characters (not words).

If you iterate through every character in the document and, as you are iterating, compute the hash code of the current word, you will have gone through $n$ characters. That is, as soon as a letter is encountered, a word begins, so compute its hash until the word ends (there are some special cases for punctuation, but those do not affect the complexity). For every word, once the hash is computed, add it to a hashtable. This avoids going over every word twice, i.e., first iterating through the document to find the words and then inserting them into a hashtable, although the complexity in that case could also be $\Omega(n)$.

Collisions in the hashtable are surely a problem, and depending on how big the original hashtable is and how good the hashing algorithm is, one could get close to $O(1)$ for insertions and count updates, and thus $O(n)$ for the algorithm, although at the expense of memory. However, I still cannot see how the worst case can be asserted to be $O(n^2)$ if $n$ is the number of characters.

The assumption is that the hashing algorithm takes time linear in the number of characters; a sketch of the single pass is below.
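
A sketch of that single pass in Java; here the per-word hashing is simply delegated to HashMap and String.hashCode (itself linear in the word length), and Character.isLetter stands in for the word-boundary rules, so treat it as illustrative only:

import java.util.*;

class SinglePassCount {
    static Map<String, Integer> countWords(String document) {
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i <= document.length(); i++) {
            // A sentinel space after the last character flushes the final word.
            char c = (i < document.length()) ? document.charAt(i) : ' ';
            if (Character.isLetter(c)) {
                word.append(c);                                  // still inside a word
            } else if (word.length() > 0) {
                counts.merge(word.toString(), 1, Integer::sum);  // word ends here
                word.setLength(0);
            }
        }
        return counts;
    }
}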

Radix-sort-based solution

Alternatively, assuming English, where the maximum word length is well known, I would instead create a grid and apply radix sort, which is $O(kN)$, where $k$ is the maximum length of a word in the English language and $N$ is the total number of words. Given that $n$ is the number of characters in the document and $k$ is a constant, this asymptotically amounts to $O(n)$.
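
An illustrative LSD radix sort over words capped at a fixed maximum length, assuming an 8-bit alphabet; it performs one stable counting-sort pass per character position, giving the $O(kN)$ bound:

import java.util.*;

class WordRadixSort {
    // Sorts words lexicographically, treating positions past a word's end as a
    // sentinel that orders before every real character.
    static void lsdSort(String[] words, int maxLen) {
        final int R = 256;                               // assumed alphabet size
        String[] aux = new String[words.length];
        for (int d = maxLen - 1; d >= 0; d--) {          // least significant position first
            int[] count = new int[R + 2];
            for (String w : words) count[charAt(w, d) + 2]++;
            for (int r = 0; r <= R; r++) count[r + 1] += count[r];
            for (String w : words) aux[count[charAt(w, d) + 1]++] = w;   // stable placement
            System.arraycopy(aux, 0, words, 0, words.length);
        }
    }

    // -1 stands for "past the end of the word".
    static int charAt(String w, int d) {
        return d < w.length() ? w.charAt(d) : -1;
    }
}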

Now count the frequency of each word. Since the words are sorted, we compare each word to its preceding word to see whether it is the same or different. If it is the same, we remove the word and add one to the previous word's count; if different, we just set the count to 1 and move on. This requires at most $2n$ comparisons, where $n$ is the number of characters, and is thus $O(n)$ as a whole.
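
The counting pass over the sorted array might look like this (a sketch; the pair type and method name are arbitrary):

import java.util.*;

class RunCounter {
    // words must already be sorted; returns (word, count) pairs in sorted word order.
    static List<Map.Entry<String, Integer>> countRuns(String[] words) {
        List<Map.Entry<String, Integer>> runs = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int j = i;
            while (j < words.length && words[j].equals(words[i])) j++;  // extend the run
            runs.add(new AbstractMap.SimpleEntry<>(words[i], j - i));
            i = j;
        }
        return runs;
    }
}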

The top few longest words in English are ridiculously long, but one could cap the word length at a reasonable number (such as 30 or smaller) and truncate longer words, accepting the margin of error that might come with it.

Omer Iqbal