Need help designing a more efficient search algorithm

Question

My problem comes from biology. Right now I have 4 VERY LARGE files (each with 0.1 billion lines), but the structure is rather simple: each line of these files has only 2 fields, and both fields stand for a type of gene.

My goal is to design an efficient algorithm that finds a circle within the contents of these 4 files. The circle is defined as:

field #1 in a line in file 1 == field #1 in a line in file 2 and
field #2 in a line in file 2 == field #1 in a line in file 3 and
field #2 in a line in file 3 == field #1 in a line in file 4 and
field #2 in a line in file 4 == field #2 in a line in file 1

I cannot think of a decent way to solve this, so for now I just wrote a stupid brute-force 4-layer nested loop. I'm thinking about sorting the files in alphabetical order, which might help a little, but the computer's memory obviously would not allow me to load everything at once. Can anybody tell me a good way to solve this problem that is efficient in both time and space? Thanks!!

What distribution do the fields have? Are they unique? Or can you have multiple lines with the same set of fields? Or to put this another way, does field 1 in file 1 match field 1 in multiple other lines? — David Harris, Sep 09 '11 at 04:39

score 1 · Accepted Answer · answered Sep 09 '11 at 04:40

First of all, I note that you can sort a file without holding it in memory all at once, and that most operating systems have some program that does this, often called just "sort". Usually you can get it to sort on a field within a file, but if not you can rewrite each line to get it to sort the way you want.
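For illustration only, here is a minimal Python sketch of the same idea as an external merge sort: sort fixed-size chunks in memory, write each chunk to a temporary run file, then merge the runs. The function name `external_sort`, the tab separator and the `chunk_lines` default are assumptions made for this example, not part of the answer. With very many runs the number of simultaneously open files may hit an operating system limit, so in practice the system "sort" program is likely the better choice.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(in_path, out_path, key_field=0, chunk_lines=1_000_000):
    """Sort a tab-separated text file on one field using bounded memory.

    Hypothetical helper: tab separator and chunk size are assumptions.
    """
    def key(line):
        return line.rstrip("\n").split("\t")[key_field]

    run_paths = []
    with open(in_path) as src:
        while True:
            # Read at most chunk_lines lines; only this chunk is in memory.
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            # Make sure every line ends with a newline before it is rewritten.
            chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)

    runs = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as out:
            # heapq.merge streams the sorted runs without loading them fully.
            out.writelines(heapq.merge(*runs, key=key))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```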

Given this, you can connect two files by sorting them so that the first is sorted on field #1 and the second on field #2. You can then create one record for each match, combining all the fields, and only holding in memory a chunk from each file where all the fields you have sorted on have the same value. This will allow you to connect the result with another file - four such connections should solve your problem.
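The connection step described above is essentially a sort-merge join. A rough Python sketch follows, assuming the records are tuples that arrive already sorted on their join fields. The names `merge_join` and `find_cycles_by_joins` are made up for this example, and `sorted()` stands in for the on-disk sort that would be needed between stages with real data.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, left_key, right_key):
    """Join two record streams that are already sorted on their join keys.

    Yields left_record + right_record for each matching pair. Only the
    current group of equal keys from the right side is held in memory.
    """
    left_groups = groupby(left, key=left_key)
    right_groups = groupby(right, key=right_key)
    lk, lg = next(left_groups, (None, None))
    rk, rg = next(right_groups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(left_groups, (None, None))
        elif lk > rk:
            rk, rg = next(right_groups, (None, None))
        else:
            right_rows = list(rg)
            for l in lg:
                for r in right_rows:
                    yield l + r
            lk, lg = next(left_groups, (None, None))
            rk, rg = next(right_groups, (None, None))


def find_cycles_by_joins(f1, f2, f3, f4):
    """Chain three joins; sorted() is a stand-in for an external sort."""
    # (a, b) + (b, c): file 2 field #2 == file 3 field #1
    s = merge_join(sorted(f2, key=itemgetter(1)), sorted(f3),
                   itemgetter(1), itemgetter(0))
    # + (c, d): file 3 field #2 == file 4 field #1
    s = merge_join(sorted(s, key=itemgetter(3)), sorted(f4),
                   itemgetter(3), itemgetter(0))
    # + (a, d): file 1 must match on both of its fields to close the circle
    s = merge_join(sorted(s, key=itemgetter(0, 5)), sorted(f1),
                   itemgetter(0, 5), itemgetter(0, 1))
    return [(r[0], r[1], r[3], r[5]) for r in s]
```

In this sketch the last join uses a two-field key, because a line of file 1 has to agree with both ends of the chain built from files 2, 3 and 4.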

Depending on your data, the time it takes to solve your problem may depend on the order in which you make the connections. One rather naive way to make use of this is, at each stage, to take a small random sample from each file, use this to estimate how many results will follow from each possible connection, and choose the connection that produces the fewest results. One way to take a random sample of N items from a large file is to take the first N lines in the file and then, when you have read in m lines so far, read the next line and, with probability N/(m + 1), replace a randomly chosen one of the N held lines with it; otherwise throw it away. Keep on until you have read through the whole file.
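The sampling procedure in the last two sentences is commonly known as reservoir sampling. A small Python sketch, where the function name `reservoir_sample` and the optional `rng` parameter are assumptions for this example:

```python
import random


def reservoir_sample(lines, n, rng=random):
    """Return a uniform random sample of up to n items from an iterable.

    Reads the input once and keeps at most n items in memory.
    """
    sample = []
    for m, line in enumerate(lines):  # m = number of lines read before this one
        if m < n:
            # The first n lines fill the reservoir.
            sample.append(line)
        elif rng.random() < n / (m + 1):
            # Keep the new line with probability n / (m + 1),
            # evicting a randomly chosen held line.
            sample[rng.randrange(n)] = line
    return sample
```

Passing an open file object as `lines` samples whole lines without loading the file.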

score 0 · Answer 2 · edited May 23 '17 at 11:51

Here is one algorithm:

Select an appropriate lookup structure: if field #1 is an integer, use bit fields; if it is a string, use a dictionary (or a set). Use one lookup structure per file, i.e. 4 in your case.
Initialization phase: for each file, parse the file line by line and either set the appropriate bit in the bit field or add the field to the dictionary, in the lookup structure for that file.
After initializing the lookup structures, check the condition in your question.

The complexity of this depends on the lookup structure implementation. For bit fields, a lookup is O(1); for a set or dictionary it is O(lg(n)) when implemented as a balanced search tree (hash-based implementations average O(1)). The complete complexity will be O(n) or O(n lg(n)). Your solution in the question has a complexity of O(n^4).
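One possible reading of this approach, sketched in Python with dictionaries of sets. This is an interpretation, not the code the answer links to: the names `build_index` and `find_cycles_indexed` are invented here, and the sketch assumes the indexes for files 2, 3 and 4 fit in memory, which may not hold for files of the size described in the question.

```python
from collections import defaultdict


def build_index(pairs):
    """Map each first field to the set of second fields seen with it."""
    index = defaultdict(set)
    for first, second in pairs:
        index[first].add(second)
    return index


def find_cycles_indexed(file1, file2, file3, file4):
    """Yield (a, b, c, d) tuples that satisfy the circle condition.

    file1 is streamed; files 2-4 are indexed in memory (an assumption).
    """
    idx2 = build_index(file2)
    idx3 = build_index(file3)
    idx4 = build_index(file4)
    for a, d in file1:
        # file 1 field #1 == file 2 field #1
        for b in idx2.get(a, ()):
            # file 2 field #2 == file 3 field #1
            for c in idx3.get(b, ()):
                # file 3 field #2 == file 4 field #1,
                # and file 4 field #2 == file 1 field #2
                if d in idx4.get(c, ()):
                    yield (a, b, c, d)
```

Because the indexes store sets, duplicate lines in files 2, 3 and 4 are collapsed in this sketch.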

You can get the code and solution for bit fields from here

HTH

score 0 · Answer 3 · grdvnl · 2011-09-09T04:59:21.410

Here is one approach:

We will use the notation Fxy, where x = field number and y = file number.

Sort each of the 4 files on its first field.

For each field F11, find a match in file 2. This will be linear. Save these matches with all four fields to a new file. Now use the corresponding field in this new file to get all the matches from file 3. Continue with file 4 and then back to file 1.

In this way, as you progress to each new file, you are dealing with fewer lines. And since you have sorted the files, the search is linear and can be done by reading from disk.

Here the complexity is O(n log n) for sorting and O(m log n) for lookup, assuming m << n.
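To illustrate the O(log n) lookup into sorted data, here is a small Python sketch using the standard `bisect` module. An in-memory list of tuples stands in for a sorted file, and the function name `matches` is an assumption for this example.

```python
from bisect import bisect_left


def matches(sorted_pairs, value):
    """Return all pairs whose first field equals value.

    sorted_pairs must be a list of tuples sorted on the first field.
    The binary search costs O(log n); the scan costs O(number of matches).
    """
    # (value,) sorts before every (value, x), so this finds the first match.
    i = bisect_left(sorted_pairs, (value,))
    found = []
    while i < len(sorted_pairs) and sorted_pairs[i][0] == value:
        found.append(sorted_pairs[i])
        i += 1
    return found
```

Calling `matches` once per line of the smaller file gives the O(m log n) lookup cost mentioned above.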

edited Sep 09 '11 at 04:59

answered Sep 09 '11 at 04:46

grdvnl


Searching a sorted array/file is *not* linear. The time for a single search (via a binary search) on a file with `N` elements is `O(log(N))`. The complexity for `M` searches is therefore `O(M*log(N))`, not `O(M)`. – Darren Engwirda Sep 09 '11 at 04:51
@Darren, fixed the response. I mixed up the complexity with comparing 2 sorted lists. – grdvnl Sep 09 '11 at 04:58

score 0 · Answer 4 · answered Sep 09 '11 at 05:27

It's a bit easier to explain if your File 1 is the other way around (so each second element points to a first element in the next file).

1. Start with File 1; copy it to a new file, writing each A, B pair as B, A, 'REV'.
2. Append the contents of File 2 to it, writing each A, B pair as A, B, 'FWD'.
3. Sort the file.
4. Process the file in chunks with the same initial value:
   - Within that chunk, group the lines into REVs and FWDs.
   - Take the cartesian product of the REVs and the FWDs (nested loop).
   - Write a line with reverse(fwd) concat (rev), excluding the repeated token,
     e.g. B, A, 'REV' and B, C, 'FWD' -> C, B, A, 'REV'.
5. Append the next file to this new output file (adding 'FWD' to each line).
6. Repeat from step 3.

In essence you are building up a chain in reverse order and using a file-based sort algorithm to put sequences together that can be combined.
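One round of this procedure (tag, sort, group by first token, cross the REVs with the FWDs) might look roughly like the Python sketch below. It works on in-memory tuples, with `list.sort()` standing in for the file-based sort, and the function name `combine` is invented for this example.

```python
from itertools import groupby, product


def combine(rev_chains, fwd_pairs):
    """Extend each reversed chain by one token taken from fwd_pairs.

    rev_chains: tuples already written in reversed (B, A, ...) order.
    fwd_pairs:  (A, B) tuples from the next file.
    """
    tagged = [tuple(chain) + ("REV",) for chain in rev_chains]
    tagged += [tuple(pair) + ("FWD",) for pair in fwd_pairs]
    tagged.sort()  # stand-in for sorting the combined file on disk
    out = []
    # Each chunk holds the records that share the same initial value.
    for _, chunk in groupby(tagged, key=lambda rec: rec[0]):
        chunk = list(chunk)
        revs = [r[:-1] for r in chunk if r[-1] == "REV"]
        fwds = [f[:-1] for f in chunk if f[-1] == "FWD"]
        for rev, fwd in product(revs, fwds):
            # reverse(fwd) + rev, dropping the repeated shared token
            out.append(fwd[::-1] + rev[1:])
    return out
```

Applying `combine` once per remaining file extends every chain by one token; after the last file, chains whose first and last tokens are equal are closed circles.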

Of course it would be even easier to just read these files into a database and let it do the work ...
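As a rough illustration of the database route, here is a sketch using Python's built-in `sqlite3` module. The table names `f1` to `f4`, the column names `a` and `b`, and the function name `find_cycles_sql` are all assumptions for this example. For files of the size in the question, a file-backed database path would replace `":memory:"`, and loading time would need to be measured.

```python
import sqlite3


def find_cycles_sql(file1, file2, file3, file4, db_path=":memory:"):
    """Load four iterables of (field1, field2) rows and join them in SQL."""
    con = sqlite3.connect(db_path)
    tables = zip(("f1", "f2", "f3", "f4"), (file1, file2, file3, file4))
    for name, rows in tables:
        con.execute(f"CREATE TABLE {name} (a TEXT, b TEXT)")
        con.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
        # An index on the first field lets the engine avoid full scans.
        con.execute(f"CREATE INDEX {name}_a ON {name} (a)")
    query = """
        SELECT f1.a, f2.b, f3.b, f4.b
        FROM f1
        JOIN f2 ON f2.a = f1.a
        JOIN f3 ON f3.a = f2.b
        JOIN f4 ON f4.a = f3.b AND f4.b = f1.b
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows
```

The four join conditions mirror the four lines of the circle definition in the question, one per file pair.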
