
I have a very large unsorted file, 1000GB, of ID pairs

  1. ID:ABC123 ID:ABC124
  2. ID:ABC123 ID:ABC124
  3. ID:ABC123 ID:ABA122
  4. ID:ABC124 ID:ABC123
  5. ID:ABC124 ID:ABC126

I would like to filter the file for

1) duplicates

example
ABC123 ABC124
ABC123 ABC124

2) reverse pairs (discard the second occurrence)

example
ABC123 ABC124
ABC124 ABC123

After filtering, the example file above would look like

  1. ID:ABC123 ID:ABC124
  2. ID:ABC123 ID:ABA122
  3. ID:ABC124 ID:ABC126

Currently, my solution is this

import sys

seen = set()                             # the in-memory hash of pairs already printed

for line in sys.stdin:
    id1, id2 = line.split()              # e.g. "ID:ABC123 ID:ABC124"
    # skip the line if we already have this pair, in either order
    if (id1, id2) in seen or (id2, id1) in seen:
        continue
    seen.add((id1, id2))
    sys.stdout.write(line)

which gives me the desired results for smaller lists, but takes up too much memory for larger lists, as I am storing the hash in memory.

I am looking for a solution that will take less memory to implement. Some thoughts I have are

1) save the hash to a file, instead of memory

2) multiple passes over the file

3) sorting and uniquing the file with Unix sort -u -k1,2, but that takes many hours and it doesn't reduce the file that much. So even if I run that sort, most of the file size is retained, and when the hash is searched for ID2ID1 I still have to load a huge file into memory.

2 Answers


You have a dataset with at least billions of entries, whose size is that of a medium-sized hard disk. This is large. If something takes only “many hours” and not many months, that's as good as you can expect.

Sorting the file is a viable strategy to get rid of the duplicates. You'll also need to canonicalize the pairs. If you stick with a flat file, the most likely strategy is to do a pass of canonicalization where you define an order relation on the IDs and swap the pairs if the larger ID comes first. For the canonicalization step, you can use awk or a custom C program. For the sorting step, use sort, which is (at least on Linux) capable of doing external sorts and is likely to be faster than what you'd come up with. Do this in a C locale to make processing faster (comparisons based on bytes rather than multibyte characters).
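
A minimal sketch of such a canonicalization pass, using Python rather than awk and assuming the whitespace-separated format shown in the question; its output would then go through LC_ALL=C sort -u for the external sort and deduplication:

import sys

# Canonicalize each pair: emit the two IDs in a fixed order so that
# "A B" and "B A" become the same line; a later sort -u then removes
# both exact duplicates and reverse pairs in one go.
for line in sys.stdin:
    fields = line.split()
    if len(fields) != 2:
        continue                     # skip blank or malformed lines
    a, b = fields
    if a > b:
        a, b = b, a
    print(a, b)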

If you want to do lookups, a terabyte-sized flat file is not the right format. This is large even for a database. On the topic of databases with billions of records, read Can MySQL reasonably perform queries on billions of rows? and Which database could handle storage of billions/trillions of records?. A database may or may not be adequate for your task; that depends on whether the pairs are all you have or there's other data to consider, and on whether there's structure to the IDs. I think a database would be the right tool if you want to support queries like “list the pairs that contain this ID value”. You should ask database professionals, rather than computer science professionals, whether and which database would be suitable; be sure to explain the structure of your data and what kind of queries you'll want to make. If you decide to use a database, there's no need to remove duplicates or canonicalize entries before importing them; that can happen on the fly.

If you determine that you aren't going to benefit from a database, then you should organize your data in a structured way. The best structure depends on the density of the data, i.e. how many of the potential pairs are present, how many of the potential IDs are present, the distribution of IDs, etc. Here's an example of the kind of structure that might work:

  • A toplevel directory determined by the first two characters of the first ID of the pair.
  • A second-level directory determined by the next two characters of the first ID.
  • In these directories, a file per first ID, named with the first ID and containing the list of second IDs.

This is just an example; you need to find a partitioning that works for your data (one possible mapping from ID to path is sketched below). With typical Linux filesystems, aim for about 100–1000 entries per directory (i.e. 100–1000 toplevel directories, 100–1000 subdirectories in each directory, and 100–1000 files in the deepest directories).
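
As an illustration only, here is a Python sketch of mapping a canonicalized pair's first ID to such a path; the "pairs" root and the two-character splits are assumptions to be tuned, not a recommendation:

from pathlib import Path

# Map the first ID of a canonicalized pair to the file that holds its list
# of second IDs, e.g. ABC123 -> pairs/AB/C1/ABC123.
def path_for(first_id: str, root: str = "pairs") -> Path:
    return Path(root) / first_id[:2] / first_id[2:4] / first_id

def add_pair(first_id: str, second_id: str) -> None:
    p = path_for(first_id)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a") as f:
        f.write(second_id + "\n")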

There's no need to remove duplicates before moving the data into a structured file set like this. You may want to do a pass of canonicalization first (or do it on the fly).

If you're going to make a lot of queries and you expect many of them to come up negative, build a Bloom filter. Define a fast hash function on pairs whose output is $N$ bits, and write an array of $2^N$ bits to a file where the $n$th bit indicates whether there exists a pair whose hash is $n$.
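
A rough Python sketch of that filter; the hash function and the choice of $N$ are arbitrary here, and in practice you would persist the bit array to disk and mmap it rather than keep it in memory:

import hashlib

N = 32                                   # hash width in bits; 2**32 bits is a 512 MiB filter
MASK = (1 << N) - 1

# Hash a canonicalized pair (smaller ID first) down to N bits.
def pair_hash(id1: str, id2: str) -> int:
    digest = hashlib.blake2b(f"{id1} {id2}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & MASK

# Build the bit array: bit n is set iff some pair hashes to n.
def build_filter(pairs) -> bytearray:
    bits = bytearray((1 << N) // 8)
    for id1, id2 in pairs:
        h = pair_hash(id1, id2)
        bits[h >> 3] |= 1 << (h & 7)
    return bits

# A clear bit proves the pair is absent; a set bit only means "possibly present".
def maybe_contains(bits, id1: str, id2: str) -> bool:
    h = pair_hash(id1, id2)
    return bool(bits[h >> 3] & (1 << (h & 7)))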

Gilles 'SO- stop being evil'

Use GNU/Unix sort. It is designed exactly for this sort of thing, and is perfect for removing duplicates. (See, e.g., sort -u or sort ... | uniq ....)

You might want to tune the initial buffer size with the -S parameter.

GNU sort already implements algorithms to sort files that are larger than can fit into main memory, so it can handle very large files. The algorithmic technique you are looking for is called "external sorting". There's lots written on the subject; you can do your own research for more on that.
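
For instance, a small Python sketch of driving that sort on a file of already canonicalized pairs; the file names and the 4G buffer size are placeholders:

import os, subprocess

# Sort an already canonicalized pair file with GNU sort.  LC_ALL=C makes the
# comparison byte-based (and faster), -u drops duplicate lines, and -S gives
# sort a larger in-memory buffer before it spills runs to temporary files.
subprocess.run(
    ["sort", "-u", "-S", "4G", "-o", "pairs.sorted", "pairs.canonical"],
    env={**os.environ, "LC_ALL": "C"},
    check=True,
)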

D.W.