1

I am trying to classify the data in a database columns. DB has about 90 million entries.The main goal is to find different patterns in columns to leverage it for create look alike data. The data columns has entries which can easily look like patterns as :

 CUST1212,CUST1213,CUST1214...,  number sequential after CUST
 CODE1213,CODE1242,CODE1289...   random numbers after code
 CUST1213,MUST8324,FIFA12313...  
 009-123-2123,003-124-2314,006-213-5322,
 INTER122,INTER222,INTER322...,  number increasing in batch of 100
 OM|ON|TO, IO|OI|UH...,  pipe delimited 
 some 6 and 7 digits number.

The issue is there are so many patterns and looks like creating patterns manually based on data is out of option.

I would really appreciate if one can point to how can I build a collection of patterns preferably by usage frequency.

What I started is 1. Segregate the data as per their length as patterns usually generate same string length, sorted them and stored them in different files. 2. Then classifying them further as per Alphabet prefix or numeric prefix. 3. For alpha-prefix segregate further on common alphabets at start.

Now seems I am lost and sincerely hoping if one can point out to me what are my options and probably best to took.

I believe if end of computation if get something like patterns in data and their frequency e.g. CUST[0-9]{4} -> 10k times, [a-zA-Z]{4} -> 50k times [0-9]{3}-[0-9]{3}-[0-9]{4} 10K times etc....., it would be great.

Thanks

Raphael
  • 73,212
  • 30
  • 182
  • 400
rohit
  • 121
  • 2

1 Answers1

1

The general approach is likely to be:

  1. Identify a set of candidate patterns that might appear.

  2. For each candidate pattern, check for all instances of that pattern.

I'm not aware of any simple procedure that avoids the need for step 1. The problem is that there are many possible patterns that might appear (e.g., "consists solely of Latin-8 characters"; "length is always at most 100") but most of them won't be meaningful or useful in your domain. Only you know what will be meaningful/useful. So, you'll have to use your domain knowledge to identify which potential patterns are useful -- I don't know of any algorithm that can do that for you.

D.W.
  • 167,959
  • 22
  • 232
  • 500