I am trying to classify the data in a database column. The DB has about 90 million entries. The main goal is to find the different patterns in the column so I can use them to generate look-alike data. The column entries fall into patterns such as:
CUST1212, CUST1213, CUST1214, ... (sequential numbers after CUST)
CODE1213, CODE1242, CODE1289, ... (random numbers after CODE)
CUST1213, MUST8324, FIFA12313, ...
009-123-2123, 003-124-2314, 006-213-5322, ...
INTER122, INTER222, INTER322, ... (number increasing in steps of 100)
OM|ON|TO, IO|OI|UH, ... (pipe-delimited)
some 6- and 7-digit numbers
The issue is that there are so many patterns that creating them manually from the data is not an option.
I would really appreciate a pointer on how I can build a collection of patterns, preferably ranked by usage frequency.
What I have started with:
1. Segregate the data by length, since a given pattern usually generates strings of the same length; sort the entries and store each length group in a separate file.
2. Classify those groups further by alphabetic vs. numeric prefix.
3. For the alpha-prefix groups, segregate further on common leading letters.
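To make the question concrete, here is a rough sketch of the bucketing steps above in Python (the bucket-key scheme is my own illustration, not a fixed design):

```python
from collections import defaultdict

def bucket_key(value: str) -> tuple:
    """Bucket a value by its length and the kind of its prefix."""
    if not value:
        return (0, "empty", "")
    if value[0].isalpha():
        # for alpha-prefixed values, also group by the leading letter run
        i = 0
        while i < len(value) and value[i].isalpha():
            i += 1
        return (len(value), "alpha", value[:i])
    if value[0].isdigit():
        return (len(value), "numeric", "")
    return (len(value), "other", "")

def bucket_values(values):
    buckets = defaultdict(list)
    for v in values:
        buckets[bucket_key(v)].append(v)
    return buckets

buckets = bucket_values(["CUST1212", "CUST1213", "CODE1242", "009-123-2123"])
# CUST1212 and CUST1213 share a bucket; 009-123-2123 goes to a numeric one
```

At 90 million rows each bucket would be written to its own file rather than held in memory, but the grouping logic is the same.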
Now I am lost, and I am sincerely hoping someone can point out what my options are and which approach is probably best.
Ideally, at the end of the computation I would get the patterns in the data along with their frequencies, e.g. CUST[0-9]{4} -> 10k times, [a-zA-Z]{4} -> 50k times, [0-9]{3}-[0-9]{3}-[0-9]{4} -> 10k times, etc.
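One way to get exactly that kind of output is to collapse every value into a character-class signature and count the signatures. A minimal sketch, assuming simple letter/digit generalization rules (these rules are my own choice and would need tuning):

```python
import re
from collections import Counter

def signature(value: str) -> str:
    """Collapse a value into a regex-like pattern: runs of letters become
    [A-Za-z]{n}, runs of digits become [0-9]{n}, other chars are escaped."""
    out = []
    for match in re.finditer(r"[A-Za-z]+|[0-9]+|.", value):
        tok = match.group()
        if tok.isalpha():
            out.append(f"[A-Za-z]{{{len(tok)}}}")
        elif tok.isdigit():
            out.append(f"[0-9]{{{len(tok)}}}")
        else:
            out.append(re.escape(tok))
    return "".join(out)

def pattern_frequencies(values):
    """Map each value to its signature and count how often each occurs."""
    return Counter(signature(v) for v in values)

freqs = pattern_frequencies(["CUST1212", "CUST1213", "009-123-2123"])
# CUST1212 and CUST1213 both map to [A-Za-z]{4}[0-9]{4}
```

A second pass over each signature group can then refine patterns further, e.g. detecting that the letter run is constant within a group and rewriting [A-Za-z]{4}[0-9]{4} as CUST[0-9]{4}, or checking whether the digit runs are sequential versus random.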
Thanks