I'm trying to use regular expressions to find a substring in a DNA string. The substring contains ambiguous bases, as in ATCGR, where R could be A or G. The script must also allow up to x mismatches. This is my code:
import regex
s = 'ACTGCTGAGTCGT'
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)
So, with one mismatch allowed, I would expect 4 substrings: AC**TGC**TGAGTCGT, ACTGC**TGA**GTCGT, ACTGCTG**AGT**CGT and ACTGCTGAGT**CGT**. The expected result should look like this:
['TGC', 'TGA', 'AGT', 'CGT']
But the output is
['TGC', 'TGA']
Even using re.findall, the code doesn't recognize the last substring. Moreover, if the code is set to allow 2 mismatches with {e<=2}, the output is still
['TGC', 'TGA']
Is there another way to get all the substrings?
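For reference, here is a minimal regex-free sketch of the behaviour I'm after, assuming that only substitutions count as mismatches (no insertions or deletions). The helper name `find_with_mismatches` and the `IUPAC` table are my own illustration, not part of the regex module, and the pattern is written with the ambiguity code directly (`TRT` instead of `T[AG]T`):

```python
# Hypothetical helper for illustration; not part of the regex module.
# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_with_mismatches(pattern, seq, max_mismatches):
    """Return every overlapping window of seq that matches pattern
    with at most max_mismatches substitutions."""
    k = len(pattern)
    hits = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        # A position mismatches when the base is not allowed by the pattern code.
        mismatches = sum(
            base not in IUPAC[code] for code, base in zip(pattern, window)
        )
        if mismatches <= max_mismatches:
            hits.append(window)
    return hits

s = 'ACTGCTGAGTCGT'
print(find_with_mismatches("TRT", s, 1))  # ['TGC', 'TGA', 'AGT', 'CGT']
```

This slides a window over the string and counts mismatches position by position, so overlapping hits are reported naturally. It is only a sketch of the expected output, not a replacement for a regex-based solution.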