5

I have a dataset of 3190 DNA instances, each consisting of 60 consecutive nucleotide positions and labeled as one of 3 classes: EI, IE, or Other.

I want to formulate a supervised classifier.

My current approach is to compute a second-order Markov transition matrix for each instance and feed the resulting values to a neural network.
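For concreteness, here is a minimal sketch of what such a per-instance feature could look like. It assumes the alphabet is just A, C, G, T and simply skips any triple containing another symbol. The function names `second_order_transition_matrix` and `flatten` are illustrative, not from any library.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four standard bases are counted
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 16 two-base states
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}
BASE_INDEX = {base: i for i, base in enumerate(ALPHABET)}


def second_order_transition_matrix(seq):
    """Estimate P(next base | previous two bases) as a 16 x 4 matrix."""
    counts = [[0] * len(ALPHABET) for _ in PAIRS]
    for i in range(len(seq) - 2):
        pair, nxt = seq[i:i + 2], seq[i + 2]
        # Skip triples that contain symbols outside the assumed alphabet.
        if pair in PAIR_INDEX and nxt in BASE_INDEX:
            counts[PAIR_INDEX[pair]][BASE_INDEX[nxt]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for pairs that never occur are left as all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def flatten(matrix):
    """Turn the 16 x 4 matrix into a 64-value feature vector."""
    return [p for row in matrix for p in row]
```

With only 58 triples per 60-nucleotide instance spread over 64 cells, most entries will be zero or based on one or two counts, so these features are likely to be sparse and noisy.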

How should I approach this classification problem, given that the order of the nucleotides should be relevant? Is there a better approach than the one I came up with?

akellyirl

1 Answer

3

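One way would be to split each 60-nucleotide sequence into 20 consecutive, non-overlapping triplets and create 20 features, each representing one codon. This gives a dataset with 3190 instances and 20 categorical features, each taking at most 64 distinct values if only A, C, G, and T occur. There is no need to treat the sequence as a Markov chain.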
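A minimal sketch of this featurization, assuming the reading frame starts at the first position of each sequence (the question does not say whether it does). The function name `codon_features` is hypothetical.

```python
def codon_features(seq, k=3):
    """Split a sequence into consecutive, non-overlapping k-mers (codons for k=3)."""
    if len(seq) % k != 0:
        raise ValueError("sequence length must be a multiple of k")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


# A 60-nucleotide instance yields 20 categorical features.
example = "ACG" * 20
features = codon_features(example)
```

Each position in the resulting list keeps its place in the sequence, so the positional information is preserved even though no transition probabilities are modeled.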

Once the dataset has been featurized as suggested above, most supervised classifiers should be applicable. I would suggest a gradient boosting machine, as it may be better suited to handling categorical features.

Nitesh