How to tackle different sample size in the training set in SVM

Question

I have to train an SVM for a classification problem. I have some strings that are paths in a deterministic finite automaton (DFA). If the alphabet is {0, 1}, then possible strings are, for example, 011101110 or 0110. The purpose of the classifier (SVM) is to correctly predict the label of unseen strings: accepting or rejecting (label 1 or label -1, binary classification). The problem is that the strings have different lengths. How can I tackle this problem?

D.W. · Answer 1 · 2017-04-26T21:27:31.037

An SVM classifier requires a fixed-length feature vector, i.e., all feature vectors must have the same length. There are multiple solutions:

Pad out the strings to fixed length.
Choose a different set of features, so that there is a fixed number of features.
Pick a fixed number $k$, and look at windows of length $k$ (i.e., substrings of length $k$). Classify each window of length $k$ using your SVM, then combine those results somehow (e.g., majority vote).
Use a different classifier, such as a recurrent neural network (e.g., LSTM). See also https://datascience.stackexchange.com/q/16115/8560 for more possibilities.
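
To make the first three options concrete, here is a minimal sketch in plain Python, assuming the binary alphabet from the question. All function names, the pad value of -1, and the tie-breaking rule are illustrative assumptions, not anything prescribed by an SVM library. Each function either turns a variable-length string into a fixed-length vector or splits it into fixed-length pieces.

```python
from collections import Counter
from itertools import product

ALPHABET = "01"  # assumed binary alphabet, as in the question


def pad_features(s, max_len, pad_value=-1):
    """Option 1: map each symbol to an int and right-pad to max_len.

    pad_value is an arbitrary marker for "no symbol here" (an assumption).
    """
    if len(s) > max_len:
        raise ValueError("string is longer than max_len")
    return [int(c) for c in s] + [pad_value] * (max_len - len(s))


def kgram_count_features(s, k=2):
    """Option 2: count each length-k substring.

    The vector always has len(ALPHABET) ** k entries, whatever len(s) is.
    """
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return [counts["".join(g)] for g in product(ALPHABET, repeat=k)]


def windows(s, k):
    """Option 3: all length-k windows of s, each to be classified separately."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def majority_vote(labels):
    """Combine per-window labels (+1 / -1); ties go to -1 (arbitrary choice)."""
    return 1 if sum(labels) > 0 else -1


print(pad_features("0110", 6))
print(kgram_count_features("0110", k=2))
print(windows("0110", 2))
```

The vectors from `pad_features` or `kgram_count_features` all have the same length, so they could be passed as rows to any SVM implementation (for instance scikit-learn's `sklearn.svm.SVC`). Whether any of these encodings retains enough information to predict DFA acceptance depends on the language being learned.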

It's hard to say what will be most appropriate for your particular situation, without knowing more about your learning task.

Based on your subsequent comments, it sounds like you want to learn a DFA. There's a lot written on learning DFAs, using Angluin's algorithm, SAT solvers, or other methods. Follow the link above for some entry points into the literature on that. I don't think an SVM is the right tool for that job -- this sounds like an XY problem.
