How do I learn experimental methodology? When is it relevant?

Question

I just graduated in Computer Science with a very theoretical background and no experience in Data Science or Artificial Intelligence, and I am working on my own to explore those two fields. More precisely, I am working on a toy k-means-like subspace clustering algorithm, and I think I have successfully learned the basic optimization techniques.

So now I have an ad-hoc subspace clustering algorithm, and... I'm stuck. How do I validate it? During my studies, I learned a lot of formal proof and model checking techniques, but I feel they are totally irrelevant here. I have read several AI papers, and there seems to be a strong tradition of experimental validation in clustering, with a lot of validity measures. My problem is that they mean nothing to me. I don't understand what they prove, or even why they are relevant. I would be very interested in a general course on experimental validation methodology (if that is actually a thing!), if possible with theoretical justification.

Moreover, I don't really know which properties people look for in clustering algorithms. What general theorems should I aim for? I understand this question may be too closely tied to what I am working on; a generic answer for k-means-like algorithms would be enough.

score 2 · Answer 1 · answered Oct 29 '15 at 12:15

On the metrics: a common technique is to generate simulated data with known labels, then see whether your algorithm can reproduce the clustering. If it can't, there's a problem; if it can, there may still be problems, but you're off the starting blocks. Typical measures would include ROC and AUC computed over pairs of objects (essentially how often the algo correctly places objects i and j in the same class, balanced against false positives).
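To make the simulated-data idea concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the two Gaussian blobs, the helper names (`make_blobs`, `kmeans`, `pair_counts`, `rand_index`), and plain Lloyd's k-means standing in for your own algorithm. The score is the Rand index, one simple pair-counting measure: the fraction of object pairs (i, j) on which the predicted clustering and the known labels agree about "same cluster" versus "different cluster".

```python
import random
from itertools import combinations


def make_blobs(centers, n_per_cluster, spread, rng):
    """Simulate 2-D points around known centers; return (points, true_labels)."""
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


def kmeans(points, k, rng, iterations=50):
    """Plain Lloyd's k-means, only a stand-in for the algorithm under test."""
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return assignment


def pair_counts(true_labels, predicted):
    """Count pairs by agreement: (true pos, false pos, false neg, true neg)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted[i] == predicted[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(true_labels, predicted):
    """Fraction of pairs on which the two labelings agree (1.0 is perfect)."""
    tp, fp, fn, tn = pair_counts(true_labels, predicted)
    return (tp + tn) / (tp + fp + fn + tn)


rng = random.Random(0)  # fixed seed so the experiment is repeatable
points, truth = make_blobs([(0.0, 0.0), (10.0, 10.0)], 50, 1.0, rng)
predicted = kmeans(points, 2, rng)
score = rand_index(truth, predicted)
print("Rand index on simulated data:", score)
```

Because the score only looks at pairs, it does not care which cluster is called 0 and which is called 1. A natural next step is to make the simulated data harder (more overlap, more clusters, more dimensions) and watch where the score starts to drop.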

Another tactic is to benchmark against well-known real datasets, e.g. the classic iris dataset. See how your algo performs in terms of speed and accuracy versus other algos.
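A benchmark of this kind can be sketched as a small harness that runs each candidate on the same labelled data and records a quality score and a wall-clock time. This is only an illustration under stated assumptions: the six 1-D measurements are a made-up stand-in for a real dataset such as iris, `threshold_split` and `all_in_one` are hypothetical baselines rather than real clustering algorithms, and the Rand index stands in for "accuracy".

```python
import time
from itertools import combinations


def rand_index(truth, predicted):
    """Fraction of object pairs on which both labelings agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (predicted[i] == predicted[j])
                for i, j in pairs)
    return agree / len(pairs)


def benchmark(algorithms, points, truth, repeats=5):
    """Return {name: (rand_index, best_seconds)} for each algorithm."""
    results = {}
    for name, algo in algorithms.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            predicted = algo(points)
            best = min(best, time.perf_counter() - start)
        results[name] = (rand_index(truth, predicted), best)
    return results


# Hypothetical baseline: split the points at their mean value.
def threshold_split(points):
    mean = sum(points) / len(points)
    return [int(p > mean) for p in points]


# Hypothetical trivial baseline: put everything in one cluster.
def all_in_one(points):
    return [0] * len(points)


# Made-up stand-in data; replace with a real labelled benchmark dataset.
points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
truth = [0, 0, 0, 1, 1, 1]

results = benchmark(
    {"threshold_split": threshold_split, "all_in_one": all_in_one},
    points,
    truth,
)
for name, (score, seconds) in results.items():
    print(f"{name}: rand_index={score:.3f}, best_time={seconds:.6f}s")
```

Including a trivial baseline such as `all_in_one` is a cheap sanity check: if your algorithm does not clearly beat it, the score is not telling you much.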

As to what people are looking for, this could mean a multitude of things. You might do well to consider what your algo is suited to, find an application in that area, and find out whether it performs well for that purpose.

score 1 · Answer 2 · answered Nov 02 '15 at 05:38

A Data Scientist differs from a Data Engineer in the way science differs from engineering.

Scientific methods (such as statistics, validation, experimental design, etc.) are what separate data scientists from data engineers. While programming skills are still essential, they are a tool for solving the problem rather than the problem solving itself: we use programming tools to gather, process and analyze data, but it is the scientific methods that make the results solid.

I would recommend that you learn more about statistics. I was a computer science graduate as well, and I personally learned statistics from Udacity (and other sources afterwards). I feel it's a nice place to start.
