6

I would like to summarize (as in R) the contents of a CSV (possibly after loading it, or storing it somewhere, that's not a problem). The summary should contain the quartiles, mean, median, min and max of the data in a CSV file for each numeric (integer or real numbers) dimension. The standard deviation would be cool as well.

I would also like to generate some plots to visualize the data, for example 3 plots for the 3 pairs of variables that are most correlated (by correlation coefficient) and 3 plots for the 3 pairs of variables that are least correlated.

R requires only a few lines to implement this. Are there any libraries (or tools) that would allow a similarly simple (and efficient if possible) implementation in Java or Scala?

PS: This is a specific use case of a previous (too broad) question.

Trylks

3 Answers

2

Check out Breeze and Apache Commons Math for the maths, and ScalaLab for some nice examples of how to plot things in Scala.

I've managed to get an environment set up where this would take just a couple of lines. I don't actually use ScalaLab itself; I borrow some of its code and use IntelliJ worksheets instead.
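For a sense of what the summary part involves, here is a minimal, dependency-free Java sketch. It is not the code from the setup described above, and it does not use Breeze or Commons Math; the class and method names (`CsvSummary`, `Stats`, `summarize`) are made up for illustration. It assumes a simple comma-separated file with a header row and no quoted fields, treats a column as numeric only if every cell parses as a number, and interpolates quartiles linearly between order statistics, which is what R's default quantile type does.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CsvSummary {

    /** Summary of one numeric column (hypothetical helper type). */
    static final class Stats {
        final double min, q1, median, q3, max, mean, stdDev;

        Stats(double[] values) {
            double[] s = values.clone();
            Arrays.sort(s);
            min = s[0];
            max = s[s.length - 1];
            q1 = quantile(s, 0.25);
            median = quantile(s, 0.50);
            q3 = quantile(s, 0.75);
            double sum = 0;
            for (double v : s) sum += v;
            mean = sum / s.length;
            double squares = 0;
            for (double v : s) squares += (v - mean) * (v - mean);
            // Sample standard deviation (n - 1 denominator); 0 for a single value.
            stdDev = s.length > 1 ? Math.sqrt(squares / (s.length - 1)) : 0.0;
        }
    }

    /** Quantile by linear interpolation between order statistics of a sorted array. */
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Summarizes every fully numeric column; the first line is taken as the header. */
    static Map<String, Stats> summarize(List<String> lines) {
        String[] header = lines.get(0).split(",");
        Map<String, Stats> result = new LinkedHashMap<>();
        for (int c = 0; c < header.length; c++) {
            double[] column = new double[lines.size() - 1];
            boolean numeric = column.length > 0;
            for (int r = 1; r < lines.size() && numeric; r++) {
                try {
                    column[r - 1] = Double.parseDouble(lines.get(r).split(",")[c].trim());
                } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
                    numeric = false; // skip columns with non-numeric or missing cells
                }
            }
            if (numeric) result.put(header[c].trim(), new Stats(column));
        }
        return result;
    }

    public static void main(String[] args) {
        // Inline sample data; the "label" column is skipped because it is not numeric.
        List<String> csv = Arrays.asList("x,y,label", "1,10,a", "2,20,b", "3,30,c", "4,40,d");
        summarize(csv).forEach((name, s) -> System.out.printf(
                "%s: min=%.2f q1=%.2f median=%.2f mean=%.2f q3=%.2f max=%.2f sd=%.2f%n",
                name, s.min, s.q1, s.median, s.mean, s.q3, s.max, s.stdDev));
    }
}
```

To run it on a real file, the lines could be read with `java.nio.file.Files.readAllLines` and passed to `summarize`. A library is still the better choice for anything beyond a quick look, since this sketch ignores quoting, missing values and large files.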

samthebest
1

If your data is numeric, try loading it into ELKI (Java). With the NullAlgorithm it will give you scatterplots, histograms and parallel coordinate plots. It's fast at reading the data; only the current Apache Batik-based visualization is slooow, because it uses SVG. :-( I mostly use it "headless".

It also has classes for various statistics (including higher-order moments on data streams), but I haven't seen them in the default UI yet.

Has QUIT--Anony-Mousse
0

I'd have a closer look at one of Apache Spark's modules: MLlib.
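Whichever library produces the correlations, picking the three most and three least correlated pairs is just a sort over all column pairs. The following plain-Java sketch shows that step without Spark; the names (`CorrelationRanking`, `Pair`, `rank`) are hypothetical and not part of MLlib. It assumes "most correlated" means largest absolute Pearson coefficient, that all columns have the same length, and that no column is constant (a constant column would give NaN).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CorrelationRanking {

    /** A pair of column indices with their Pearson correlation coefficient. */
    static final class Pair {
        final int a, b;
        final double r;

        Pair(int a, int b, double r) {
            this.a = a;
            this.b = b;
            this.r = r;
        }
    }

    /** Pearson correlation coefficient of two equally long columns. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** All column pairs, sorted from strongest to weakest absolute correlation. */
    static List<Pair> rank(double[][] columns) {
        List<Pair> pairs = new ArrayList<>();
        for (int a = 0; a < columns.length; a++) {
            for (int b = a + 1; b < columns.length; b++) {
                pairs.add(new Pair(a, b, pearson(columns[a], columns[b])));
            }
        }
        pairs.sort(Comparator.comparingDouble((Pair p) -> Math.abs(p.r)).reversed());
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up sample columns, one array per variable.
        double[][] columns = {
            {1, 2, 3, 4, 5},
            {2, 4, 6, 8, 10},
            {5, 4, 3, 2, 1},
            {1, -1, 0, -1, 1}
        };
        List<Pair> ranked = rank(columns);
        int k = Math.min(3, ranked.size());
        System.out.println("Most correlated pairs:");
        for (Pair p : ranked.subList(0, k)) {
            System.out.printf("  columns %d and %d: r=%.3f%n", p.a, p.b, p.r);
        }
        System.out.println("Least correlated pairs:");
        for (Pair p : ranked.subList(ranked.size() - k, ranked.size())) {
            System.out.printf("  columns %d and %d: r=%.3f%n", p.a, p.b, p.r);
        }
    }
}
```

The selected index pairs can then be handed to any plotting library as scatterplot axes. On data too large for one machine, the same ranking could be applied to a correlation matrix computed by a distributed library instead of the `pearson` helper here.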