
I have a large data frame (155257 x 21, to be specific) with only a few missing values: about 2.16% of the values need to be imputed. The values are floating-point numbers.

I'd like to use a method that favors speed over accuracy, because the data set is large and I don't have much to lose in a speed-accuracy tradeoff.

Running missForest() takes several hours while Hmisc's impute() function gives unsatisfactory results.

What functions in R might be useful in such a case (or a similar one)?


<code>mice_plot</code> output

neural-nut

1 Answer

Take a look at the h2o package: https://cran.r-project.org/web/packages/h2o/h2o.pdf

Everything is designed with parallelization in mind. I've had great success with many of their implementations, in R and Scala.

If you have to do it in R and are going for pure speed, I doubt you'll find anything faster.
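As a rough sketch of what this could look like, assuming a working Java installation and that simple per-column median imputation is acceptable for your data: the toy data frame `df` below is only a hypothetical stand-in for the real one, and `h2o.impute()` only offers mean, median, or mode fills, so it may turn out to be too crude for your needs.

```r
library(h2o)

# Start a local H2O cluster using all available cores
h2o.init(nthreads = -1)

# Hypothetical stand-in for the real 155257 x 21 data frame
set.seed(1)
df <- as.data.frame(matrix(rnorm(1000 * 21), ncol = 21))
df[sample(nrow(df), 20), 3] <- NA   # knock out a few values in one column

# Push the data into the H2O cluster
hf <- as.h2o(df)

# Fill the missing values of each column with that column's median.
# h2o.impute() modifies the H2O frame in place.
for (col in colnames(hf)) {
  h2o.impute(hf, column = col, method = "median")
}

# Pull the imputed data back into a regular R data frame
imputed <- as.data.frame(hf)
sum(is.na(imputed))   # number of missing values left

h2o.shutdown(prompt = FALSE)
```

The transfer into and out of the cluster has some overhead, so whether this beats a plain base-R median fill on 21 columns is something you would have to time on your own data.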