
I am trying to build a random forest model in R (RStudio). My training dataset has around 2 million rows and 38 variables. When I tested on 5,000 rows from this dataset I was able to build the random forest, but when I run it on the whole dataset I get the following error:

Error in randomForest.default(m, y, ...) : long vectors (argument 24) are not supported in .C

Can anyone suggest how I can fix this, apart from reducing the number of rows? Can I run multiple random forests and then combine them into one? If so, can someone please recommend how I could go about it?

Many thanks in advance.

T.H.

2 Answers


There are something like 30 random forest packages in R. "randomForest" is one of the first implementations and so is well known, but it's not great for large datasets. "ranger" is a good alternative: it's fast, handles large data, and supports parameter tuning searches. It's also easier to use through the "parsnip" package.

library(ranger)
library(parsnip)

Build model:

forest_model <- rand_forest(mtry = 12, trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression") %>%
  fit(dependent_variable ~ ., data = training_data)

Make predictions:

predict(forest_model, new_data)
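
If you would rather skip parsnip, roughly the same model can be fit with ranger directly. A minimal sketch, using the same placeholder names as above (dependent_variable, training_data, new_data):

library(ranger)

# Fit with the same settings as the parsnip call above
forest_model_rg <- ranger(
  dependent_variable ~ ., data = training_data,
  num.trees  = 1000,
  mtry       = 12,
  importance = "impurity"
)

# ranger's predict() returns a list; the fitted values are in $predictions
predict(forest_model_rg, data = new_data)$predictions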
Tori Oblad

I'm very surprised you're running into this error with only 2 million rows and 38 variables. I would encourage you to have a go at doing this in Python using scikit-learn and see if you run into the same issue. More generally though, if you have too much data (imagine you had 200 million rows), the right thing to do would not be to build multiple forests, but to build each tree with a smaller fraction of the data.
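
In R, ranger exposes exactly that through its sample.fraction argument. A rough sketch, assuming the full data frame still fits in memory and using placeholder names:

library(ranger)

# Each tree is grown on a small random subsample instead of a full-size bootstrap sample
subsampled_forest <- ranger(
  dependent_variable ~ ., data = training_data,
  num.trees       = 1000,
  sample.fraction = 0.01,  # roughly 1% of the rows per tree
  num.threads     = 4      # grow trees in parallel
)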

I don't have a suggestion for where to find such an implementation (there might be some way of combining scikit-learn and Dask; I've seen such things for XGBoost), but the general principle is that you wouldn't hold all of your data in memory: you'd read a subset of it from disk, train a tree, read a new subset, and so on.
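
For completeness, since the question asks about combining forests: randomForest::combine() can merge separately trained forests, so a chunk-at-a-time version of this idea might look roughly like the sketch below. The chunk files, chunk size, and variable names are hypothetical, and each tree only ever sees the chunk it was trained on, which is not quite the same as proper per-tree subsampling.

library(randomForest)

# Hypothetical directory of CSV chunks written out beforehand
chunk_files <- list.files("data_chunks", pattern = "\\.csv$", full.names = TRUE)

# Read one chunk at a time and train a small forest on it
forests <- lapply(chunk_files, function(f) {
  chunk <- read.csv(f)
  randomForest(dependent_variable ~ ., data = chunk, ntree = 50)
})

# Merge the per-chunk forests into a single ensemble
big_forest <- do.call(randomForest::combine, forests)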

gazza89