I am working with "medium large" data of around 200GB. The data are long-form log files, with several thousand log entries for each "entity". The entities are actually flights, and each log entry has its own timestamp. Temporal order matters (I need to sort on timestamp).
To prepare the data, I need to perform several operations on each flight: order the observations by time, select the last 300, and rescale by the mean and standard deviation. Nothing fancy.
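Concretely, the per-flight step looks roughly like this (a minimal sketch in base R; the column names `flight_id`, `timestamp`, and `value` are placeholders, not my actual schema):

```r
# Per-flight preparation: order by time, keep the last 300 observations,
# then center and scale the measurement column.
prep_flight <- function(df) {
  df <- df[order(df$timestamp), ]                          # enforce temporal order
  df <- tail(df, 300)                                      # last 300 observations
  df$value <- (df$value - mean(df$value)) / sd(df$value)   # standardize
  df
}
```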
Using a small sample of the data, I wrote a simple R program that split the dataframe by entity ID and applied a short function to each piece.
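On the sample, that amounted to something like the following (again a sketch, assuming the placeholder names `logs` and `flight_id`, plus the `prep_flight` helper above):

```r
# Split into one data frame per flight, apply the per-flight step,
# and stitch the pieces back together.
pieces <- split(logs, logs$flight_id)
result <- do.call(rbind, lapply(pieces, prep_flight))
```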
Now I need to scale up the code. I considered using Spark's groupBy and user-defined functions, but I know that Spark DataFrames are inefficient when you need to impose an ordering on the rows and relate each row to its neighbours under that ordering. And in any case, 200GB of data does not require a massively distributed solution.
So I would like to know: what are the best tools for scaling up a moderately sized data analysis when I need to split a large file into smaller data frames and apply a function to each, with control over the order of the rows within each group?