I'm trying to filter a Dask dataframe with groupby.
df = df.set_index('ngram');
sizes = df.groupby('ngram').size();
df = df[sizes > 15];
However, df.head(15) throws the error ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.. The divisions on sizes are not known:
>>> df.known_divisions
True
>>> sizes.known_divisions
False
A workaround is to do sizes.compute() or .to_csv(...) and then read it back to Dask with dd.from_pandas or dd.read_csv. Then sizes.known_divisions would return True. That's a notable inconvenience.
How else can this be solved? Am I using Dask wrong?
Note: there's an unanswered dublicate here.