Background
I'm analyzing a relatively large Arabic text dataset in Python (50,000 to 70,000 text files; about 5 GB in total).
I want to segment, stem, and POS-tag the dataset. I am aware of two Python libraries that can do all three tasks: farasapy and camel parser.
The problem
Both of these Arabic-compatible Python libraries are too slow for my purposes.
Based on a preliminary run on a subset of the data (n = 1,000 text files), I estimate that camel parser would take around 29 days to process the full dataset.
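An estimate of this kind is a linear extrapolation from the sample run to the full file count. A minimal sketch of that arithmetic follows; the function name `estimate_total_days` is hypothetical, and the 10-hour sample timing is a placeholder for illustration, not a measurement from this dataset.

```python
def estimate_total_days(sample_seconds: float, sample_files: int, total_files: int) -> float:
    """Linearly extrapolate the run time of a sample to the full dataset, in days."""
    if sample_files <= 0:
        raise ValueError("sample_files must be positive")
    seconds_per_file = sample_seconds / sample_files
    return seconds_per_file * total_files / 86_400  # 86,400 seconds in a day


# Placeholder timing: assume the 1,000-file sample took 10 hours (hypothetical value).
sample_seconds = 10 * 3600
for total_files in (50_000, 70_000):
    days = estimate_total_days(sample_seconds, 1_000, total_files)
    print(f"{total_files} files: ~{days:.1f} days")
```

This assumes the sample files are representative in size; if file sizes vary a lot, extrapolating per byte rather than per file would likely be more reliable.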
Question
Is there a faster method or library for segmenting, stemming, and POS-tagging Arabic text at this scale?
I would appreciate your suggestions.