Background

I'm analyzing a relatively large Arabic text dataset using Python (50,000-70,000 text files; ~5 GB total).

I want to segment, stem, and POS tag the dataset. I am aware of two Python libraries that can perform these three tasks: farasapy and CAMeL Parser.
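
For anyone unfamiliar with the camel-tools API that CAMeL Parser builds on, a minimal sketch of the three tasks might look like the following (class and key names are taken from the camel-tools documentation; treat the exact calls as an assumption to verify against your installed version):

```python
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.disambig.mle import MLEDisambiguator

# Requires the camel-tools pretrained data packages to be downloaded beforehand.
mle = MLEDisambiguator.pretrained()

text = 'ذهب الطالب إلى المكتبة'
tokens = simple_word_tokenize(text)  # word-level segmentation

for d in mle.disambiguate(tokens):
    if d.analyses:  # guard against tokens with no analysis
        top = d.analyses[0].analysis  # highest-scoring analysis
        print(d.word, top['stem'], top['pos'])  # stem and POS tag
```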

The problem

However, both of these Arabic-compatible Python libraries are rather slow for my purposes.

Based on a preliminary run on a subset of the data (n = 1,000 text files), I estimate that CAMeL Parser would take around 29 days to process the full dataset.
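
The estimate is a simple linear extrapolation from the subset timing; as a sketch (the numbers below are placeholders, not my measured values):

```python
# Linear extrapolation from a timed subset; all numbers are placeholders.
subset_files = 1_000
subset_hours = 11.6          # hypothetical wall-clock time for the subset
total_files = 60_000         # midpoint of 50,000-70,000

est_days = subset_hours / subset_files * total_files / 24
print(f'estimated total runtime: {est_days:.0f} days')  # -> ~29 days
```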

Question

Is there a more efficient method or library for performing these three tasks on Arabic text?

I would appreciate your suggestions.


1 Answer


I will summarize what I tried to improve the performance of CAMeL Parser 2.0 and what worked:

  1. used a virtual machine with a 64-core CPU: no significant speedup
  2. used all 64 cores with several parallelization libraries (multiprocessing.Pool, multiprocess.Pool, concurrent.futures.ProcessPoolExecutor, mpire.WorkerPool; see the sketch after this list): no significant speedup
  3. used a powerful GPU: significant speedup (~2x)
  4. used a powerful GPU combined with parallelization: no further speedup
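
For reference, the file-level parallelization in item 2 followed the usual pattern below (a sketch; process_file and the corpus path are placeholders for the actual per-file pipeline):

```python
import multiprocessing as mp
from pathlib import Path

def process_file(path):
    # Placeholder for the real per-file work (segment, stem, POS tag).
    text = Path(path).read_text(encoding='utf-8')
    return path.name, len(text)

if __name__ == '__main__':
    files = sorted(Path('corpus').glob('*.txt'))  # hypothetical corpus directory
    # One worker per core; chunksize batches tasks to amortize IPC overhead.
    with mp.Pool(processes=64) as pool:
        results = pool.map(process_file, files, chunksize=32)
```

Note that when the heavy lifting happens on a single GPU, CPU workers simply queue up behind it, which is consistent with items 2 and 4 showing no gain from parallelization.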

In short, a powerful GPU is the key to speeding up CAMeL Parser.
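
Since CAMeL Parser's neural models run on PyTorch (an assumption worth verifying for your version), it is worth confirming that the GPU is actually being used; a quick check under a CUDA setup:

```python
import torch

# If this prints False, inference silently falls back to the much slower CPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```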
