
There are many simple plagiarism detection algorithms that work with search engines like Google. I want to build an index over a corpus of the whole internet to serve as a back-end database for my plagiarism detection software. What would be the right approach to build such a database? Are there any open-source or collaboratively maintained live repositories?

Somewhere I read that instead of keeping a local database of the entire internet, one can build an index over it and use that for faster search.
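To make the "index instead of store" idea concrete, here is a minimal sketch of an inverted index over word shingles (overlapping n-word sequences), a common building block for plagiarism detection. The class and helper names are hypothetical, and this is not a crawler; it only illustrates how an index maps shingles to document ids so candidate sources can be looked up without storing full pages side by side.

```python
from collections import defaultdict

def shingles(text, n=3):
    """Yield overlapping n-word shingles from a text (illustrative helper)."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

class ShingleIndex:
    """Toy inverted index: shingle -> set of document ids."""

    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(set)

    def add(self, doc_id, text):
        # Index every shingle of the document.
        for s in shingles(text, self.n):
            self.index[s].add(doc_id)

    def candidates(self, text):
        # Count how many shingles of `text` each indexed document shares;
        # documents with many shared shingles are plagiarism candidates.
        hits = defaultdict(int)
        for s in shingles(text, self.n):
            for doc_id in self.index[s]:
                hits[doc_id] += 1
        return sorted(hits.items(), key=lambda kv: -kv[1])

idx = ShingleIndex()
idx.add("doc1", "the quick brown fox jumps over the lazy dog")
idx.add("doc2", "a completely unrelated sentence about databases")
print(idx.candidates("quick brown fox jumps"))  # doc1 shares 2 shingles
```

At web scale the same idea is usually combined with hashing (e.g. MinHash or simhash of the shingles) so the index stays compact.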

I know Elasticsearch seems usable for this. Has anyone tried it before?
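For reference, Elasticsearch's built-in `more_like_this` query is one way to turn an indexed corpus into a plagiarism-candidate lookup: you submit the suspicious text and get back the most lexically similar indexed documents. A sketch of such a query (the index name `corpus` and field name `body` are assumptions):

```
POST /corpus/_search
{
  "query": {
    "more_like_this": {
      "fields": ["body"],
      "like": "text of the suspicious document ...",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```

This only finds candidates by term overlap; an actual detector would still need a fine-grained alignment step on the returned documents.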

Shiva

1 Answer


I want to have a local database of corpus of the whole internet

Are you Google? If not, storage might be an issue ;)

The PAN series has run various tasks related to plagiarism detection in the past: https://pan.webis.de/tasks.html#task-originality. I think they provide annotated datasets, and they used to provide a live search engine.

Erwan