
There are many simple plagiarism detection algorithms that work with search engines like Google. I want to build an index over a corpus of the whole internet to serve as a back-end database for my plagiarism detection software. What would be the right approach to build such a database? Are there any open-source or collaboratively maintained live repositories?

Somewhere I read that instead of keeping a local database of the entire internet, one can build an index over it and use that for faster search.
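To make the "index instead of store" idea concrete, here is a minimal sketch of an inverted index over word shingles (overlapping n-word sequences), a common building block for plagiarism detection. The class and helper names are hypothetical, and this is not a crawler; it only illustrates how an index maps shingles to document ids so candidate sources can be looked up without storing full pages side by side.

```python
from collections import defaultdict

def shingles(text, n=3):
    """Yield overlapping n-word shingles from a text (illustrative helper)."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

class ShingleIndex:
    """Toy inverted index: shingle -> set of document ids."""

    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(set)

    def add(self, doc_id, text):
        # Index every shingle of the document.
        for s in shingles(text, self.n):
            self.index[s].add(doc_id)

    def candidates(self, text):
        # Count how many shingles of `text` each indexed document shares;
        # documents with many shared shingles are plagiarism candidates.
        hits = defaultdict(int)
        for s in shingles(text, self.n):
            for doc_id in self.index[s]:
                hits[doc_id] += 1
        return sorted(hits.items(), key=lambda kv: -kv[1])

idx = ShingleIndex()
idx.add("doc1", "the quick brown fox jumps over the lazy dog")
idx.add("doc2", "a completely unrelated sentence about databases")
print(idx.candidates("quick brown fox jumps"))  # doc1 shares 2 shingles
```

At web scale the same idea is usually combined with hashing (e.g. MinHash or simhash of the shingles) so the index stays compact.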

I know Elasticsearch seems usable for this. Has anyone tried it before?
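For reference, Elasticsearch's built-in `more_like_this` query is one way to turn an indexed corpus into a plagiarism-candidate lookup: you submit the suspicious text and get back the most lexically similar indexed documents. A sketch of such a query (the index name `corpus` and field name `body` are assumptions):

```
POST /corpus/_search
{
  "query": {
    "more_like_this": {
      "fields": ["body"],
      "like": "text of the suspicious document ...",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```

This only finds candidates by term overlap; an actual detector would still need a fine-grained alignment step on the returned documents.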

Shiva

1 Answer


I want to have a local database of corpus of the whole internet

Are you Google? If not, storage might be an issue ;)

The PAN series has run various tasks related to plagiarism detection in the past: https://pan.webis.de/tasks.html#task-originality. I think they provide annotated datasets, and they used to provide a live search engine.

Erwan