Questions tagged [crawling]
13 questions
28
votes
7 answers
Publicly available social network datasets/APIs
As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics…
Rubens
11
votes
5 answers
LinkedIn web scraping
I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately the LinkedIn API seems pretty limited to begin with; for example, you can only get basic data on companies, and this is detached from data on individuals. I'd…
christopherlovell
8
votes
5 answers
How to scrape a website with a search bar
How do I scrape a website that basically looks like Google, with just a giant search bar in the middle of the screen? From it you can search for various companies and their stats.
I have a list of 1000 companies I want to get information about. I…
Ceylon
4
votes
2 answers
Web Scraping - a scientific database
I am searching a scientific database for abstracts of papers containing the words "project management". Here is the link:
For getting abstracts, I need to click on any paper and open a new page. How can I do that for 68 papers? I program in R and…
Hamideh
3
votes
4 answers
Format for storing textual data
For an upcoming project, I'm mining textual posts from an online forum, using Scrapy. What is the best way to store this text data? I'm thinking of simply exporting it into a JSON file, but is there a better format? Or does it not matter?
cakesofwrath
3
votes
0 answers
How can I find company descriptions for a long list of companies?
I'm going to train an ML algorithm to qualify potential sales leads based on company descriptions. To do this, I need to find the company descriptions programmatically.
E.g. given a long list of company names, how can I find descriptions for these…
Per Borgen
2
votes
3 answers
Crawling customer reviews from Amazon
I want to know if there is any way to crawl customer reviews for particular products from Amazon without being blocked. At the moment, my crawler is blocked after a few attempts. Any ideas would be appreciated.
bensw
2
votes
0 answers
Is there a way to scrape tweets in realtime from a list of specified users?
I am trying to build a scraper that runs continuously and saves tweets from a list of users instantly, or within seconds of each tweet being posted. It could save the tweet details to a continuously updated CSV file.
niusoski
1
vote
1 answer
Publicly available news APIs/datasets?
In addition to our list of publicly available datasets, I'd like to know if there is any list of publicly available news datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics of the data available…
stevec
1
vote
2 answers
Data extraction using crawlers
I have a rather simple data scraping task, but my knowledge of web scraping is limited. I have an Excel file containing the names of 500 cities in a column, and I'd like to find their distance from a fixed city, say Montreal. I have found this…
Jay
0
votes
1 answer
Corpus development for plagiarism detection
There are many simple plagiarism detection algorithms that work on top of search engines such as Google. I want to have an index of a corpus covering the whole internet to serve as a back-end database for my plagiarism detection software.
What should be the…
Shiva
0
votes
1 answer
Is there a ubiquitous web crawler that can generate a good language-specific dataset for training a transformer?
It seems like a lot of noteworthy AI tools are being trained on datasets generated by web crawlers rather than on human-edited, human-compiled corpora (Facebook Translate, GPT-3). In general, it seems preferable to have an automatic and universal way…
Julius Hamilton
-3
votes
4 answers
Looking for a web scraping tool for unstructured data
I want to scrape some data from a website.
I have used import.io but am still not satisfied with it. Can anyone suggest an alternative? What is the best tool for getting unstructured data from the web?
cap