I want to crawl a website that has strong anti-bot protection, and I want to crawl its data as fast as possible. So I figure I need a multi-login-cookie, multi-user-agent, multi-proxy crawler.
I have tens of usernames and passwords; I can log in with each one and get its cookies. To hide the identity of my crawler, I thought I should also rotate the user-agent setting and my IP, and I have collected many user agents and proxies.
I have learned that a cookie must be sent with every request to the server, that it should belong to the same identity, and that it carries state from the previous request and its response. I learned how to pass cookies through requests without logging in from this answer. I also know two ways to log in: one outside Scrapy (by passing the cookie to the cookies middleware in the middleware.py file:
import random

from cookies import cookies  # my script that logs in to each account and returns the cookies

class CookiesMiddleware(object):
    def process_request(self, request, spider):
        # attach a randomly chosen login cookie to every outgoing request
        cookie = random.choice(cookies)
        request.cookies = cookie
), and another inside Scrapy itself, as sketched below.
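For reference, this is roughly what I mean by logging in inside Scrapy; the login URL and form field names are placeholders, not the real site:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['https://example.com/login']  # placeholder login URL

    def parse(self, response):
        # submit the login form; the field names here are placeholders
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user1', 'password': 'pass1'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy's cookie middleware keeps the session cookies for later requests
        yield scrapy.Request('https://example.com/data', callback=self.parse_data)

    def parse_data(self, response):
        pass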
What's more, in the middleware.py file I also pass a randomly chosen user agent to the Scrapy requests, in the same way as the cookies.
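Concretely, the user-agent part looks something like this (the strings in user_agents are just examples; my real list is longer):

import random

# example user-agent strings; my real list is longer
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0',
]

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # overwrite the User-Agent header of every outgoing request
        request.headers['User-Agent'] = random.choice(user_agents)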
My question is: if I pass the cookies randomly as shown above, will one spider get the same cookie each time it sends a request? If not, the server will detect me as a bot and block me. Worse, the same applies to the user agents and proxies. How can I bind each trinity (login cookie, user agent, and proxy) together, starting from the login, extending the answer linked above in both the horizontal and vertical dimensions?
To be more precise, should I pass the login cookie as cookies=user1_cookie or as meta={'cookiejar': user1_cookie}? And should I pass the user agent and proxy through the meta parameter? A sketch of what I am imagining follows.
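To illustrate what I mean by binding the trinity, here is a rough sketch of what I have in mind; IDENTITIES, the cookie values, and the proxy URLs are placeholders, and I am not sure this is the right approach:

import random

# placeholder identities: each entry binds one login cookie, one user agent and one proxy
IDENTITIES = [
    {
        'cookies': {'sessionid': 'cookie_of_user1'},
        'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'proxy': 'http://proxy1.example.com:8080',
    },
    {
        'cookies': {'sessionid': 'cookie_of_user2'},
        'user_agent': 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0',
        'proxy': 'http://proxy2.example.com:8080',
    },
]

class IdentityMiddleware(object):
    def process_request(self, request, spider):
        # pick one identity and apply all three settings together,
        # keyed by 'cookiejar' so the same identity keeps the same cookie jar
        identity_id = request.meta.get('identity', random.randrange(len(IDENTITIES)))
        identity = IDENTITIES[identity_id]
        request.meta['identity'] = identity_id
        request.meta['cookiejar'] = identity_id
        request.cookies = identity['cookies']
        request.headers['User-Agent'] = identity['user_agent']
        request.meta['proxy'] = identity['proxy']

I do not know whether this keeps the binding intact across follow-up requests, which is exactly what I am asking about.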
Thanks. Please kindly point me in the right direction; any suggestions will be much appreciated.