
I want to crawl a website that has strong anti-bot protection, and I want to gather the data as fast as possible. So I figured I need a multi-login-cookie, multi-user-agent, multi-proxy crawler.

I have tens of usernames and passwords, and I can log in with each one and collect its cookies. To hide my crawler's identity I thought I should also rotate the user-agent setting and my IP, so I have collected many user agents and proxies.

I learned that the cookie has to be sent with every request, that it should belong to the same identity throughout, and that it carries state from the previous request and its response. I learned how to pass it along with requests without logging in again from this answer. I also know two ways to log in: one outside Scrapy (by passing the cookies to the CookiesMiddleware in the middleware.py file:

import random

from cookies import cookies  # my own script that logs in each account and returns its cookies

class CookiesMiddleware(object):
    def process_request(self, request, spider):
        # Attach a randomly chosen account's cookies to every outgoing request.
        request.cookies = random.choice(cookies)

) and another inside it (by handling the login within the spider itself; a sketch follows below).
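For the second, in-Scrapy variant, here is a minimal sketch of what the login could look like using Scrapy's built-in cookiejar support (the accounts module, the login URL, and the form field names are hypothetical placeholders):

import scrapy

from accounts import accounts  # hypothetical: a list of {'username': ..., 'password': ...} dicts

class LoginSpider(scrapy.Spider):
    name = 'login_spider'

    def start_requests(self):
        # One login request per account; the 'cookiejar' meta key tells
        # Scrapy's built-in CookiesMiddleware to keep a separate cookie
        # session per key, here the account index.
        for i, account in enumerate(accounts):
            yield scrapy.FormRequest(
                'https://example.com/login',  # hypothetical login URL
                formdata={'username': account['username'],
                          'password': account['password']},
                meta={'cookiejar': i},
                callback=self.after_login,
            )

    def after_login(self, response):
        # Pass the same cookiejar key on, so follow-up requests stay
        # inside this account's session.
        yield scrapy.Request(
            'https://example.com/data',  # hypothetical target page
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse_data,
        )

    def parse_data(self, response):
        pass  # extract items here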

What's more, in the middleware.py file I assign random user agents to the Scrapy requests in the same way as the cookies.
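That user-agent middleware looks roughly like this (a sketch; the user_agents module is my own, analogous to the cookies one above):

import random

from user_agents import user_agents  # hypothetical: a plain list of user-agent strings

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a randomly chosen one.
        request.headers['User-Agent'] = random.choice(user_agents)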

My question is: if I pass the cookies randomly as described above, will one spider get the same cookie every time it sends a request? If not, the server side will detect me as a bot and block me. What's worse, the same applies to the user agents and proxies. How can I bind each trinity (login cookie, user agent, and proxy) together, starting from the login, extending the aforementioned answer in both the horizontal and vertical dimensions?

To be more precise, should I pass the login cookie in the form cookies=user1_cookie or meta={'cookiejar': user1_cookie}? And should I pass the user agent and proxy in the meta parameter?

Thanks. Please point me in the right direction; any suggestions will be highly appreciated.

– Lerner Zhang

1 Answer


Seems like you are looking for cookiejar. It will allow you to store multiple cookie sessions in a single spider session.
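A minimal sketch of the mechanism (following the pattern from the Scrapy docs; the URLs are placeholders):

import scrapy

class MultiSessionSpider(scrapy.Spider):
    name = 'multi_session'
    start_urls = ['https://example.com/a', 'https://example.com/b']

    def start_requests(self):
        # Each distinct 'cookiejar' value gets its own isolated cookie
        # session inside the same spider.
        for i, url in enumerate(self.start_urls):
            yield scrapy.Request(url, meta={'cookiejar': i})

    def parse(self, response):
        # The cookiejar key is not sticky: it must be passed along
        # explicitly on every follow-up request.
        yield scrapy.Request(
            response.urljoin('next/page'),
            meta={'cookiejar': response.meta['cookiejar']},
        )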

Using a middleware for random cookies is a bad idea, since cookies in most cases store your whole browsing session.
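If you want to bind the whole trinity (login cookie, user agent, proxy) together, one possible sketch is a single middleware that derives the user agent and proxy from the same cookiejar key, so each login session always presents the same identity. This assumes the cookiejar keys are integer account indices, and the pools below are placeholders:

# Placeholder pools; substitute your own collected values.
USER_AGENTS = ['Mozilla/5.0 (UA one)', 'Mozilla/5.0 (UA two)']
PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']

class IdentityMiddleware(object):
    """Pins the user agent and proxy to the request's cookiejar key, so
    each login session always shows up with the same identity."""

    def process_request(self, request, spider):
        key = request.meta.get('cookiejar')
        if key is None:
            return  # request not bound to any identity
        request.headers['User-Agent'] = USER_AGENTS[key % len(USER_AGENTS)]
        request.meta['proxy'] = PROXIES[key % len(PROXIES)]

The built-in HttpProxyMiddleware picks up request.meta['proxy'], so no extra proxy plumbing is needed.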

– Granitosaurus
  • The doc seems a little bit too sketchy to me, but I have not found more tutorials or examples about the cookiejar. – Lerner Zhang Jul 08 '16 at 08:05
  • So does it mean that cookies stored in my way will be randomly changed for each request and I should employ cookiejar to manage them? – Lerner Zhang Jul 08 '16 at 08:17
  • There's a really good write up and explanation here: https://blog.scrapinghub.com/2016/03/23/scrapy-tips-from-the-pros-march-2016-edition/ However now that I read through your post again I think your issue is more complicated and cookiejar might not be a solution you are looking for if you only have one simultaneous request. – Granitosaurus Jul 08 '16 at 08:24
  • I thought I should merge the login cookie into the general cookies and then utilize cookiejar. But that may only solve the login-cookie and normal-cookie problem, not the other two: the user agent and the proxy. – Lerner Zhang Jul 08 '16 at 08:34