5

I was just wondering if there's any reasonable way to pass authentication cookies from a webdriver.Firefox() instance to the spider itself? It would be helpful to perform some webdriver work and then go about scraping "business as usual". Something to the effect of:

def __init__(self):
    BaseSpider.__init__(self)
    self.selenium = webdriver.Firefox()

def __del__(self):
    self.selenium.quit()
    print self.verificationErrors

def parse(self, response):

    # Initialize the webdriver, get login page
    sel = self.selenium
    sel.get(response.url)
    sleep(3)

    ##### Transfer (sel) cookies to (self) and crawl normally??? #####
    ...
    ...
dru
  • Should be possible, I have the same issue but working with PHP curl and Selenium. The bigger hassle is converting the cookie(s) returned by Selenium into a format usable by the other tool (Scrapy). In the case of curl, it doesn't use the same format as Selenium, so you can't simply pass the cookie over and use it directly. – David Jun 14 '12 at 23:28
  • To get the cookies from the webdriver, I believe it would be driver.get_cookies(); store that in a variable, convert the format if needed, then pass it as input to the other tool (see the sketch after these comments). – David Jun 14 '12 at 23:32
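As a rough illustration of David's comment (the login URL and flow below are hypothetical, not from the question): Selenium's get_cookies() returns a list of cookie dicts, and a plain {name: value} mapping is usually the easiest form to hand to another tool.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://example.com/login')  # hypothetical login page
# ... perform the login through the browser here ...

# Selenium returns a list of dicts, e.g.
# [{'name': 'sessionid', 'value': 'abc123', 'domain': 'example.com', ...}, ...]
selenium_cookies = driver.get_cookies()

# Reduce it to a plain {name: value} mapping for the consuming tool
cookie_dict = dict((c['name'], c['value']) for c in selenium_cookies)
driver.quit()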

4 Answers

2

Transfer Cookies from Selenium to Scrapy Spider

Selenium script

import json

from selenium import webdriver

driver = webdriver.Firefox()
data = driver.get_cookies()

# Write the cookies to a temporary file
with open('cookie.json', 'w') as outputfile:
    json.dump(data, outputfile)
driver.close()

....

Spider

import json
import os

# Inside the spider; st_size > 2 skips a file holding only an empty JSON list
if os.stat("cookie.json").st_size > 2:
    with open('cookie.json', 'r') as inputfile:
        self.cookie = json.load(inputfile)
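The answer stops at loading the cookies; a minimal sketch of actually sending them with the spider's requests (the spider name and URL here are hypothetical, not part of the original answer) would be to attach them in start_requests via the cookies argument of Request:

import json
import os

from scrapy.http import Request
from scrapy.spider import BaseSpider  # scrapy.Spider in newer versions


class CookieSpider(BaseSpider):  # hypothetical name
    name = 'cookie_spider'
    start_urls = ['https://example.com/']  # hypothetical target

    def start_requests(self):
        selenium_cookies = []
        # st_size > 2 skips a file that only holds an empty JSON list
        if os.stat("cookie.json").st_size > 2:
            with open('cookie.json', 'r') as inputfile:
                selenium_cookies = json.load(inputfile)
        # Reduce Selenium's cookie dicts to a plain {name: value} mapping
        cookies = dict((c['name'], c['value']) for c in selenium_cookies)
        for url in self.start_urls:
            yield Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        pass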
George
0

This works with the Chrome driver (tested OK), but not with Firefox.
Refer to https://christopher.su/2015/selenium-chromedriver-ubuntu/ for installation.

import os
import pickle

import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


class HybridSpider(InitSpider):
    name = 'hybrid'

    def init_request(self):
        # Log in through the browser first
        driver = webdriver.Chrome()
        driver.get('https://example.com')
        driver.find_element_by_id('js-login').click()
        driver.find_element_by_id('email').send_keys('mymail@example.net')
        driver.find_element_by_id('password').send_keys('mypassword', Keys.ENTER)

        # Dump the browser cookies to disk, then read them back
        pickle.dump(driver.get_cookies(), open(os.getenv("HOME") + "/my_cookies", "wb"))
        cookies = pickle.load(open(os.getenv("HOME") + "/my_cookies", "rb"))

        # Request each URL with the Selenium cookies attached
        with open(os.getenv("HOME") + "/my_urls", 'r') as url_file:
            for url in url_file.readlines():
                yield Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        pass

Haven't tried directly passing the cookies, as in:

yield Request(url, cookies=driver.get_cookies(), callback=self.parse)

but it might work too.

Hemanth Gowda
0
driver = webdriver.Chrome()

Then perform the login or interact with the page through the browser. Now, when making the request in Scrapy, set the cookies parameter:

request = Request(URL, cookies=driver.get_cookies(), callback=self.mycallback)
Deskom88
0

You can try overriding the BaseSpider.start_requests method to attach the needed cookies to the starting requests, using scrapy.http.cookies.CookieJar.

See also: Scrapy - how to manage cookies/sessions
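A rough sketch of that suggestion (spider name and URLs are hypothetical): do the browser login up front, then yield the starting requests with the Selenium cookies attached; Scrapy's cookies middleware keeps them in its own CookieJar for the rest of the crawl.

from scrapy.http import Request
from scrapy.spider import BaseSpider  # scrapy.Spider in newer versions
from selenium import webdriver


class SeleniumLoginSpider(BaseSpider):  # hypothetical name
    name = 'selenium_login'
    start_urls = ['https://example.com/private/']  # hypothetical target

    def start_requests(self):
        driver = webdriver.Firefox()
        driver.get('https://example.com/login')  # hypothetical login page
        # ... perform the login through the browser here ...
        cookies = driver.get_cookies()
        driver.quit()
        for url in self.start_urls:
            # Selenium's list of cookie dicts can usually be passed as-is;
            # reduce it to {name: value} if extra keys cause trouble
            yield Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        pass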

warvariuc