I want to scrape some websites with selenium.

I can successfully run the selenium script on EC2, but, as we know, an EC2 instance is associated with a fixed public IP. So I want to integrate an Amazon API Gateway rotating proxy into my Python selenium script.

I read this SO question that integrates AWS API Gateway with the Python requests module. How can I achieve a similar integration with selenium?

Any guide would be appreciated.

Praful Bagai
  • Just to be clear: your need is to run your selenium script each time with a different IP so the remote scraped site is less likely to block your ip? – Mattia Galati Aug 30 '22 at 10:33
  • That's right. FYI - I'm aware of how to pass public proxies into selenium webdriver. But the downside of using public proxies is slowness & uptime. Hence I want to use AWS API-gateway to accomplish the same. – Praful Bagai Aug 30 '22 at 12:23
  • 2
    First of all, keep in mind that intensive scraping could stress target website and could lead to downtimes. If a website uses IP throttling or blocking I suppose they don't want to be scraped so frequently or at all and you should respect their work. Before proceed please think about it. That said, I suggest a different approach: 1) create an AMI of your EC2, 2) each time you need to run the script, start a SPOT instance base on that AMI, 3) Destroy the spot instance when the script has finished. In this way you will have a different public IP for each instance. – Mattia Galati Aug 31 '22 at 08:27
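The spot-instance approach from the last comment could be sketched with boto3 roughly as follows. This is an untested sketch, not the commenter's exact setup; the AMI ID and instance type in the usage comments are hypothetical placeholders:

```python
def spot_run_params(ami_id, instance_type="t3.micro"):
    # Parameters for ec2.run_instances(): launch a single one-time spot
    # instance from the AMI baked from the original EC2 box. Each launch
    # gets a fresh public IP.
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    }

# Usage (requires boto3 and AWS credentials; the AMI ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2")
# instance = ec2.run_instances(**spot_run_params("ami-0123456789abcdef0"))["Instances"][0]
# ... wait for it to boot, run the scraper on it, then ...
# ec2.terminate_instances(InstanceIds=[instance["InstanceId"]])
```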

1 Answer

You can try the following utility:

  • https://github.com/teticio/lambda-scraper

    import json
    import boto3

    num_proxies = 10  # number of proxy-N Lambda functions deployed
    url = 'https://ipinfo.io/ip'

    lambda_client = boto3.client('lambda')
    round_robin = 0
    while True:
        # Invoke the proxy Lambdas in round-robin order; each function
        # makes its request from a different IP
        response = json.loads(
            lambda_client.invoke(FunctionName=f'proxy-{round_robin}',
                                 InvocationType='RequestResponse',
                                 Payload=json.dumps({"url": url}))['Payload'].read())
        print(f'{response["statusCode"]} {response["body"]}')
        round_robin = (round_robin + 1) % num_proxies

More detail on how to use it can be found at the following link - https://medium.com/nerd-for-tech/web-scraping-with-a-proxy-pool-the-cheap-way-4c7d6fc9f859

The article also explains how to use selenium with Lambda using a layer.
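For completeness, once you have a working proxy endpoint, pointing selenium at it is just a Chrome flag, as the asker notes in the comments. A minimal sketch (the proxy URL in the usage comments is a hypothetical placeholder):

```python
def proxy_chrome_args(proxy_url):
    # Chrome flags that route all webdriver traffic through the given
    # proxy and run headless (typical for EC2 without a display).
    return [f"--proxy-server={proxy_url}", "--headless=new"]

# Usage (requires selenium and a chromedriver on PATH; proxy URL is a placeholder):
# from selenium import webdriver
# options = webdriver.ChromeOptions()
# for arg in proxy_chrome_args("http://proxy.example.com:8080"):
#     options.add_argument(arg)
# driver = webdriver.Chrome(options=options)
# driver.get("https://ipinfo.io/ip")  # page should show the proxy's IP, not EC2's
# driver.quit()
```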

vaquar khan