I want to scrape some websites with selenium.

I can successfully run the selenium script on EC2, but, as we know, an EC2 instance is associated with a fixed public IP. So I want to integrate an Amazon API Gateway rotating proxy into my Python selenium script.

I read this SO question that integrates AWS API Gateway with the Python requests module. How can I achieve a similar integration with selenium?

Any guide would be appreciated.

Praful Bagai
  • Just to be clear: your need is to run your selenium script each time with a different IP so the remote scraped site is less likely to block your ip? – Mattia Galati Aug 30 '22 at 10:33
  • That's right. FYI - I'm aware of how to pass public proxies into selenium webdriver. But the downside of using public proxies is slowness & uptime. Hence I want to use AWS API-gateway to accomplish the same. – Praful Bagai Aug 30 '22 at 12:23
  • 2
    First of all, keep in mind that intensive scraping could stress target website and could lead to downtimes. If a website uses IP throttling or blocking I suppose they don't want to be scraped so frequently or at all and you should respect their work. Before proceed please think about it. That said, I suggest a different approach: 1) create an AMI of your EC2, 2) each time you need to run the script, start a SPOT instance base on that AMI, 3) Destroy the spot instance when the script has finished. In this way you will have a different public IP for each instance. – Mattia Galati Aug 31 '22 at 08:27
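The spot-instance approach from the last comment could be sketched with boto3 roughly as follows. This is an untested sketch, not the commenter's exact setup; the AMI ID and instance type in the usage comments are hypothetical placeholders:

```python
def spot_run_params(ami_id, instance_type="t3.micro"):
    # Parameters for ec2.run_instances(): launch a single one-time spot
    # instance from the AMI baked from the original EC2 box. Each launch
    # gets a fresh public IP.
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    }

# Usage (requires boto3 and AWS credentials; the AMI ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2")
# instance = ec2.run_instances(**spot_run_params("ami-0123456789abcdef0"))["Instances"][0]
# ... wait for it to boot, run the scraper on it, then ...
# ec2.terminate_instances(InstanceIds=[instance["InstanceId"]])
```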

1 Answer

You can try the following utility:

  • https://github.com/teticio/lambda-scraper

    import json
    import boto3

    num_proxies = 10  # number of proxy-N Lambda functions deployed
    url = 'https://ipinfo.io/ip'

    lambda_client = boto3.client('lambda')
    round_robin = 0
    while True:
        # Invoke the proxy Lambdas in round-robin order; each function
        # makes its request from a different IP
        response = json.loads(
            lambda_client.invoke(FunctionName=f'proxy-{round_robin}',
                                 InvocationType='RequestResponse',
                                 Payload=json.dumps({"url": url}))['Payload'].read())
        print(f'{response["statusCode"]} {response["body"]}')
        round_robin = (round_robin + 1) % num_proxies

More detail on how to use it can be found at the following link - https://medium.com/nerd-for-tech/web-scraping-with-a-proxy-pool-the-cheap-way-4c7d6fc9f859

The article also explains how to use selenium with Lambda using a layer.
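For completeness, once you have a working proxy endpoint, pointing selenium at it is just a Chrome flag, as the asker notes in the comments. A minimal sketch (the proxy URL in the usage comments is a hypothetical placeholder):

```python
def proxy_chrome_args(proxy_url):
    # Chrome flags that route all webdriver traffic through the given
    # proxy and run headless (typical for EC2 without a display).
    return [f"--proxy-server={proxy_url}", "--headless=new"]

# Usage (requires selenium and a chromedriver on PATH; proxy URL is a placeholder):
# from selenium import webdriver
# options = webdriver.ChromeOptions()
# for arg in proxy_chrome_args("http://proxy.example.com:8080"):
#     options.add_argument(arg)
# driver = webdriver.Chrome(options=options)
# driver.get("https://ipinfo.io/ip")  # page should show the proxy's IP, not EC2's
# driver.quit()
```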

vaquar khan