I created an EC2 instance for web scraping. However, I can't scrape any sites with Selenium because I get the error below:
"selenium.common.exceptions.TimeoutException: Message: connection refused"
I think this has to do with the security group settings blocking access to websites, so I created a new security group according to this. However, after doing so, I can no longer SSH into the EC2 instance.
What configuration do I need for my EC2 instance to be able to scrape websites?
I will assume that you are using Selenium on an Amazon EC2 instance.
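Your Inbound security group settings are irrelevant for Selenium, but presumably you will want to log in to the instance. Thus, your Inbound security group rules should permit port 22 (SSH, for Linux) or port 3389 (RDP, for Windows). If the new security group lacks such a rule, that would explain why you can no longer SSH into the instance.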
To permit the Selenium app on the instance to access the Internet, you can use the default "Allow all" Outbound rule: All Traffic, all ports, Destination = 0.0.0.0/0.
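As a rough sketch of the two rules described above, assuming you manage the security group with the boto3 library and already have AWS credentials configured (the question does not say how the group was created), the rules could be expressed as follows. The helper names, the group ID, and the CIDR block are placeholders for illustration, not values from the question.

```python
def ssh_ingress_rule(cidr):
    """Inbound rule: allow SSH (TCP port 22) from the given CIDR block."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": cidr, "Description": "SSH access"}],
    }


def allow_all_egress_rule():
    """Outbound rule: all protocols, all ports, destination 0.0.0.0/0."""
    return {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}


def apply_rules(group_id, my_cidr):
    """Attach both rules to an existing security group.

    Assumes boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported here so the rule builders work without boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=[ssh_ingress_rule(my_cidr)]
    )
    # A new security group usually already has an allow-all outbound rule;
    # in that case this call fails with a duplicate-rule error and can be skipped.
    ec2.authorize_security_group_egress(
        GroupId=group_id, IpPermissions=[allow_all_egress_rule()]
    )
```

A call might look like `apply_rules("sg-0123456789abcdef0", "203.0.113.10/32")`, where both arguments are hypothetical. Restricting the SSH rule to your own IP address rather than 0.0.0.0/0 is generally the safer choice.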
It is possible that the websites you are attempting to scrape have blocked the IP address range of Amazon EC2 instances. (Always operate according to a website's conditions of use!) You can test this by connecting to the Amazon EC2 instance and then trying to retrieve some websites, such as:
curl www.google.com
The contents of an HTML page should be returned.
Then, try it with one of the websites you are intending to scrape to verify that the instance can access that site.
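If you would rather run the same check from Python on the instance, here is a minimal sketch using only the standard library. The function name `can_reach` and the User-Agent string are made up for this example. It separates "the site answered, possibly with an error status" from "nothing came back at all", which helps tell a site that blocks you from a network that cannot get out.

```python
import urllib.error
import urllib.request


def can_reach(url, timeout=10):
    """Return (ok, detail); ok is True if the server sent any HTTP response."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "connectivity-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The site answered, so the network path works; a 403 or similar
        # status may mean the site is rejecting this client or IP range.
        return True, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, or timeout: no response arrived.
        return False, str(exc)
```

For example, `print(can_reach("https://www.google.com"))` followed by the same call for each site you intend to scrape. If the first call succeeds and a target site fails or returns an error status, the security group is probably not the cause.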