
EC2 instance can't connect to websites

I created an EC2 instance for web scraping. However, I can't scrape any sites with Selenium because I get the error below:

`selenium.common.exceptions.TimeoutException: Message: connection refused`

I think this has to do with the security group settings blocking websites. So I created a new security group according to this. However, after doing this, I am no longer able to SSH into the EC2 instance.

What configuration do I need for my EC2 instance to be able to scrape websites?

To allow SSH access, you need to modify the security group as in the image below:

[Image: SSH security group rules]

To allow HTTP (port 80) or HTTPS (port 443) access, you need to add rules like this:

[Image: HTTP or HTTPS rules]

Check whether these two rules are enabled. These are all inbound rules.
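The rules above can be sketched as boto3-style `IpPermissions` entries. This is a sketch, assuming the `boto3` library; the security group ID is a placeholder, and in practice you should restrict the SSH source to your own IP rather than `0.0.0.0/0`:

```python
# Inbound rules described above, as boto3-style IpPermissions dicts.
# The 0.0.0.0/0 source ranges are placeholders; restrict SSH in practice.
ssh_rule = {
    "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
    "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "SSH"}],
}
web_rules = [
    {"IpProtocol": "tcp", "FromPort": p, "ToPort": p,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
    for p in (80, 443)  # HTTP, HTTPS
]

# To apply them (requires AWS credentials configured):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(
#     GroupId="sg-0123456789abcdef0",  # placeholder group ID
#     IpPermissions=[ssh_rule, *web_rules],
# )
```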

I will assume that you are using Selenium on an Amazon EC2 instance.

Your Inbound security group settings are irrelevant for Selenium, but presumably you will want to log in to the instance. Thus, your Inbound security group should permit port 22 (for Linux SSH) or port 3389 (for Windows RDP).

To permit the Selenium app on the instance to access the Internet, you could use the default "Allow all" setting for the Outbound security group: All Traffic, all ports, Destination = 0.0.0.0/0

It is possible that the websites you are attempting to scrape have blocked the IP address range of Amazon EC2 instances. (Always operate according to a website's conditions of use!) You can test this by connecting to the Amazon EC2 instance and then trying to retrieve some websites, such as:

curl www.google.com

The contents of an HTML page should be returned.

Then, try it with one of the websites you are intending to scrape to verify that the instance can access that site.
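If you'd rather run the same reachability check from Python (closer to where Selenium runs), a minimal sketch using only the standard library is below; `can_reach` is a hypothetical helper, not part of Selenium:

```python
import socket

def can_reach(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check HTTPS reachability of a target site from the instance.
# print(can_reach("www.google.com", 443))
```

If this returns False for the target site but True for others, the site is likely blocking the instance's IP range rather than your security group being at fault.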
