Can't scrape all of the ul tags from a table
I'm trying to scrape all of the proxy IPs from this site: https://proxy-list.org/english/index.php, but I can only get one IP at most. Here is my code:
from helium import *
from bs4 import BeautifulSoup
url = 'https://proxy-list.org/english/index.php'
browser = start_chrome(url, headless=True)
soup = BeautifulSoup(browser.page_source, 'html.parser')
proxies = soup.find_all('div', {'class':'table'})
for ips in proxies:
    print(ips.find('li', {'class':'proxy'}).text)
I tried to use ips.find_all, but it didn't work.
from bs4 import BeautifulSoup
import requests
url = 'https://proxy-list.org/english/index.php'
pagecontent = requests.get(url)
# Parse the HTML text of the response returned by requests
soup = BeautifulSoup(pagecontent.text, 'html.parser')
maintable = soup.find_all('div', {'class':'table'})
for div_element in maintable:
    # find_all returns every matching <li>, not only the first one
    rows = div_element.find_all('li', class_='proxy')
    for ip in rows:
        print(ip.text)
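The single result in the question most likely comes from `find`, which returns only the first matching tag, while `find_all` returns every match. Here is a minimal sketch on a made-up HTML fragment; the markup and the addresses are illustrative assumptions, not the real page from the site:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one container div holding several li.proxy items.
SAMPLE_HTML = """
<div class="table">
  <ul><li class="proxy">10.0.0.1:8080</li></ul>
  <ul><li class="proxy">10.0.0.2:3128</li></ul>
  <ul><li class="proxy">10.0.0.3:80</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", class_="table")

# find() stops at the first match inside the container.
first_only = container.find("li", class_="proxy").text

# find_all() collects every match inside the container.
all_items = [li.text for li in container.find_all("li", class_="proxy")]

print(first_only)
print(all_items)
```

Because the outer loop in the question iterates over container divs rather than over list items, calling `find` inside it yields at most one item per container.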
If I understand your question correctly, the following is one way to fetch those proxies using the requests module and the BeautifulSoup library:
import re
import base64
import requests
from bs4 import BeautifulSoup
url = 'https://proxy-list.org/english/index.php'
def decode_proxy(target_str):
    # The addresses are Base64-encoded, so decode them back to text
    converted_proxy = base64.b64decode(target_str)
    return converted_proxy.decode()
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
# Each li.proxy is expected to contain a <script> with a Proxy('...') call
for tr in soup.select("#proxy-table li.proxy > script"):
    # Pull the encoded string out of the script text
    proxy_id = re.findall(r"Proxy[^']+(.*)\'",tr.contents[0])[0]
    print(decode_proxy(proxy_id))
First few results:
62.80.180.111:8080
68.183.221.156:38159
189.201.134.13:8080
178.60.201.44:8080
128.199.79.15:8080
139.59.78.193:8080
103.148.216.5:80
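The decoding step can be tried without any network access. The sketch below assumes each script body looks like `Proxy('<base64>')`, which is what the regular expression in the answer implies. The sample string is built locally rather than taken from the site, and `extract_proxy` and `PROXY_CALL` are hypothetical names introduced only for this illustration.

```python
import base64
import re

# Assumed script format: Proxy('<base64 of ip:port>')
PROXY_CALL = re.compile(r"Proxy\('([A-Za-z0-9+/=]+)'\)")

def extract_proxy(script_text):
    """Return the decoded ip:port from a Proxy('...') call, or None if absent."""
    match = PROXY_CALL.search(script_text)
    if match is None:
        return None
    return base64.b64decode(match.group(1)).decode()

# Build a sample script body locally (the exact page format is an assumption).
encoded = base64.b64encode(b"62.80.180.111:8080").decode()
sample_script = f"Proxy('{encoded}')"

print(extract_proxy(sample_script))
print(extract_proxy("no proxy here"))
```

Capturing only Base64 characters between the quotes keeps stray punctuation out of the decoder, and returning None lets the caller skip script tags that do not match instead of failing on an empty result list.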