![](/img/trans.png)
[英]Looping through a list of urls for web scraping with BeautifulSoup
[英]How to loop through a list of urls for web scraping with BeautifulSoup
有誰知道Beautifulsoup會從同一網站上抓取網址列表嗎? 列表= ['url1','url2','url3'...]
============================================ ========================
我的代碼提取網址列表:
url = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=2'
url1 = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=3'
url2 = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=4'
r = requests.get(url)
r1 = requests.get(url1)
r2 = requests.get(url2)
data = r.text
soup = BeautifulSoup(data, 'lxml')
links = []
for link in soup.find_all('a', {'class': 'title_text'}):
links.append(link.get('href'))
data1 = r1.text
soup = BeautifulSoup(data1, 'lxml')
for link in soup.find_all('a', {'class': 'title_text'}):
links.append(link.get('href'))
data2 = r2.text
soup = BeautifulSoup(data2, 'lxml')
for link in soup.find_all('a', {'class': 'title_text'}):
links.append(link.get('href'))
new = ['http://www.hkjc.com/chinese/racing/']*1123
url_list = ['{}{}'.format(x,y) for x,y in zip(new,links)]
從單個頁面的URL中提取的代碼:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'myurl'
r = requests.get(myurl)
r.encoding = 'utf-8'
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.findAll('tr')[27].findAll('td')
column_headers = [th.getText() for th in
soup.findAll('tr')[27].findAll('td')]
data_rows =soup.findAll('tr')[29:67]
data_rows
player_data = [[td.getText() for td in data_rows[i].findAll('td', {'class':['htable_text', 'htable_eng_text']})]
for i in range(len(data_rows))]
player_data_02 = []
for i in range(len(data_rows)):
player_row = []
for td in data_rows[i].findAll('td'):
player_row.append(td.getText())
player_data_02.append(player_row)
df = pd.DataFrame(player_data, columns=column_headers[:18])
根據您的鏈接子集,表數據的收集如下:
from bs4 import BeautifulSoup as BS
import requests
import pandas as pd
url_list = ['http://www.hkjc.com/english/racing/horse.asp?HorseNo=S217','http://www.hkjc.com/english/racing/horse.asp?HorseNo=A093','http://www.hkjc.com/english/racing/horse.asp?HorseNo=V344','http://www.hkjc.com/english/racing/horse.asp?HorseNo=V077', 'http://www.hkjc.com/english/racing/horse.asp?HorseNo=P361', 'http://www.hkjc.com/english/racing/horse.asp?HorseNo=T103']
for link in url_list:
r = requests.get(link)
r.encoding = 'utf-8'
html_content = r.text
soup = BS(html_content, 'lxml')
table = soup.find('table', class_='bigborder')
if not table:
continue
trs = table.find_all('tr')
if not trs:
continue #if trs are not found, then starting next iteration with other link
headers = trs[0]
headers_list=[]
for td in headers.find_all('td'):
headers_list.append(td.text)
headers_list+=['Season']
headers_list.insert(19,'pseudocol1')
headers_list.insert(20,'pseudocol2')
headers_list.insert(21,'pseudocol3')
res=[]
row = []
season = ''
for tr in trs[1:]:
if 'Season' in tr.text:
season = tr.text
else:
tds = tr.find_all('td')
for td in tds:
row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip()) #clean data
row.append(season.strip())
res.append(row)
row=[]
res = [i for i in res if i[0]!='']
df=pd.DataFrame(res, columns=headers_list)
del df['pseudocol1'],df['pseudocol2'],df['pseudocol3']
del df['VideoReplay']
df.to_csv('/home/username/'+str(url_list.index(link))+'.csv')
如果要將所有表中的數據存儲到一個數據幀中,則需要進行以下修改:
from bs4 import BeautifulSoup as BS
import requests
import pandas as pd
url_list = ['http://www.hkjc.com/english/racing/horse.asp?HorseNo=S217','http://www.hkjc.com/english/racing/horse.asp?HorseNo=A093','http://www.hkjc.com/english/racing/horse.asp?HorseNo=V344','http://www.hkjc.com/english/racing/horse.asp?HorseNo=V077', 'http://www.hkjc.com/english/racing/horse.asp?HorseNo=P361', 'http://www.hkjc.com/english/racing/horse.asp?HorseNo=T103']
res=[] #placing res outside of loop
for link in url_list:
r = requests.get(link)
r.encoding = 'utf-8'
html_content = r.text
soup = BS(html_content, 'lxml')
table = soup.find('table', class_='bigborder')
if not table:
continue
trs = table.find_all('tr')
if not trs:
continue #if trs are not found, then starting next iteration with other link
headers = trs[0]
headers_list=[]
for td in headers.find_all('td'):
headers_list.append(td.text)
headers_list+=['Season']
headers_list.insert(19,'pseudocol1')
headers_list.insert(20,'pseudocol2')
headers_list.insert(21,'pseudocol3')
row = []
season = ''
for tr in trs[1:]:
if 'Season' in tr.text:
season = tr.text
else:
tds = tr.find_all('td')
for td in tds:
row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip())
row.append(season.strip())
res.append(row)
row=[]
res = [i for i in res if i[0]!=''] #outside of loop
df=pd.DataFrame(res, columns=headers_list) #outside of loop
del df['pseudocol1'],df['pseudocol2'],df['pseudocol3']
del df['VideoReplay']
df.to_csv('/home/Username/'+'tables.csv') #outside of loop
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.