How to loop through a list of urls for web scraping with BeautifulSoup

Does anyone know how to scrape a list of URLs from the same website with BeautifulSoup? list = ['url1', 'url2', 'url3'...]
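In outline, what is being asked for is just a for loop over the list, fetching and parsing each page in turn. A minimal sketch using the placeholder URLs from the question:

import requests
from bs4 import BeautifulSoup

url_list = ['url1', 'url2', 'url3']  # placeholders from the question

for url in url_list:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # ... extract what you need from each page's soup here ...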

==========================================================================

My code to extract the list of URLs:

url = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=2'
url1 = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=3'
url2 = 'http://www.hkjc.com/chinese/racing/selecthorsebychar.asp?ordertype=4'

r  = requests.get(url)
r1  = requests.get(url1)
r2  = requests.get(url2)

data = r.text
soup = BeautifulSoup(data, 'lxml')
links = []

for link in soup.find_all('a', {'class': 'title_text'}):
    links.append(link.get('href'))

data1 = r1.text

soup = BeautifulSoup(data1, 'lxml')

for link in soup.find_all('a', {'class': 'title_text'}):
    links.append(link.get('href'))

data2 = r2.text

soup = BeautifulSoup(data2, 'lxml')

for link in soup.find_all('a', {'class': 'title_text'}):
    links.append(link.get('href'))

new = ['http://www.hkjc.com/chinese/racing/']*1123

url_list = ['{}{}'.format(x,y) for x,y in zip(new,links)]
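The three near-identical request/parse blocks above can be collapsed into a single loop. A minimal sketch, assuming every href is relative to http://www.hkjc.com/chinese/racing/ (which the hard-coded new list implies):

import requests
from bs4 import BeautifulSoup

BASE = 'http://www.hkjc.com/chinese/racing/'
links = []

# one loop replaces the three copies of the request/parse block
for ordertype in (2, 3, 4):
    r = requests.get('{}selecthorsebychar.asp?ordertype={}'.format(BASE, ordertype))
    soup = BeautifulSoup(r.text, 'lxml')
    for link in soup.find_all('a', {'class': 'title_text'}):
        links.append(link.get('href'))

# prefix each relative href with the base path; no need to zip against
# 1123 copies of the same string
url_list = [BASE + href for href in links]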

My code to extract from a single URL's page:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'myurl'  # placeholder for one of the pages in url_list

r = requests.get(url)

r.encoding = 'utf-8'

html_content = r.text

soup = BeautifulSoup(html_content, 'lxml')

soup.findAll('tr')[27].findAll('td')  # inspect the header row (row 27) first

column_headers = [th.getText() for th in
                  soup.findAll('tr')[27].findAll('td')]

data_rows = soup.findAll('tr')[29:67]  # rows 29-66 hold the table data

player_data = [[td.getText() for td in data_rows[i].findAll('td', {'class': ['htable_text', 'htable_eng_text']})]
               for i in range(len(data_rows))]

player_data_02 = []  # a second pass that keeps every cell, without the class filter

for i in range(len(data_rows)):
    player_row = []

    for td in data_rows[i].findAll('td'):
        player_row.append(td.getText())

    player_data_02.append(player_row)

df = pd.DataFrame(player_data, columns=column_headers[:18])
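To run this single-page extraction over every URL, the same steps can be wrapped in a loop and the per-page frames concatenated at the end. A sketch, assuming the url_list built in the first snippet and that every page shares the sample page's layout (headers in row 27, data in rows 29-66, 18 data columns):

import requests
import pandas as pd
from bs4 import BeautifulSoup

frames = []
for url in url_list:
    r = requests.get(url)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'lxml')
    rows = soup.findAll('tr')
    if len(rows) < 67:
        continue  # page layout differs from the sample; skip it
    column_headers = [td.getText() for td in rows[27].findAll('td')]
    data_rows = rows[29:67]
    player_data = [[td.getText() for td in row.findAll('td', {'class': ['htable_text', 'htable_eng_text']})]
                   for row in data_rows]
    frames.append(pd.DataFrame(player_data, columns=column_headers[:18]))

df = pd.concat(frames, ignore_index=True)  # one DataFrame for all pages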

Based on a subset of your links, collecting the table data goes like this:

from bs4 import BeautifulSoup as BS
import requests  
import pandas as pd  

url_list = ['http://www.hkjc.com/english/racing/horse.asp?HorseNo=S217',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=A093',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V344',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V077',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=P361',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=T103']


for link in url_list:
    r = requests.get(link)
    r.encoding = 'utf-8'

    html_content = r.text
    soup = BS(html_content, 'lxml')


    table = soup.find('table', class_='bigborder')
    if not table:
        continue

    trs = table.find_all('tr')

    if not trs:
        continue  # if no rows were found, move on to the next link

    headers = trs[0]  # the first row holds the column headers
    headers_list = []
    for td in headers.find_all('td'):
        headers_list.append(td.text)
    headers_list += ['Season']
    # the data rows carry three cells with no matching header,
    # so insert placeholder names that are dropped again below
    headers_list.insert(19, 'pseudocol1')
    headers_list.insert(20, 'pseudocol2')
    headers_list.insert(21, 'pseudocol3')

    res=[]
    row = []
    season = ''
    for tr in trs[1:]:
        if 'Season' in tr.text:
            season = tr.text

        else:
            tds = tr.find_all('td')
            for td in tds:
                row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip()) #clean data
            row.append(season.strip())
            res.append(row)
            row=[]

    res = [i for i in res if i[0] != '']  # drop empty rows

    df = pd.DataFrame(res, columns=headers_list)
    del df['pseudocol1'], df['pseudocol2'], df['pseudocol3']  # remove the placeholders
    del df['VideoReplay']

    df.to_csv('/home/username/' + str(url_list.index(link)) + '.csv')  # one CSV per link
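As a side note, url_list.index(link) rescans the list on every iteration. If you would rather name each file after the horse itself, one option (an illustrative sketch, not part of the original answer) is to take the HorseNo query parameter from the link:

from urllib.parse import urlparse, parse_qs

# e.g. '...horse.asp?HorseNo=S217' -> 'S217.csv'
horse_no = parse_qs(urlparse(link).query)['HorseNo'][0]
df.to_csv('/home/username/{}.csv'.format(horse_no))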

If you want to store the data from all the tables in one DataFrame, this little modification will do the trick:

from bs4 import BeautifulSoup as BS
import requests  
import pandas as pd  

url_list = ['http://www.hkjc.com/english/racing/horse.asp?HorseNo=S217',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=A093',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V344',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V077',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=P361',
            'http://www.hkjc.com/english/racing/horse.asp?HorseNo=T103']


res = []  # res now lives outside the loop, so rows accumulate across all links
for link in url_list:
    r = requests.get(link)
    r.encoding = 'utf-8'

    html_content = r.text
    soup = BS(html_content, 'lxml')


    table = soup.find('table', class_='bigborder')
    if not table:
        continue

    trs = table.find_all('tr')

    if not trs:
        continue  # if no rows were found, move on to the next link


    headers = trs[0]
    headers_list=[]
    for td in headers.find_all('td'):
        headers_list.append(td.text)
    headers_list += ['Season']
    headers_list.insert(19, 'pseudocol1')
    headers_list.insert(20, 'pseudocol2')
    headers_list.insert(21, 'pseudocol3')

    row = []
    season = ''
    for tr in trs[1:]:
        if 'Season' in tr.text:
            season = tr.text

        else:
            tds = tr.find_all('td')
            for td in tds:
                row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip())
            row.append(season.strip())
            res.append(row)
            row=[]

res = [i for i in res if i[0] != '']  # filtering happens once, outside the loop

df = pd.DataFrame(res, columns=headers_list)  # headers_list from the last page; all pages share one layout
del df['pseudocol1'], df['pseudocol2'], df['pseudocol3']
del df['VideoReplay']

df.to_csv('/home/Username/' + 'tables.csv')  # a single CSV holding data from every link
