Extract data between br tags with BeautifulSoup
How can I use BeautifulSoup to extract INFO1 and INFO2 from the following HTML?
Website: https://swisswrestling.ch/wrestlers?id=91
I also tried extracting the data for the fights, and that works for me with bs4.
<tr>
<td>
<span class="text-danger font-weight-bold"><br/>
<b>Notice</b>: Undefined index: HTTP_ACCEPT_LANGUAGE in <b>/srv/www/chroot/site05/web/app/bootstrap.php</b> on line <b>159</b><br/>
wrestlers_club</span> INFO1<br/>
<span class="text-danger font-weight-bold">wrestlers_birthday</span> INFO2<br/><br/>
<span class="text-danger font-weight-bold">wrestlers_licence_number</span> wrestlers_licence_no_licence<br/>
<span class="text-danger font-weight-bold">wrestlers_club_dl</span> wrestlers_licence_no_dl
</td>
My code currently looks like this:
info = soup.find('div', id='content')
club = info.findAll('span')
for clubs in club:
    test = clubs.text
    print(test)
The result is:
Notice: Undefined index: HTTP_ACCEPT_LANGUAGE in /srv/www/chroot/site05/web/app/bootstrap.php on line 159
wrestlers_club
wrestlers_birthday
wrestlers_licence_number
wrestlers_club_dl
How can I extract the data that follows wrestlers_club (INFO1) and wrestlers_birthday (INFO2)?
Thanks for your help!
Use the following CSS selector together with find_next_sibling(text=True):
import requests
from bs4 import BeautifulSoup
res=requests.get("https://swisswrestling.ch/wrestlers?id=91")
soup=BeautifulSoup(res.text,'lxml')
print(soup.select_one('span.text-danger:nth-of-type(1)').find_next_sibling(text=True).strip())
print(soup.select_one('span.text-danger:nth-of-type(2)').find_next_sibling(text=True).strip())
Output:
RC Willisau Lions
29 (1990)
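The same idea can be tried offline against the snippet from the question, which also shows how to collect every label at once. This is only a sketch under a few assumptions: the `<table>` wrapper, the `extract_fields` helper, and the use of the built-in `html.parser` are mine, not part of the original answer, and `string=True` is the newer name for the `text=True` argument used above.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; the <table> wrapper is added here only
# so that the fragment is a complete table. INFO1/INFO2 are placeholders.
HTML = """
<table><tr>
<td>
<span class="text-danger font-weight-bold"><br/>
<b>Notice</b>: Undefined index: HTTP_ACCEPT_LANGUAGE in <b>/srv/www/chroot/site05/web/app/bootstrap.php</b> on line <b>159</b><br/>
wrestlers_club</span> INFO1<br/>
<span class="text-danger font-weight-bold">wrestlers_birthday</span> INFO2<br/><br/>
<span class="text-danger font-weight-bold">wrestlers_licence_number</span> wrestlers_licence_no_licence<br/>
<span class="text-danger font-weight-bold">wrestlers_club_dl</span> wrestlers_licence_no_dl
</td>
</tr></table>
"""


def extract_fields(html):
    """Map each label span to the text node that directly follows it."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for span in soup.find_all("span", class_="text-danger"):
        # The first span also wraps a PHP notice, so the label is taken
        # from the last line of the span's text.
        label = span.get_text().strip().splitlines()[-1].strip()
        # The value is the first text node among the span's next siblings.
        value = span.find_next_sibling(string=True)
        fields[label] = value.strip() if value else ""
    return fields


fields = extract_fields(HTML)
print(fields["wrestlers_club"])      # INFO1
print(fields["wrestlers_birthday"])  # INFO2
```

Looking labels up by name avoids depending on the position of each span, which `nth-of-type` does.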
Does this work for you? I used the link you posted, and it extracts the required information.
import requests
import re
import lxml
import ssl
from bs4 import BeautifulSoup as bs
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://swisswrestling.ch/wrestlers?id=91'
strx = requests.get(url).text
regex = r"(wrestlers_club|wrestlers_birthday) (.*)\n"
soup = bs(strx, 'lxml')
for i in soup.find_all('td'):
    print(*re.findall(regex, i.text), sep="\n")
Output:
('wrestlers_club', 'RC Willisau Lions')
('wrestlers_birthday', '29 (1990)')
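One caveat with the pattern above: because it ends in `\n`, a label on the very last line of the text would be missed if no newline follows it. A possible variant, sketched here on the cell text shown in this thread rather than on a live request, anchors each line with `re.MULTILINE` instead. The names `CELL_TEXT` and `PATTERN` are illustrative only.

```python
import re

# Cell text as printed in this thread; on the live page it would come
# from the td element's .text attribute.
CELL_TEXT = (
    "wrestlers_club RC Willisau Lions\n"
    "wrestlers_birthday 29 (1990)\n"
    "wrestlers_licence_number wrestlers_licence_no_licence\n"
    "wrestlers_club_dl wrestlers_licence_no_dl"
)

# ^ and $ match at line boundaries under re.MULTILINE, so no trailing
# newline is needed. The space after the label keeps wrestlers_club_dl
# from matching the wrestlers_club alternative.
PATTERN = re.compile(r"^(wrestlers_club|wrestlers_birthday) (.*)$", re.MULTILINE)

matches = dict(PATTERN.findall(CELL_TEXT))
print(matches["wrestlers_club"])      # RC Willisau Lions
print(matches["wrestlers_birthday"])  # 29 (1990)
```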
You can try this; it gave me the result you asked for:
info = soup.find('div', id='content')
club = info.select('#content-element-94 > div.row.border.mx-0 > div > table > tbody > tr > td:nth-child(1)')
for clubs in club:
    test = clubs.text
    print(test)
Output:
wrestlers_club RC Willisau Lions
wrestlers_birthday 29 (1990)
wrestlers_licence_number wrestlers_licence_no_licence
wrestlers_club_dl wrestlers_licence_no_dl
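Since this prints the whole cell, the two requested values still have to be picked out of the text. A minimal sketch of one way to do that, assuming every line has the form "label value" as in the output above (`parse_cell` is a hypothetical helper, not part of bs4):

```python
def parse_cell(text):
    """Split each 'label value' line at its first space."""
    fields = {}
    for line in text.splitlines():
        label, _, value = line.strip().partition(" ")
        if label:  # skip blank lines
            fields[label] = value
    return fields


# Cell text as printed above.
cell_text = """wrestlers_club RC Willisau Lions
wrestlers_birthday 29 (1990)
wrestlers_licence_number wrestlers_licence_no_licence
wrestlers_club_dl wrestlers_licence_no_dl"""

cell_fields = parse_cell(cell_text)
print(cell_fields["wrestlers_club"])      # RC Willisau Lions
print(cell_fields["wrestlers_birthday"])  # 29 (1990)
```

If the PHP notice line is present in the cell text, it would end up in the dictionary under the key "Notice:", which can simply be ignored.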