Extracting data between br tags with BeautifulSoup

Question

How can I extract INFO1 and INFO2 from the following HTML with BeautifulSoup?

Website: https://swisswrestling.ch/wrestlers?id=91

I am also extracting the data for the fights, and that part already works for me with bs4.

    <tr>
<td>
<span class="text-danger font-weight-bold"><br/>
<b>Notice</b>:  Undefined index: HTTP_ACCEPT_LANGUAGE in <b>/srv/www/chroot/site05/web/app/bootstrap.php</b> on line <b>159</b><br/>
wrestlers_club</span> INFO1<br/>
<span class="text-danger font-weight-bold">wrestlers_birthday</span> INFO2<br/><br/>
<span class="text-danger font-weight-bold">wrestlers_licence_number</span> wrestlers_licence_no_licence<br/>
<span class="text-danger font-weight-bold">wrestlers_club_dl</span> wrestlers_licence_no_dl                                                
                    </td>

My code currently looks like this:

    info = soup.find('div', id='content')
    club = info.find_all('span')
    for clubs in club:
        test = clubs.text
        print(test)

And the result is:

    Notice:  Undefined index: HTTP_ACCEPT_LANGUAGE in /srv/www/chroot/site05/web/app/bootstrap.php on line 159
    wrestlers_club
    wrestlers_birthday
    wrestlers_licence_number
    wrestlers_club_dl

How can I extract the data that follows wrestlers_club (INFO1) and wrestlers_birthday (INFO2)?

Thanks for your help!

Answer 1

Use the following CSS selectors together with find_next_sibling(text=True). Each value is a bare text node that sits right after its label span, so it is a sibling of the span rather than part of it:

    import requests
    from bs4 import BeautifulSoup

    res=requests.get("https://swisswrestling.ch/wrestlers?id=91")
    soup=BeautifulSoup(res.text,'lxml')
    print(soup.select_one('span.text-danger:nth-of-type(1)').find_next_sibling(text=True).strip())
    print(soup.select_one('span.text-danger:nth-of-type(2)').find_next_sibling(text=True).strip())

Output:

RC Willisau Lions
29 (1990)
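
To see why the sibling lookup works without hitting the live site, here is a minimal offline sketch. It runs against the HTML fragment from the question wrapped in a `<table>`, and it looks the span up by its label text instead of by position. The helper name `value_after_label` is made up for this illustration, and `string=True` is the newer spelling of `text=True`.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (INFO1 / INFO2 are placeholders)
HTML = """
<table><tr><td>
<span class="text-danger font-weight-bold"><br/>
<b>Notice</b>:  Undefined index: HTTP_ACCEPT_LANGUAGE in <b>/srv/www/chroot/site05/web/app/bootstrap.php</b> on line <b>159</b><br/>
wrestlers_club</span> INFO1<br/>
<span class="text-danger font-weight-bold">wrestlers_birthday</span> INFO2<br/><br/>
<span class="text-danger font-weight-bold">wrestlers_licence_number</span> wrestlers_licence_no_licence<br/>
<span class="text-danger font-weight-bold">wrestlers_club_dl</span> wrestlers_licence_no_dl
</td></tr></table>
"""


def value_after_label(soup, label):
    """Return the text node that directly follows the span ending in `label`.

    Hypothetical helper for this example, not part of BeautifulSoup.
    """
    for span in soup.select("span.text-danger"):
        # The first span also contains the PHP notice, so compare the end only
        if span.get_text().strip().endswith(label):
            sibling = span.find_next_sibling(string=True)
            return sibling.strip() if sibling else None
    return None


soup = BeautifulSoup(HTML, "html.parser")
club = value_after_label(soup, "wrestlers_club")
birthday = value_after_label(soup, "wrestlers_birthday")
print(club)
print(birthday)
```

Matching on the label is a design choice for this sketch: it keeps working if the order of the spans changes, whereas `:nth-of-type(n)` depends on their position.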

Answer 2

Does this work for you? I used the link you posted, and it extracts the required information.

    import re

    import requests
    from bs4 import BeautifulSoup as bs

    url = 'https://swisswrestling.ch/wrestlers?id=91'

    # Capture whatever follows each label, up to the end of the line
    regex = r"(wrestlers_club|wrestlers_birthday) (.*)\n"

    strx = requests.get(url).text
    soup = bs(strx, 'lxml')

    # Search the text of every table cell for label/value pairs
    for i in soup.find_all('td'):
        print(*re.findall(regex, i.text), sep="\n")

Output:

('wrestlers_club', 'RC Willisau Lions')
('wrestlers_birthday', '29 (1990)')
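
The regex approach can be tried offline as well. The sketch below is only an illustration: it uses a shortened version of the question's HTML (without the PHP notice) and collects the matches into a dict so the values can be looked up by label. The variable names are assumptions made for this example.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample based on the question's HTML (INFO1 / INFO2 are placeholders)
HTML = """
<table><tr><td>
<span class="text-danger font-weight-bold">wrestlers_club</span> INFO1<br/>
<span class="text-danger font-weight-bold">wrestlers_birthday</span> INFO2<br/><br/>
<span class="text-danger font-weight-bold">wrestlers_club_dl</span> wrestlers_licence_no_dl
</td></tr></table>
"""

# Same pattern as in the answer: a label, one space, then the rest of the line
PATTERN = r"(wrestlers_club|wrestlers_birthday) (.*)\n"

soup = BeautifulSoup(HTML, "html.parser")
cell_text = soup.find("td").get_text()

# findall returns (label, value) tuples, which dict() turns into a mapping
pairs = {label: value.strip() for label, value in re.findall(PATTERN, cell_text)}
print(pairs)
```

The space after the label group is what keeps `wrestlers_club_dl` from matching as `wrestlers_club`.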

Answer 3

You can try this; it gave me the result you requested:

    # `soup` is the BeautifulSoup object for the page, as in the question
    info = soup.find('div', id='content')
    club = info.select('#content-element-94 > div.row.border.mx-0 > div > table > tbody > tr > td:nth-child(1)')
    for clubs in club:
        test = clubs.text
        print(test)

Output:

    wrestlers_club RC Willisau Lions
    wrestlers_birthday 29 (1990)
    wrestlers_licence_number wrestlers_licence_no_licence
    wrestlers_club_dl wrestlers_licence_no_dl
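
This approach returns the whole cell as one block of text, so the individual values still have to be split out. A possible follow-up step, sketched under the assumption that every line of interest starts with `wrestlers_` and separates label and value with a single space, is shown below. The sample HTML reuses the values from the output above, and `parse_cell` is a hypothetical helper written for this example.

```python
from bs4 import BeautifulSoup

# Sample cell modelled on the question's HTML, with values from the output above
HTML = """
<table><tr><td>
<span class="text-danger font-weight-bold"><br/>
<b>Notice</b>:  Undefined index: HTTP_ACCEPT_LANGUAGE in <b>/srv/www/chroot/site05/web/app/bootstrap.php</b> on line <b>159</b><br/>
wrestlers_club</span> RC Willisau Lions<br/>
<span class="text-danger font-weight-bold">wrestlers_birthday</span> 29 (1990)<br/><br/>
<span class="text-danger font-weight-bold">wrestlers_licence_number</span> wrestlers_licence_no_licence<br/>
<span class="text-danger font-weight-bold">wrestlers_club_dl</span> wrestlers_licence_no_dl
</td></tr></table>
"""


def parse_cell(cell_text, prefix="wrestlers_"):
    """Split 'label value' lines into a dict, skipping lines without the prefix."""
    fields = {}
    for line in cell_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # drops blank lines and the PHP notice
        # Split at the first space only, so values may contain spaces
        label, _, value = line.partition(" ")
        fields[label] = value.strip()
    return fields


soup = BeautifulSoup(HTML, "html.parser")
fields = parse_cell(soup.find("td").get_text())
print(fields["wrestlers_club"])
print(fields["wrestlers_birthday"])
```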

3 answers

Answer 1
2, accepted, 2020-02-04 15:54:14

Answer 2
1, 2020-02-04 14:44:40

Answer 3
1, 2020-02-04 14:50:05
