
Extracting text around br tags with Beautiful Soup (Python)

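I am new to web scraping. I am trying to extract the address text "Tegelhof 1 33014 Bad Driburg" and "Tegelweg 2A 33014 Bad Driburg" from the HTML below, where the address lines are separated by br tags. The code below is what I have tried so far, but it does not give the result I want. Can someone help me with how to do this?

Code: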

address = soup.find('div', class_='col-sm-4 pt-2')

Full HTML source:

<div class="row">
    <div class="col-sm-5 py-2">
        <br/>
        <span style="color:#7fb7c4; font-weight:600;">Praxis jetzt geöffnet</span>
        <p class="mt-5 d-none d-md-block">Telefon: <a class="it" href="tel:+4952531717">0 52 53 / 17 17</a></p>
    </div>
    <!-- sm-5 end -->
    <div class="col-sm-4 pt-2">
        <!-- <img class="mapicons" src="https://www.tk-aerztefuehrer.de/TK/images/GoogleImages/A.png" alt=" " /><br>  -->
        <br/>
        <img alt=" " src="https://www.tk-aerztefuehrer.de/TK/img/entfernung.svg"/>  0.2 km<br/>
        <span class="pt-3 d-none d-md-block"></span>
        Tegelhof 1<br/>
        33014 Bad Driburg<br/>
    </div><!-- sm-4 end -->
    <div class="col-sm-3">
    </div><!-- sm-3 end -->
</div><!--   end row  -->
<div class="row">
    <div class="col-sm-5 py-2">
        <br/>
        <span style="color:#7fb7c4; font-weight:600;">Praxis jetzt geöffnet</span>
        <p class="mt-5 d-none d-md-block">Telefon: <a class="it" href="tel:+4952536565">0 52 53 / 65 65</a></p>
    </div><!-- sm-5 end -->
    <div class="col-sm-4 pt-2">
        <!-- <img class="mapicons" src="https://www.tk-aerztefuehrer.de/TK/images/GoogleImages/A.png" alt=" " /><br>  -->
        <br/>
        <img alt=" " src="https://www.tk-aerztefuehrer.de/TK/img/entfernung.svg"/>  0.2 km<br/>
        <span class="pt-3 d-none d-md-block"></span>
        Tegelweg 2A<br/>
        33014 Bad Driburg<br/>
    </div><!-- sm-4 end -->
    <div class="col-sm-3">
    </div><!-- sm-3 end -->
</div><!--   end row  -->
html_doc="""<div class="row">
 <div class="col-sm-5 py-2">
 <br/><span style="color:#7fb7c4; font-weight:600;">Praxis jetzt geöffnet</span>
 <p class="mt-5 d-none d-md-block">Telefon: <a class="it" href="tel:+4952531717">0 52 53 / 17 17</a></p>
 </div><!-- sm-5 end -->
 <div class="col-sm-4 pt-2">
 <!-- <img class="mapicons" src="https://www.tk-aerztefuehrer.de/TK/images/GoogleImages/A.png" alt=" " /><br>  -->
 <br/>
 <img alt=" " src="https://www.tk-aerztefuehrer.de/TK/img/entfernung.svg"/>  0.2 km<br/>
 <span class="pt-3 d-none d-md-block"></span>
 Tegelhof 1<br/>
 33014 Bad Driburg<br/>
 </div><!-- sm-4 end -->
 <div class="col-sm-3">
 </div><!-- sm-3 end -->
 </div><!--   end row  -->

 <div class="row">
 <div class="col-sm-5 py-2">
 <br/><span style="color:#7fb7c4; font-weight:600;">Praxis jetzt geöffnet</span>
 <p class="mt-5 d-none d-md-block">Telefon: <a class="it" href="tel:+4952536565">0 52 53 / 65 65</a></p>
 </div><!-- sm-5 end -->
 <div class="col-sm-4 pt-2">
 <!-- <img class="mapicons" src="https://www.tk-aerztefuehrer.de/TK/images/GoogleImages/A.png" alt=" " /><br>  -->
 <br/>
 <img alt=" " src="https://www.tk-aerztefuehrer.de/TK/img/entfernung.svg"/>  0.2 km<br/>
 <span class="pt-3 d-none d-md-block"></span>
 Tegelweg 2A<br/>
 33014 Bad Driburg<br/>
 </div><!-- sm-4 end -->
 <div class="col-sm-3">
 </div><!-- sm-3 end -->
 </div><!--   end row  --> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
address = soup.find_all('div', class_='col-sm-4 pt-2')
[i.text for i in address]

Output:

['\n\n\n  0.2 km\n\n Tegelhof 1\n 33014 Bad Driburg\n',
 '\n\n\n  0.2 km\n\n Tegelweg 2A\n 33014 Bad Driburg\n']

Use find_all instead of find to get a list of every matching div, and use the .text attribute of each element to get its text.

Update:
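To make the difference concrete, here is a minimal sketch on a trimmed-down stand-in for the page. The `sample` markup and the variable names `first` and `every` are assumptions for illustration, not the real page: find returns only the first matching tag, while find_all returns every match.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page (an assumption for illustration).
sample = """
<div class="col-sm-4 pt-2"><br/>0.2 km<br/>Tegelhof 1<br/>33014 Bad Driburg<br/></div>
<div class="col-sm-4 pt-2"><br/>0.2 km<br/>Tegelweg 2A<br/>33014 Bad Driburg<br/></div>
"""
soup = BeautifulSoup(sample, "html.parser")

# find: a single Tag for the first match, or None if nothing matches.
first = soup.find("div", class_="col-sm-4 pt-2")

# find_all: a list-like ResultSet holding every matching Tag.
every = soup.find_all("div", class_="col-sm-4 pt-2")

print(first.get_text(" ", strip=True))
print(len(every))
```

With find, the second address is never reached, which is likely why the original attempt returned only one block.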

add1=[i.text.replace('\n', ' ').strip(' ')  for i in address]

Use replace to turn each \n into a space, then strip the leading and trailing spaces.

Output:

['0.2 km   Tegelhof 1  33014 Bad Driburg',
 '0.2 km   Tegelweg 2A  33014 Bad Driburg']
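Note that replace followed by strip still leaves runs of spaces inside each string. If single spaces are wanted, one possible follow-up (a sketch, with `collapsed` as a hypothetical variable name) is to split on whitespace and rejoin:

```python
# Strings copied from the output shown above.
add1 = ['0.2 km   Tegelhof 1  33014 Bad Driburg',
        '0.2 km   Tegelweg 2A  33014 Bad Driburg']

# str.split() with no argument splits on any run of whitespace,
# so joining the pieces with one space collapses the internal gaps.
collapsed = [' '.join(s.split()) for s in add1]
print(collapsed)
# ['0.2 km Tegelhof 1 33014 Bad Driburg', '0.2 km Tegelweg 2A 33014 Bad Driburg']
```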

add2=[i.partition('km') for i in add1]
[[i[0]+i[1],i[2].strip(' ')] for i in add2]

Output:

[['0.2 km', 'Tegelhof 1  33014 Bad Driburg'],
 ['0.2 km', 'Tegelweg 2A  33014 Bad Driburg']]
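For readers unfamiliar with str.partition, the short sketch below shows what it returns. It splits at the first occurrence of the separator only, and if the separator is missing, the whole string ends up in the first element. The names `head`, `sep`, `tail`, and `no_match` are illustrative.

```python
s = '0.2 km   Tegelhof 1  33014 Bad Driburg'

# partition returns a 3-tuple: (text before, the separator, text after).
head, sep, tail = s.partition('km')
print(repr(head), repr(sep), repr(tail))
# '0.2 ' 'km' '   Tegelhof 1  33014 Bad Driburg'

# If the separator is absent, everything stays in the first element.
no_match = 'Tegelhof 1'.partition('km')
print(no_match)
# ('Tegelhof 1', '', '')
```

This approach therefore assumes every block starts with a distance ending in "km"; a block without one would leave the address in the first element.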

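My original solution extracted only the address, since that is what you asked for, but if you want both values, you can separate the distance from the address text: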

import re 
from bs4 import BeautifulSoup as bs

html = '''<div class="row">
 <div class="col-sm-5 py-2">
 <br/><span style="color:#7fb7c4; font-weight:600;">Praxis jetzt geöffnet</span>
 <p class="mt-5 d-none d-md-block">Telefon: <a class="it" href="tel:+4952531717">0 52 53 / 17 17</a></p>
 </div><!-- sm-5 end -->
 <div class="col-sm-4 pt-2">
 <!-- <img class="mapicons" src="https://www.tk-aerztefuehrer.de/TK/images/GoogleImages/A.png" alt=" " /><br>  -->
 <br/>
 <img alt=" " src="https://www.tk-aerztefuehrer.de/TK/img/entfernung.svg"/>  0.2 km<br/>
 <span class="pt-3 d-none d-md-block"></span>
 Tegelhof 1<br/>
 33014 Bad Driburg<br/>
 </div><!-- sm-4 end -->
 <div class="col-sm-3">
 </div><!-- sm-3 end -->
 </div><!--   end row  -->

 <div class="row">
 <div class="col-sm-5 py-2">
 <br/><span style="color:#7fb7c4; font-weight:600;">Praxis jetzt geöffnet</span>
 <p class="mt-5 d-none d-md-block">Telefon: <a class="it" href="tel:+4952536565">0 52 53 / 65 65</a></p>
 </div><!-- sm-5 end -->
 <div class="col-sm-4 pt-2">
 <!-- <img class="mapicons" src="https://www.tk-aerztefuehrer.de/TK/images/GoogleImages/A.png" alt=" " /><br>  -->
 <br/>
 <img alt=" " src="https://www.tk-aerztefuehrer.de/TK/img/entfernung.svg"/>  0.2 km<br/>
 <span class="pt-3 d-none d-md-block"></span>
 Tegelweg 2A<br/>
 33014 Bad Driburg<br/>
 </div><!-- sm-4 end -->
 <div class="col-sm-3">
 </div><!-- sm-3 end -->
 </div><!--   end row  -->'''
 
soup = bs(html, 'html.parser')

divs = soup.find_all('div', {'class':re.compile(r'pt-2')})
for div in divs:
    text_list = div.text.strip().split('\n')
    km = text_list[0]
    address = ' '.join([x for x in text_list[1:] if x !='']).strip()
    print(km)
    print(address)

Output:

0.2 km
Tegelhof 1  33014 Bad Driburg
0.2 km
Tegelweg 2A  33014 Bad Driburg
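As an alternative sketch, Beautiful Soup's stripped_strings generator yields each text fragment already stripped and skips whitespace-only fragments, which avoids the manual split and filter. The `sample` markup is a trimmed stand-in for the page and `results` is a hypothetical name; the code assumes the distance is always the first fragment in each div.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the two address blocks (an assumption for illustration).
sample = """
<div class="col-sm-4 pt-2">
 <br/>
 <img alt=" " src="entfernung.svg"/>  0.2 km<br/>
 <span class="pt-3 d-none d-md-block"></span>
 Tegelhof 1<br/>
 33014 Bad Driburg<br/>
</div>
<div class="col-sm-4 pt-2">
 <br/>
 <img alt=" " src="entfernung.svg"/>  0.2 km<br/>
 <span class="pt-3 d-none d-md-block"></span>
 Tegelweg 2A<br/>
 33014 Bad Driburg<br/>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

results = []
for div in soup.find_all("div", class_="col-sm-4 pt-2"):
    # Each non-empty text fragment, with surrounding whitespace removed.
    parts = list(div.stripped_strings)
    km = parts[0]                   # assumed: distance comes first
    address = " ".join(parts[1:])   # remaining fragments form the address
    results.append((km, address))

for km, address in results:
    print(km)
    print(address)
```

Because the fragments are joined with a single space, the address comes out as "Tegelhof 1 33014 Bad Driburg" without the doubled space.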
