Extract address from unstructured table row with BeautifulSoup

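I have an HTML document from which I want to extract an address, but I have not managed to. Here is the HTML document. The address is not wrapped in a tag of its own, so a beginner like me cannot select it directly (for example with find() or similar).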

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<table class="novip">
    <tr class="novip">
        <td class="novip-portrait-picture"
            rowspan="5">
            <a class="novip" href="refer.html">URL</a>
        </td>
        <td class="novip-left">
            <a class="novip-firmen-name"
               href="refer.html"
               target="_top">
                John Doe
            </a>
        </td>
        <td class="novip-right"
            rowspan="2">
            <a class="novip" href="refer.html">URL</a>
        </td>
    </tr>
    <tr class="novip">
        <td class="novip-left">
            <span class="novip-left-titel">
              Prof.
            </span>
            <span class="novip-left-fachbezeichnung">
              Professor for History
            </span>
            <br/>
            Rose Avenue 33, 4302843 A City
            <br/>
            Tel:&nbsp;<a>234 23 43244</a>
            &nbsp;&nbsp;
            <a class="novip-left-make_appointment-button-active">Booking</a>
            &nbsp;&nbsp;
        </td>
    </tr>

</table>

</body>
</html>

I want to extract the address Rose Avenue 33, 4302843 A City

Here is my attempt, but I cannot narrow the selection down to the address.

import requests
from bs4 import BeautifulSoup


r = requests.get(url)
r.encoding = 'utf8'
html_doc = r.text
soup = BeautifulSoup(html_doc, features='html5lib')
table = []

tables = soup.find_all("table", {"class": "novip"})

for table in tables:
    rows = table.findChildren('tr')
    
    address = rows[1].find('span', 'novip-left-fachbezeichnung').text

The following code approximates your attempt. It relies on bs4 (BeautifulSoup), pandas, and requests:

import requests
from bs4 import BeautifulSoup 
import pandas as pd

url =  'https://www.doktor.ch/gynaekologen/gynaekologen_k_lu.html'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
dr_list = []
doctor_cards = soup.select('table.novip')
for card in doctor_cards:
    try:
        dr_name = card.select_one('a.novip-firmen-name').text.strip()
    except Exception as e:
        dr_name = 'No Name'
    try:
        dr_url = card.select_one('a').get('href')
    except Exception as e:
        dr_url = 'No Url'
    try:
        dr_title = card.select_one('span.novip-left-titel').text.strip()
    except Exception as e:
        dr_title = 'No title'
    try:
        dr_specialisation = card.select_one('span.novip-left-fachbezeichnung').text.strip()
    except Exception as e:
        dr_specialisation = 'No specialisation'
    try:
        # Some cards carry an address supplement span; the street follows it as bare text.
        dr_address_span = card.select_one('span.novip-left-adresszusatz')
        dr_address = dr_address_span.text.strip() + ' ' + dr_address_span.next_sibling.strip()
    except Exception:
        # No supplement span: fall back to the nodes after the specialisation span.
        dr_address = 'No address'
        spec_span = card.select_one('span.novip-left-fachbezeichnung')
        try:
            first = spec_span.next_sibling
            if len(first.strip()) > 5:
                dr_address = first.strip().replace('\n', ' ')
            elif len(first.next_sibling) > 5:
                dr_address = first.next_sibling.text.strip().replace('\n', ' ')
            else:
                dr_address = first.next_sibling.next_sibling.strip().replace('\n', ' ')
        except (AttributeError, TypeError):
            pass  # a sibling was missing or not a string; keep the placeholder

    dr_list.append((dr_name, dr_title, dr_specialisation, dr_address))
df = pd.DataFrame(dr_list, columns = ['Name', 'Title', 'Spec', 'Address'])
df.to_csv('swiss_docs.csv')
print(df.head())

This saves a CSV file with the doctors' details, which looks like this:

Name    Title   Spec    Address
0   Wey Barbara Dr. med.    Fachärztin FMH für Gynäkologie u. Geburtshilfe  Hauptstrasse 12, 6033 Buchrain Tel: 041 444 30 80    Terminanfrage    Karte
1   Bohl Urs    Dr. med.    Facharzt FMH für Gynäkologie und Geburtshilfe   Seetalstrasse 11, 6020 Emmenbrücke
2   Füchsel Glenn   Dr. med.    Facharzt für Gynäkologie und Geburtshilfe   docstation Gesundheitszentrum Emmen Mooshüslistrasse 6, 6032 Emmen
3   Dal Pian Désirée    Dr. med.    Fachärztin FMH für Gynäkologie u. Geburtshilfe  Frauenpraxis Zero Plus Am Mattenhof 4a, 6010 Kriens
4   Gilke Ursula    Dr. med.    Fachärztin für Gynäkologie u. Geburtshilfe  Schachenstrasse 5, 6010 Kriens
5   Amann Stefanie  Dr. med.    Fachärztin FMH Gynäkologie u. Geburtshilfe  Frauenpraxis am See Alpenstrasse 1, 6004 Luzern
6   Ballabio Nadja  Dr. med.    Fachärztin FMH Gynäkologie und Geburtshilfe gyn-zentrum ag Haldenstrasse 11, 6006 Luzern
[...]
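
The first row above shows the address picking up the trailing phone and button text. A possible post-processing step is sketched below. It assumes the phone label always begins with "Tel:", which is only inferred from the sample output, and `clean_address` is a hypothetical helper name, not part of the original answer.

```python
import re


def clean_address(raw: str) -> str:
    """Drop everything from the first 'Tel:' label onward and collapse whitespace."""
    # Assumption: the phone label is always spelled 'Tel:' on this site.
    head = re.split(r'\bTel:', raw, maxsplit=1)[0]
    # str.split() with no arguments also splits on non-breaking spaces.
    return ' '.join(head.split())


print(clean_address('Hauptstrasse 12, 6033 Buchrain Tel: 041 444 30 80    Terminanfrage    Karte'))
# prints: Hauptstrasse 12, 6033 Buchrain
```

It could be applied to the collected values before building the DataFrame, for example `dr_address = clean_address(dr_address)`.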

There are better, more elegant solutions. See the bs4 documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
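
One such approach, offered as a minimal sketch rather than the answerer's method, is to start at the specialisation span and walk its `next_siblings` until the first non-blank bare text node appears. The snippet uses a trimmed copy of the question's second table row. `extract_address` is a hypothetical helper name, and the sketch assumes the address is always the first loose text after `span.novip-left-fachbezeichnung`.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the second row from the question's sample HTML.
html_doc = """
<table class="novip">
  <tr class="novip">
    <td class="novip-left">
      <span class="novip-left-titel">Prof.</span>
      <span class="novip-left-fachbezeichnung">Professor for History</span>
      <br/>
      Rose Avenue 33, 4302843 A City
      <br/>
      Tel:&nbsp;<a>234 23 43244</a>
    </td>
  </tr>
</table>
"""


def extract_address(cell):
    """Return the first non-blank bare text node after the specialisation span."""
    span = cell.select_one('span.novip-left-fachbezeichnung')
    if span is None:
        return None
    for node in span.next_siblings:
        # Tags such as <br/> are skipped; only loose strings are considered.
        if isinstance(node, NavigableString) and node.strip():
            return ' '.join(node.split())
    return None


soup = BeautifulSoup(html_doc, 'html.parser')
address = extract_address(soup.select_one('td.novip-left'))
print(address)
# prints: Rose Avenue 33, 4302843 A City
```

Because the address has no tag of its own, it exists in the tree only as a `NavigableString` sibling of the span, which is why `find()` on tag names alone cannot reach it.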
