Extracting data from different types of HTML using BeautifulSoup in Python

Question

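I have the following types of HTML and need to extract the "Student ID" from them. I can extract the student ID from the first type below, but I'm not sure how to modify my code so that it also correctly extracts the "Student ID" from the second type.

Type 1: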

student_html='''
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>

<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
  <span style="font-family: Helvetica; font-size:8px">
   John Doe
   <br/>
  </span>
</div>
'''

I am using the following code to extract the "Student ID" from the HTML above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(student_html, "lxml")
span_tags = soup.find_all("span")
for span in span_tags:
    if span.text.strip() == "Student ID":
        student_id = span.find_next("span").text
    if span.text.strip() == "Student Name":
        student_name = span.find_next("span").text

This is the second type of HTML.

Type 2:

type2HTML = '''<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
   <br/>
   John Doe
   <br/>
  </span>
</div>
'''

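How can I modify the code above to extract the student ID from this as well? I also need to extract other fields in the same way: student name, address, grade, and so on.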

Answer 1

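You can try this once you have pulled the right <div> tags out of the source HTML. Calling getText() with a separator on the first <span> joins its text fragments into one string, and the last fragment is the value in both layouts.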

For example:

from bs4 import BeautifulSoup

type_one = """
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>"""

type_two = """<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>
"""

all_types = [type_one, type_two]

for _type in all_types:
    _id = (
        BeautifulSoup(_type, "lxml")
        .find("span")
        .getText(strip=True, separator="|")
        .split("|")[-1]
    )
    print(_id)

Output:

123456
123456
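The question also asks for other fields such as the student name. A minimal sketch of how the same idea might be extended, assuming every <div> holds a label as its first text fragment followed by its value: the helper name `extract_fields` is hypothetical, and `html.parser` is used here only to avoid the lxml dependency (the `"lxml"` parser from the snippet above should behave the same on this input).

```python
from bs4 import BeautifulSoup

type_one = """
<div style="position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>
<div style="position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
  <span style="font-family: Helvetica; font-size:8px">
   John Doe
   <br/>
  </span>
</div>
"""

type_two = """
<div style="position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>
<div style="position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
   <br/>
   John Doe
   <br/>
  </span>
</div>
"""


def extract_fields(html):
    """Hypothetical helper: map each <div>'s label to its value."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for div in soup.find_all("div"):
        # stripped_strings yields every non-empty text fragment, trimmed,
        # regardless of whether the value sits in a nested <span> or after a <br/>.
        parts = list(div.stripped_strings)
        if len(parts) >= 2:
            # Assumption: first fragment is the label, the rest is the value.
            fields[parts[0]] = " ".join(parts[1:])
    return fields


print(extract_fields(type_one))  # {'Student ID': '123456', 'Student Name': 'John Doe'}
print(extract_fields(type_two))  # {'Student ID': '123456', 'Student Name': 'John Doe'}
```

Joining the remaining fragments with a space is a design choice so that a value broken across several <br/> tags, such as an address, would come back as one string. If a <div> can contain more than one label, this sketch would need a different pairing rule.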

Answer 2

If you are free to use other modules, consider the following solution:

    from weblib.etree import parse_html
    from selection import XpathSelector

    student_html='''
    <div style= "position:absolute; border:textbook 1px solid">
      <span style="font-family: Helvetica; font-size:8px">
       Student ID
      <span style="font-family: Helvetica; font-size:8px">
       123456
       <br/>
      </span>
    </div>'''
    
    type2HTML = '''<div style= "position:absolute; border:textbook 1px solid">
      <span style="font-family: Helvetica; font-size:8px">
       Student ID
       <br/>
       123456
       <br/>
      </span>
    </div>'''

    all_types = [student_html, type2HTML]

    for _type in all_types:
        node = parse_html(_type)

        nodes = [node for node in XpathSelector(node).select('//span')]

        if len(nodes) == 1:
            content = nodes[0].text()
        else:
            content = nodes[1].text()

        student_id = content.replace('Student ID', '').strip()

        print(student_id)

Output:

123456
123456
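The same label-stripping idea can probably be sketched with lxml alone, for readers who would rather not install weblib and selection. This is an illustration under assumptions, not the answerer's code: the helper name `value_after_label` is hypothetical, and it relies on the value always living in the last <span> of the fragment.

```python
from lxml import html as lxml_html

type_one = """
<div style="position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>"""

type_two = """<div style="position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>"""


def value_after_label(fragment, label):
    """Hypothetical helper: text of the last <span>, minus the label."""
    root = lxml_html.fromstring(fragment)
    spans = root.xpath(".//span")
    if not spans:
        return None
    # Type 1 has two spans and the value is in the second one;
    # type 2 has a single span holding both label and value.
    content = spans[-1].text_content()
    return content.replace(label, "").strip()


for fragment in (type_one, type_two):
    print(value_after_label(fragment, "Student ID"))
```

Taking the last <span> replaces the `len(nodes) == 1` branch above, since index `-1` picks the only span in type 2 and the inner span in type 1.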

2 answers

Answer 1
1 2021-05-25 13:02:47

Answer 2
-1 2021-05-25 13:30:27
