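Extracting data from different types of HTML using BeautifulSoup in Python
I have the following types of HTML, from which I need to extract the "Student ID". I can extract the student ID from the HTML below, but I am not sure how to modify my code so that it also correctly extracts the "Student ID" from the second type of HTML. Type 1: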
student_html='''
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<span style="font-family: Helvetica; font-size:8px">
123456
<br/>
</span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student Name
<span style="font-family: Helvetica; font-size:8px">
John Doe
<br/>
</span>
</div>
'''
I am using the following code to extract the "Student ID" from the HTML above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(student_html, "lxml")
span_tags = soup.find_all("span")
for span in span_tags:
    # The label sits in one <span>; the value is expected in the next <span>.
    if span.text.strip() == "Student ID":
        student_id = span.find_next("span").text
    if span.text.strip() == "Student Name":
        student_name = span.find_next("span").text
This is the second type of HTML. Type 2:
type2HTML = '''<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<br/>
123456
<br/>
</span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student Name
<br/>
John Doe
<br/>
</span>
</div>
'''
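How can I modify the code above to extract the student ID from this HTML? I also need to extract other information in the same way: student name, address, grade, etc.
You can try this once you have dug the correct <div> tags out of the source HTML.
For example: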
from bs4 import BeautifulSoup
type_one = """
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<span style="font-family: Helvetica; font-size:8px">
123456
<br/>
</span>
</div>"""
type_two = """<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<br/>
123456
<br/>
</span>
</div>
"""
all_types = [type_one, type_two]
for _type in all_types:
    _id = (
        BeautifulSoup(_type, "lxml")
        .find("span")
        .getText(strip=True, separator="|")
        .split("|")[-1]
    )
    print(_id)
Output:
123456
123456
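Since the question also asks about the student name and other fields, here is a minimal sketch of how the same idea might be generalized to collect every label/value pair. It assumes each <div> holds exactly one label followed by its value, as in the two samples; the function name extract_fields and the choice of the built-in "html.parser" backend (instead of "lxml") are my own assumptions, not part of the answer above.

```python
from bs4 import BeautifulSoup

TYPE_ONE = """
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<span style="font-family: Helvetica; font-size:8px">
123456
<br/>
</span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student Name
<span style="font-family: Helvetica; font-size:8px">
John Doe
<br/>
</span>
</div>
"""

TYPE_TWO = """
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<br/>
123456
<br/>
</span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student Name
<br/>
John Doe
<br/>
</span>
</div>
"""


def extract_fields(html, parser="html.parser"):
    """Return {label: value} for each <div>, whichever layout it uses.

    Hypothetical helper: assumes the first text fragment in a <div>
    is the label and the remaining fragments form the value.
    """
    soup = BeautifulSoup(html, parser)
    fields = {}
    for div in soup.find_all("div"):
        # stripped_strings yields the non-empty text fragments in document order,
        # regardless of whether they are split by a nested <span> or by <br/>.
        parts = list(div.stripped_strings)
        if len(parts) >= 2:
            fields[parts[0]] = " ".join(parts[1:])
    return fields


for sample in (TYPE_ONE, TYPE_TWO):
    fields = extract_fields(sample)
    print(fields["Student ID"], fields["Student Name"])
```

Working per <div> rather than per <span> avoids depending on how the parser nests the unclosed <span> tags in Type 1. If a real page puts several labels in one <div>, this sketch would need adjusting.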
If you are free to use other modules, consider the following solution:
from weblib.etree import parse_html
from selection import XpathSelector
student_html='''
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<span style="font-family: Helvetica; font-size:8px">
123456
<br/>
</span>
</div>'''
type2HTML = '''<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<br/>
123456
<br/>
</span>
</div>'''
all_types = [student_html, type2HTML]
for _type in all_types:
    node = parse_html(_type)
    nodes = [node for node in XpathSelector(node).select('//span')]
    if len(nodes) == 1:
        content = nodes[0].text()
    else:
        content = nodes[1].text()
    student_id = content.replace('Student ID', '').strip()
    print(student_id)
Output:
123456
123456
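If installing extra modules is not an option at all, something similar can probably be done with only the standard library. The following is a rough sketch using html.parser.HTMLParser; the class name DivTextCollector and the function extract_fields_stdlib are hypothetical names of mine, and the code assumes the <div> blocks are not nested and each holds one label followed by its value.

```python
from html.parser import HTMLParser

SAMPLE = """
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student ID
<span style="font-family: Helvetica; font-size:8px">
123456
<br/>
</span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
<span style="font-family: Helvetica; font-size:8px">
Student Name
<br/>
John Doe
<br/>
</span>
</div>
"""


class DivTextCollector(HTMLParser):
    """Collect the non-empty text fragments found inside each <div>."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one list of text fragments per closed <div>
        self._current = None  # fragments of the <div> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.blocks.append(self._current)
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._current is not None:
            self._current.append(text)


def extract_fields_stdlib(html):
    """Return {label: value}, treating the first fragment of each <div> as the label."""
    collector = DivTextCollector()
    collector.feed(html)
    collector.close()
    return {
        parts[0]: " ".join(parts[1:])
        for parts in collector.blocks
        if len(parts) >= 2
    }


print(extract_fields_stdlib(SAMPLE))
```

The sample mixes one Type 1 block and one Type 2 block on purpose, to show that the <span> and <br/> tags are simply ignored and only the text inside each <div> matters.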