Extract text in between <br/> tags using BeautifulSoup to separate pandas columns
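I have scraped an HTML data table (see the example below) and am trying to save it to a pandas DataFrame. I can successfully extract each row and parse each HTML <td> column into a separate column in the df (see the code below). The problem is that some columns contain several data items separated by <br> or <nobr>. Each element separated by <br> or <nobr> should go into its own df column (see the current and desired df columns below). For example, the current code outputs the date and time data as 09.10.201918:5020:25 instead of separating the date 09.10.2019, the departure time 18:50, and the arrival time 20:25 into their own columns in the df.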
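One possible way to get each `<br/>`-separated piece as its own value with BeautifulSoup alone is the `stripped_strings` generator, which yields every text node of a tag separately instead of concatenating them as `.text` does. The snippet below is a minimal sketch on a shortened, two-cell version of the sample row; the `html` string and the `flat_row` name are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup

# Two cells taken from the sample row; the real page has more cells per row.
html = (
    '<table><tr>'
    '<td class="liste"><nobr>02.01.2018</nobr>\n<br/>08:45<br/>14:55 </td>'
    '<td class="liste"><b>Melbourne</b><br/>\nAustralia<br/>Tullamarine</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")

flat_row = []
for cell in soup.find_all("td"):
    # stripped_strings yields each text node on its own, trimmed of
    # surrounding whitespace, and skips nodes that are empty after trimming.
    flat_row.extend(cell.stripped_strings)

print(flat_row)
# ['02.01.2018', '08:45', '14:55', 'Melbourne', 'Australia', 'Tullamarine']
```

Because whitespace-only pieces are skipped, a cell with a missing item yields fewer values and shifts the later columns, so it is worth checking the length of each row before assigning the desired column names.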
Example HTML row
<tr valign="top"><td align="right" class="liste_gross">548<br/></td><td class="liste"><nobr>02.01.2018</nobr>
<br/>08:45<br/>14:55 </td><td class="liste_gross"><b>MEL</b><br/></td><td class="liste"><b>Melbourne</b><br/>
Australia<br/>Tullamarine</td><td class="liste_gross"><b>HKG</b><br/></td><td class="liste"><b>Hong Kong</b><br/>
China<br/>International</td><th align="right" class="liste_gross"><table border="0" cellpadding="0" cellspacing="0">
<tr><td align="right">7,420 </td><td>km</td></tr><tr><td align="right">9:25 </td><td>h</td></tr></table>
</th><td class="liste">Cathay Pacific<br/>CX34</td><td class="liste">A350-900<br/>B-LRR</td>
<td class="liste">32A/Window<br/><small>EconomyPlus<br/>Passenger<br/>Personal</small></td><td class="liste">
<br/><select onchange="if (this.value != 'NIL') location.href=this.value;" style="width:60px;">
<option value="NIL">Flight</option><option value="?go=flugdaten_edit&id=14619399&dbpos=0">edit</option>
<option value="NIL">----------</option><option value="?go=flugdaten_loeschen&id=14619399&dbpos=0">delete
</option></select></td></tr>
Python code with the current column names
import pandas as pd
from bs4 import BeautifulSoup

# response holds the HTML of the downloaded page
soup = BeautifulSoup(response, 'html.parser') # Parse the response using BeautifulSoup
table = soup.find('table', attrs={'cellspacing' : 2}) # Select the only table with this attribute
rows = table.find_all('tr')
data = [] # One list of cell texts per row
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
dftable = pd.DataFrame(data, columns = ['flightno', 'date.timedept.timearr', 'codedept',
                                        'citydept.countrydept.namedept', 'codearr', 'cityarr.countryarr.namearr',
                                        'dist', 'distunits', 'time', 'timeunits', 'airline.flightno',
                                        'manuf.type.rego', 'seat.loc.class.pass.reason', 'inputcol'])
dftable = dftable.dropna() # Drop incomplete rows
Desired list of column names
dftable = pd.DataFrame(data, columns = ['flightno', 'date', 'timedept', 'timearr', 'codedept',
'citydept', 'countrydept', 'namedept', 'codearr', 'cityarr', 'countryarr',
'namearr', 'dist', 'distunits', 'time', 'timeunits', 'airline', 'flightno',
'manuf', 'type', 'rego', 'seat', 'loc', 'class', 'pass', 'reason', 'inputcol'])
Maybe this can help you... You can use pd.read_html to parse the HTML table, like this:
from io import StringIO
import re

from bs4 import BeautifulSoup
import pandas as pd

with open("table.html") as f:
    soup = BeautifulSoup(f, "lxml")
# Replace <br/> by | ...
s = re.sub(r'<br\s*/>', '|', str(soup))
# Wrap the markup in StringIO: newer pandas versions deprecate passing a literal HTML string
df_table = pd.read_html(StringIO(s))
# To dataframe (the first table found is the outer one)
df_table = df_table[0]
df_table.columns = ['flightno', 'fulldate', 'codedept', 'full_dept', 'codearr', 'full_arr', 'KM', 'date_plane', 'date_plane_2', 'date_pass', 'inputcol']
# Split columns using value |
df_table[['date','timedept','timearr']] = df_table['fulldate'].str.split('|', expand=True)
df_table[['citydept','countrydept','namedept']] = df_table['full_dept'].str.split('|', expand=True)
df_table[['cityarr','countryarr','namearr']] = df_table['full_arr'].str.split('|', expand=True)
# Note: this overwrites the first 'flightno' column with the airline flight number
df_table[['airline','flightno']] = df_table['date_plane'].str.split('|', expand=True)
df_table[['manuf','type']] = df_table['date_plane_2'].str.split('|', expand=True)
df_table[['full_seat','class','pass','reason']] = df_table['date_pass'].str.split('|', expand=True)
df_table[['seat', 'loc']] = df_table['full_seat'].str.split('/', expand=True)
# Drop the combined columns that are no longer needed
df_table.drop(['fulldate','full_dept','full_arr','date_plane','date_plane_2','date_pass','full_seat'], axis=1, inplace=True)
#print(df_table)
df_table.to_csv('table_to_csv.csv')
table.html contains:
<!DOCTYPE html>
<html>
<body>
<table border='1'>
<tr valign="top"><td align="right" class="liste_gross">548<br/></td><td class="liste"><nobr>02.01.2018</nobr>
<br/>08:45<br/>14:55 </td><td class="liste_gross"><b>MEL</b><br/></td><td class="liste"><b>Melbourne</b><br/>
Australia<br/>Tullamarine</td><td class="liste_gross"><b>HKG</b><br/></td><td class="liste"><b>Hong Kong</b><br/>
China<br/>International</td><th align="right" class="liste_gross"><table border="0" cellpadding="0" cellspacing="0">
<tr><td align="right">7,420 </td><td>km</td></tr><tr><td align="right">9:25 </td><td>h</td></tr></table>
</th><td class="liste">Cathay Pacific<br/>CX34</td><td class="liste">A350-900<br/>B-LRR</td>
<td class="liste">32A/Window<br/><small>EconomyPlus<br/>Passenger<br/>Personal</small></td><td class="liste">
<br/><select onchange="if (this.value != 'NIL') location.href=this.value;" style="width:60px;">
<option value="NIL">Flight</option><option value="?go=flugdaten_edit&id=14619399&dbpos=0">edit</option>
<option value="NIL">----------</option><option value="?go=flugdaten_loeschen&id=14619399&dbpos=0">delete
</option></select></td></tr></table>
</body>
</html>
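The pattern `<br\s*/>` in the answer above only matches the self-closing form. That is enough for `str(soup)`, which serializes the tag as `<br/>`, but raw markup may also contain `<br>` or `<br />`. The following is a small standard-library sketch of a more tolerant split applied to a cell's inner HTML directly; the `BR_TAG` constant and the `split_on_br` helper are hypothetical names introduced here for illustration.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def split_on_br(cell_html):
    """Split a cell's inner HTML on <br> tags and strip the remaining tags."""
    parts = BR_TAG.split(cell_html)
    # Remove any other tags (e.g. <b>, <nobr>) and surrounding whitespace.
    return [re.sub(r"<[^>]+>", "", part).strip() for part in parts]


print(split_on_br("<nobr>02.01.2018</nobr>\n<br/>08:45<br/>14:55 "))
# ['02.01.2018', '08:45', '14:55']
```

Unlike an approach that skips empty pieces, this keeps an empty string for a missing item (for example a trailing `<br/>`), so the position of each value within the cell is preserved.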
Here is a regular expression + SimplifiedDoc solution
import re
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<table cellspacing="2"><tr valign="top"><td align="right" class="liste_gross">548<br/></td><td class="liste"><nobr>02.01.2018</nobr>
<br/>08:45<br/>14:55 </td><td class="liste_gross"><b>MEL</b><br/></td><td class="liste"><b>Melbourne</b><br/>
Australia<br/>Tullamarine</td><td class="liste_gross"><b>HKG</b><br/></td><td class="liste"><b>Hong Kong</b><br/>
China<br/>International</td><th align="right" class="liste_gross"><table border="0" cellpadding="0" cellspacing="0">
<tr><td align="right">7,420 </td><td>km</td></tr><tr><td align="right">9:25 </td><td>h</td></tr></table>
</th><td class="liste">Cathay Pacific<br/>CX34</td><td class="liste">A350-900<br/>B-LRR</td>
<td class="liste">32A/Window<br/><small>EconomyPlus<br/>Passenger<br/>Personal</small></td><td class="liste">
<br/><select onchange="if (this.value != 'NIL') location.href=this.value;" style="width:60px;">
<option value="NIL">Flight</option><option value="?go=flugdaten_edit&id=14619399&dbpos=0">edit</option>
<option value="NIL">----------</option><option value="?go=flugdaten_loeschen&id=14619399&dbpos=0">delete
</option></select></td></tr></table>
'''
doc = SimplifiedDoc(html)
table = doc.getElement('table',attr="cellspacing",value="2")
rows = table.trs # get all rows
data = []
for row in rows:
    arr = []
    # cols = row.tds # get all tds
    cols = row.children # td and th
    i = 0
    while i<len(cols):
        if i==1: # for example: split the date/time cell on <br/>
            items = re.split(r'<br\s*/>',cols[i].html)
            for item in items:
                arr.append(doc.removeHtml(item))
        elif cols[i].tag=='th': # nested table: handle it yourself
            tds = cols[i].tds
            print (tds)
        else:
            arr.append(cols[i].text)
        i+=1
    data.append(arr)
print (data) # [['548', '02.01.2018', '08:45', '14:55', 'MEL', 'MelbourneAustraliaTullamarine', 'HKG', 'Hong KongChinaInternational', 'Cathay PacificCX34', 'A350-900B-LRR', '32A/WindowEconomyPlusPassengerPersonal', 'Flightedit----------delete']]
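The `th` branch above leaves the nested distance/duration table to the reader. In the same regex spirit, one way to pull its four values out of that cell's HTML is sketched below with the standard library only; `th_html`, `TD_CONTENT` and `values` are illustrative names, and the pattern assumes the nested cells contain plain text as in the sample row.

```python
import re

# The <th> cell from the sample row, which wraps a small nested table.
th_html = (
    '<th align="right" class="liste_gross">'
    '<table border="0" cellpadding="0" cellspacing="0">'
    '<tr><td align="right">7,420 </td><td>km</td></tr>'
    '<tr><td align="right">9:25 </td><td>h</td></tr>'
    '</table></th>'
)

# Capture the content of every <td>...</td> pair inside the cell.
TD_CONTENT = re.compile(r"<td[^>]*>(.*?)</td>", flags=re.DOTALL | re.IGNORECASE)

values = [text.strip() for text in TD_CONTENT.findall(th_html)]
print(values)  # ['7,420', 'km', '9:25', 'h']
```

These four values correspond to the `dist`, `distunits`, `time` and `timeunits` columns in the question's column list.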