Python - Extracting data from the table
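I am trying to extract data from a table that I access with the Beautiful Soup library. I can get the table as HTML, but because the table itself has two columns, with the headings in the first column and the values in the second, I am struggling to extract the data in a usable form.
Here is my code: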
from bs4 import BeautifulSoup as bs
html = browser.html
soup = bs(html, "html.parser")
table = soup.find("table", {"id":"productDetails_techSpec_section_1"})
table
Printed table result:
"<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Part Number
</th>
<td class="a-size-base">
3885SD
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base">
1.83 pounds
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base">
9 x 6 x 3.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item model number
</th>
<td class="a-size-base">
3885SD
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Package Quantity
</th>
<td class="a-size-base">
1
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Number of Handles
</th>
<td class="a-size-base">
1
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base">
No
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?
</th>
<td class="a-size-base">
No
</td>
</tr>
</tbody></table>"
I tried to access each heading and data point with this line of code:
headings = [table.get_text() for th in table.find("tr").find_all("th")]
print(headings)
This is the response I get:
['\n\n\n \tPart Number\t\n \n\n 3885SD\n \n\n\n\n Item Weight\n \n\n 1.83 pounds\n \n\n\n\n Product Dimensions\n \n\n 9 x 6 x 3.5 inches\n \n\n\n\n Item model number\n \n\n 3885SD\n \n\n\n\n Item Package Quantity\n \n\n 1\n \n\n\n\n Number of Handles\n \n\n 1\n \n\n\n\n Batteries Included?\n \n\n No\n \n\n\n\n Batteries Required?\n \n\n No\n \n\n']
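I have been researching different ways to get this data into a pandas DataFrame, and this is as far as I have gotten. My question is: how can I put this data into a DataFrame with the headings as columns and the values as a row, as shown below?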
Ex.
import pandas as pd
html = """<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Part Number </th>
<td class="a-size-base">3885SD</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight</th><td class="a-size-base">1.83 pounds</td></tr>
<tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Product Dimensions</th>
<td class="a-size-base">9 x 6 x 3.5 inches</td>
</tr><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item model number</th>
<td class="a-size-base">3885SD</td></tr>
<tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item Package Quantity
</th><td class="a-size-base">1</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Number of Handles
</th><td class="a-size-base">1</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Batteries Included?
</th><td class="a-size-base">No</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?</th><td class="a-size-base">No</td></tr></tbody></table>"""
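from io import StringIO  # newer pandas versions expect a file-like object rather than a literal HTML string

# Read the table; read_html returns a list of DataFrames, one per <table>
df = pd.read_html(StringIO(html))[0]
cols = df[0]  # first column holds the headings
vals = df[1]  # second column holds the values
# Turn the values into a single row
table = pd.DataFrame(vals).T
# Use the headings as the column names
table.columns = cols
print(table)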
Output:
0 Part Number Item Weight Product Dimensions Item model number Item Package Quantity Number of Handles Batteries Included? Batteries Required?
1 3885SD 1.83 pounds 9 x 6 x 3.5 inches 3885SD 1 1 No No
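The transpose step can also be written with `set_index`, which avoids building a second DataFrame. The following is only a sketch: the hard-coded frame is an illustrative stand-in for the two-column result of `read_html`, not the full table from the question.

```python
import pandas as pd

# Hard-coded stand-in (assumption) for the two-column frame that read_html returns
df = pd.DataFrame([
    ["Part Number", "3885SD"],
    ["Item Weight", "1.83 pounds"],
    ["Batteries Included?", "No"],
])

# Use column 0 as the index, then transpose so the headings become columns
table = df.set_index(0).T
table.columns.name = None  # drop the leftover "0" label on the column axis
print(table)
```

The result has one row, with each heading as a column name.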
Solution: create a function to parse the table:
def parse_table(table):
    """Get data from table."""
    return [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]
Then use the function to build a new table, and convert that table to a pandas DataFrame:
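new_table = parse_table(table)
df = pd.DataFrame(new_table)
df = df.T                # headings become the first row
df.columns = df.iloc[0]  # promote that row to the column names
df = df[1:]              # drop the row that is now the header
df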
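For reference, the attempt in the question returns the text of the whole table because the comprehension calls `table.get_text()` rather than `th.get_text()`, and `table.find("tr")` only returns the first row. Below is a minimal sketch of a corrected version, assuming every row holds exactly one `th` and one `td`. The shortened `html` string and the `specs` name are illustrative stand-ins, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question
# (assumption: exactly one <th> and one <td> per row)
html = """<table id="productDetails_techSpec_section_1">
<tr><th> Part Number </th><td> 3885SD </td></tr>
<tr><th> Item Weight </th><td> 1.83 pounds </td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "productDetails_techSpec_section_1"})

# Call get_text() on each cell (not on the whole table) and visit every row
specs = {
    row.find("th").get_text(strip=True): row.find("td").get_text(strip=True)
    for row in table.find_all("tr")
}
print(specs)
```

A dict like this can be passed to `pd.DataFrame([specs])` to get a one-row frame with the headings as columns.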
You can use zip() to transpose the values in the table:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')  # data is your table from the question
rows = []
for tr in soup.select('tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('th, td')])
rows = [*zip(*rows)]  # transpose values
for row in rows:
    print(''.join(r'{: <25}'.format(d) for d in row))
Prints:
Part Number Item Weight Product Dimensions Item model number Item Package Quantity Number of Handles Batteries Included? Batteries Required?
3885SD 1.83 pounds 9 x 6 x 3.5 inches 3885SD 1 1 No No
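To see what the `zip(*rows)` step does in isolation, here is a small sketch on plain lists. The `rows` data is hard-coded for illustration and stands in for the `[heading, value]` pairs collected from each `tr`.

```python
# Rows as they come out of the table: one [heading, value] pair per <tr>
# (hard-coded here as an illustrative assumption)
rows = [
    ["Part Number", "3885SD"],
    ["Item Weight", "1.83 pounds"],
    ["Product Dimensions", "9 x 6 x 3.5 inches"],
]

# zip(*rows) groups the first items of every pair, then the second items,
# which turns the column of headings into one tuple and the values into another
headings, values = zip(*rows)
print(headings)
print(values)
```

Unpacking into `headings` and `values` works here because every row has exactly two cells; rows of differing length would be truncated by `zip` to the shortest one.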