Python - Extracting data from the table
I am trying to extract data from a table that I accessed using the Beautiful Soup library. I get the table as HTML, but I am struggling to extract the data in a usable form, since the table itself has two columns, with headers in the first and values in the second.
Here is my code:
from bs4 import BeautifulSoup as bs  # "bs" is the alias used below

html = browser.html
soup = bs(html, "html.parser")
table = soup.find("table", {"id": "productDetails_techSpec_section_1"})
table
Results of printing table:
"<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Part Number
</th>
<td class="a-size-base">
3885SD
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base">
1.83 pounds
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base">
9 x 6 x 3.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item model number
</th>
<td class="a-size-base">
3885SD
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Package Quantity
</th>
<td class="a-size-base">
1
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Number of Handles
</th>
<td class="a-size-base">
1
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base">
No
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?
</th>
<td class="a-size-base">
No
</td>
</tr>
</tbody></table>"
I tried using this line of code to access each header and data point:
headings = [table.get_text() for th in table.find("tr").find_all("th")]
print(headings)
And this is the response I get:
['\n\n\n \tPart Number\t\n \n\n 3885SD\n \n\n\n\n Item Weight\n \n\n 1.83 pounds\n \n\n\n\n Product Dimensions\n \n\n 9 x 6 x 3.5 inches\n \n\n\n\n Item model number\n \n\n 3885SD\n \n\n\n\n Item Package Quantity\n \n\n 1\n \n\n\n\n Number of Handles\n \n\n 1\n \n\n\n\n Batteries Included?\n \n\n No\n \n\n\n\n Batteries Required?\n \n\n No\n \n\n']
I have been researching different approaches to get this data into a pandas DataFrame, and this is the closest I have got so far. My question is: how do I get this data into a DataFrame where my headers and values look like the example below?
Ex.
import pandas as pd
html = """<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Part Number </th>
<td class="a-size-base">3885SD</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight</th><td class="a-size-base">1.83 pounds</td></tr>
<tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Product Dimensions</th>
<td class="a-size-base">9 x 6 x 3.5 inches</td>
</tr><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item model number</th>
<td class="a-size-base">3885SD</td></tr>
<tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item Package Quantity
</th><td class="a-size-base">1</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Number of Handles
</th><td class="a-size-base">1</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Batteries Included?
</th><td class="a-size-base">No</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?</th><td class="a-size-base">No</td></tr></tbody></table>"""
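# Read the table; wrapping the string in StringIO avoids the deprecation
# warning that newer pandas versions raise for literal HTML input.
from io import StringIO
df = pd.read_html(StringIO(html))[0]
cols = df[0]  # first column: header labels
vals = df[1]  # second column: values
table = pd.DataFrame(vals).T
# Use the labels as column names
table.columns = cols
print(table)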
Output:
0 Part Number Item Weight Product Dimensions Item model number Item Package Quantity Number of Handles Batteries Included? Batteries Required?
1 3885SD 1.83 pounds 9 x 6 x 3.5 inches 3885SD 1 1 No No
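The reshaping step above can also be written more compactly. This is a minimal sketch, assuming the same two-column layout that `pd.read_html` returns for this table (labels in column 0, values in column 1). The frame is built by hand from a shortened set of rows so the example runs without an HTML parser; the name `wide` is only illustrative.

```python
import pandas as pd

# Stand-in for pd.read_html(...)[0]: column 0 holds the labels, column 1 the values.
df = pd.DataFrame(
    [
        ["Part Number", "3885SD"],
        ["Item Weight", "1.83 pounds"],
        ["Product Dimensions", "9 x 6 x 3.5 inches"],
    ]
)

# Make the labels the index, then transpose so each label becomes a column.
wide = df.set_index(0).T
wide.columns.name = None            # drop the leftover "0" axis label
wide = wide.reset_index(drop=True)  # renumber the single row from 0

print(wide)
```

Clearing `columns.name` is optional; it only removes the stray `0` that otherwise appears in the top-left corner of the printed frame.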
Solution: Create a function to parse the table:
def parse_table(table):
    """Get data from table."""
    return [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]
Then create a new table using the function and convert it into a pandas DataFrame:
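new_table = parse_table(table)
df = pd.DataFrame(new_table)
df = df.T                # transpose: the labels become the first row
df.columns = df.iloc[0]  # promote the first row to column names
df = df[1:]              # drop the row that now duplicates the header
df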
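For reference, the attempt in the question returns the whole table's text because the comprehension calls `table.get_text()` instead of `th.get_text()`, and `table.find("tr")` only visits the first row. A possible alternative to transposing is to pair each row's `<th>` with its `<td>` directly. The following is only a sketch under that assumption, run against a shortened copy of the table; the name `record` is illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """
<table id="productDetails_techSpec_section_1">
<tr><th> Part Number </th><td> 3885SD </td></tr>
<tr><th> Item Weight </th><td> 1.83 pounds </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "productDetails_techSpec_section_1"})

# Pair each row's <th> label with its <td> value.
record = {}
for tr in table.find_all("tr"):
    th, td = tr.find("th"), tr.find("td")
    if th is not None and td is not None:  # skip rows missing either cell
        record[th.get_text(strip=True)] = td.get_text(strip=True)

# A list containing one dict gives a one-row DataFrame.
df = pd.DataFrame([record])
print(df)
```

Building a dict first also makes it straightforward to collect one record per product page and pass the whole list to `pd.DataFrame` at the end.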
You can use zip() to transpose values in the table:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')  # data is your table from question
rows = []
for tr in soup.select('tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('th, td')])
rows = [*zip(*rows)]  # transpose values
for row in rows:
    print(''.join(r'{: <25}'.format(d) for d in row))
Prints:
Part Number Item Weight Product Dimensions Item model number Item Package Quantity Number of Handles Batteries Included? Batteries Required?
3885SD 1.83 pounds 9 x 6 x 3.5 inches 3885SD 1 1 No No
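To see what the `zip(*rows)` step does on its own, here is a small sketch that uses plain lists in place of the parsed cells. The rows are a shortened, hand-written copy of the table, and `headers`, `values`, and `record` are illustrative names.

```python
# Rows in the shape the loop above produces: one [label, value] pair per <tr>.
rows = [
    ["Part Number", "3885SD"],
    ["Item Weight", "1.83 pounds"],
    ["Batteries Included?", "No"],
]

# zip(*rows) passes each row as a separate argument, so zip groups the
# first items together, then the second items.
headers, values = zip(*rows)

# Pairing them back up gives a label -> value mapping.
record = dict(zip(headers, values))

print(headers)
print(record["Item Weight"])
```

Because `zip` stops at the shortest input, a row with a missing cell would silently truncate the result, so this approach assumes every row has exactly one label and one value.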