
Python - Extracting data from a table

I am trying to extract data from a table that I accessed using the Beautiful Soup library. I get the table as HTML, but I am struggling to extract the data in a consumable form, since the table itself has two columns, with headers in the first and values in the second.

Here is my code:

from bs4 import BeautifulSoup as bs

html = browser.html  # `browser` is a browser session created earlier (not shown)
soup = bs(html, "html.parser")

table = soup.find("table", {"id": "productDetails_techSpec_section_1"})
table

Result of printing the table:

"<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                    Part Number 
                </th>
<td class="a-size-base">
              3885SD
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item Weight
                </th>
<td class="a-size-base">
              1.83 pounds
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Product Dimensions
                </th>
<td class="a-size-base">
              9 x 6 x 3.5 inches
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item model number
                </th>
<td class="a-size-base">
              3885SD
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item Package Quantity
                </th>
<td class="a-size-base">
              1
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Number of Handles
                </th>
<td class="a-size-base">
              1
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Batteries Included?
                </th>
<td class="a-size-base">
              No
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Batteries Required?
                </th>
<td class="a-size-base">
              No
            </td>
</tr>
</tbody></table>"

I tried using this line of code to access each header and data point:

headings = [table.get_text() for th in table.find("tr").find_all("th")]
print(headings)

And this is the response I get:

['\n\n\n                  \tPart Number\t\n                \n\n              3885SD\n            \n\n\n\n                  Item Weight\n                \n\n              1.83 pounds\n            \n\n\n\n                  Product Dimensions\n                \n\n              9 x 6 x 3.5 inches\n            \n\n\n\n                  Item model number\n                \n\n              3885SD\n            \n\n\n\n                  Item Package Quantity\n                \n\n              1\n            \n\n\n\n                  Number of Handles\n                \n\n              1\n            \n\n\n\n                  Batteries Included?\n                \n\n              No\n            \n\n\n\n                  Batteries Required?\n                \n\n              No\n            \n\n']

I have been researching different approaches to get this data into a pandas DataFrame, and this is the closest I have got so far. My question is: how do I get this data into a data frame, with the entries of the first column as headers and the entries of the second column as values?
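A likely reason for the single long string in the attempt above: the comprehension calls `table.get_text()` on the whole table instead of `th.get_text()` on each cell, and `table.find("tr")` returns only the first row. A minimal sketch of a corrected version follows; the `html` sample is an abbreviated two-row copy of the table, used here purely for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated two-row copy of the table, for illustration only.
html = """
<table id="productDetails_techSpec_section_1">
<tr><th> Part Number </th><td> 3885SD </td></tr>
<tr><th> Item Weight </th><td> 1.83 pounds </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "productDetails_techSpec_section_1"})

# Call get_text() on each cell (not on the whole table) and search every row.
headings = [th.get_text(strip=True) for th in table.find_all("th")]
values = [td.get_text(strip=True) for td in table.find_all("td")]

print(headings)
print(values)
```

Passing `strip=True` removes the surrounding newlines and indentation that appear in the raw output.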


Example:

from io import StringIO

import pandas as pd

html = """<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Part Number </th>
<td class="a-size-base">3885SD</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight</th><td class="a-size-base">1.83 pounds</td></tr>
<tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Product Dimensions</th>
<td class="a-size-base">9 x 6 x 3.5 inches</td>
</tr><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item model number</th>
<td class="a-size-base">3885SD</td></tr>
<tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item Package Quantity
</th><td class="a-size-base">1</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Number of Handles
</th><td class="a-size-base">1</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Batteries Included?
</th><td class="a-size-base">No</td></tr><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?</th><td class="a-size-base">No</td></tr></tbody></table>"""

# Read the table data; wrapping the string in StringIO avoids the
# deprecation warning that newer pandas versions raise for literal HTML
df = pd.read_html(StringIO(html))[0]
cols = df[0]  # first column: header names
vals = df[1]  # second column: values

# Turn the values into a single row
table = pd.DataFrame(vals).T
# Use the header names as the column labels
table.columns = cols
print(table)

Output:

0 Part Number  Item Weight  Product Dimensions Item model number Item Package Quantity Number of Handles Batteries Included? Batteries Required?
1      3885SD  1.83 pounds  9 x 6 x 3.5 inches            3885SD                     1                 1                  No                  No
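The leading `0` and `1` in this output are the names of the original columns, carried over as the column-axis label and the row label. A small sketch of the same transpose step that also drops those labels is shown below; the hand-built two-column frame is a stand-in for the `read_html` result and uses only two of the rows from the table.

```python
import pandas as pd

# Stand-in for pd.read_html(...)[0]: column 0 holds headers, column 1 holds values.
df = pd.DataFrame([["Part Number", "3885SD"], ["Item Weight", "1.83 pounds"]])

# Use column 0 as the index, then transpose so each header becomes a column.
wide = df.set_index(0).T
wide.columns.name = None            # drop the leftover "0" label on the column axis
wide = wide.reset_index(drop=True)  # replace the row label "1" with a plain 0

print(wide)
```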

Solution: create a function to parse the table:

def parse_table(table):
    """ Get data from table """
    return [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
           for row in table.find_all('tr')
    ]

Then create a new table using the function and convert it into a pandas DataFrame:

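import pandas as pd

new_table = parse_table(table)  # one [header, value] list per row
df = pd.DataFrame(new_table)
df = df.T                       # transpose: headers become the first row
df.columns = df.iloc[0]         # promote the first row to column names
df = df[1:]                     # drop the header row from the data
df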
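Because each parsed row is a `[header, value]` pair, another option is to turn the pairs into a dictionary and build a one-row frame from it, which skips the transpose and the header promotion. The sketch below assumes every row has exactly two cells; `parse_table` is re-declared so the block runs on its own, and the `html` sample is an abbreviated copy of the table.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_table(table):
    """Return one [header, value] list per table row."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


# Abbreviated two-row copy of the table, for illustration only.
html = """
<table>
<tr><th> Part Number </th><td> 3885SD </td></tr>
<tr><th> Item Weight </th><td> 1.83 pounds </td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
pairs = parse_table(table)

# dict() accepts an iterable of two-item sequences: header -> value.
record = dict(pairs)

# A list with one dict yields a one-row frame with the keys as columns.
df = pd.DataFrame([record])
print(df)
```

Keeping one dictionary per product also makes it straightforward to collect several products into a list and pass the whole list to `pd.DataFrame`.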

You can use zip() to transpose the values in the table:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser') # data is your table from question

rows = []
for tr in soup.select('tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('th, td')])

rows = [*zip(*rows)]    # transpose values

for row in rows:
    print(''.join(r'{: <25}'.format(d) for d in row))

Prints:

Part Number              Item Weight              Product Dimensions       Item model number        Item Package Quantity    Number of Handles        Batteries Included?      Batteries Required?      
3885SD                   1.83 pounds              9 x 6 x 3.5 inches       3885SD                   1                        1                        No                       No                       
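Since the question asks for a pandas DataFrame, the transposed rows can be passed to pandas instead of being printed. A minimal sketch, where the hard-coded `rows` list is a stand-in for the per-row lists collected by the loop above:

```python
import pandas as pd

# Stand-in for the [header, value] lists scraped from each <tr>.
rows = [["Part Number", "3885SD"], ["Item Weight", "1.83 pounds"]]

# Transpose: one tuple with all headers, one tuple with all values.
headers, values = zip(*rows)

# One data row, with the headers as column names.
df = pd.DataFrame([values], columns=list(headers))
print(df)
```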
