
Python - Extracting data from the table

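I am trying to extract data from a table, and I am using the Beautiful Soup library to access it. I get the table as HTML, but because the table itself has two columns, with the headings in the first column and the values in the second, I am struggling to extract the data in a consumable form.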

Here is my code:

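from bs4 import BeautifulSoup as bs

# `browser` is a browser-automation object created earlier (not shown)
html = browser.html
soup = bs(html, "html.parser")

table = soup.find("table", {"id": "productDetails_techSpec_section_1"})
table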

Result of printing the table:

"<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                    Part Number 
                </th>
<td class="a-size-base">
              3885SD
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item Weight
                </th>
<td class="a-size-base">
              1.83 pounds
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Product Dimensions
                </th>
<td class="a-size-base">
              9 x 6 x 3.5 inches
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item model number
                </th>
<td class="a-size-base">
              3885SD
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item Package Quantity
                </th>
<td class="a-size-base">
              1
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Number of Handles
                </th>
<td class="a-size-base">
              1
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Batteries Included?
                </th>
<td class="a-size-base">
              No
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Batteries Required?
                </th>
<td class="a-size-base">
              No
            </td>
</tr>
</tbody></table>"

I tried to access each header and data point using this line of code:

headings = [table.get_text() for th in table.find("tr").find_all("th")]
print(headings)

Here is the response I got:

['\n\n\n                  \tPart Number\t\n                \n\n              3885SD\n            \n\n\n\n                  Item Weight\n                \n\n              1.83 pounds\n            \n\n\n\n                  Product Dimensions\n                \n\n              9 x 6 x 3.5 inches\n            \n\n\n\n                  Item model number\n                \n\n              3885SD\n            \n\n\n\n                  Item Package Quantity\n                \n\n              1\n            \n\n\n\n                  Number of Handles\n                \n\n              1\n            \n\n\n\n                  Batteries Included?\n                \n\n              No\n            \n\n\n\n                  Batteries Required?\n                \n\n              No\n            \n\n']

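I have been looking into different ways to get this data into a pandas dataframe, and this is as far as I have gotten. My question is: how do I put this data into a dataframe where the headings and values are laid out as in the following image?

[image: desired dataframe layout]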

For example:

import pandas as pd

html = """<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
 <tbody><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Part Number </th>
 <td class="a-size-base">3885SD</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">
 Item Weight</th><td class="a-size-base">1.83 pounds</td></tr>
 <tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Product Dimensions</th>
 <td class="a-size-base">9 x 6 x 3.5 inches</td>
 </tr><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item model number</th>
 <td class="a-size-base">3885SD</td></tr>
 <tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item Package Quantity
 </th><td class="a-size-base">1</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">Number of Handles
 </th><td class="a-size-base">1</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">Batteries Included?
 </th><td class="a-size-base">No</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">
  Batteries Required?</th><td class="a-size-base">No</td></tr></tbody></table>"""

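from io import StringIO

# Read the table; newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(html))[0]
cols = df[0]  # first column: headings
vals = df[1]  # second column: values

# Turn the values into a single row
table = pd.DataFrame(vals).T
# Use the headings as column names
table.columns = cols
print(table)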

Output:

0 Part Number  Item Weight  Product Dimensions Item model number Item Package Quantity Number of Handles Batteries Included? Batteries Required?
1      3885SD  1.83 pounds  9 x 6 x 3.5 inches            3885SD                     1                 1                  No                  No
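
The same reshaping step can also be written with set_index and .T. This is only a minimal sketch: it assumes a small hand-built two-column frame standing in for the result of pd.read_html, with three sample rows copied from the table above. The name wide is illustrative, not part of the answer.

```python
import pandas as pd

# Stand-in for pd.read_html(...)[0]: column 0 holds headings, column 1 holds values.
df = pd.DataFrame(
    [
        ["Part Number", "3885SD"],
        ["Item Weight", "1.83 pounds"],
        ["Batteries Included?", "No"],
    ]
)

# Make the headings the index, then transpose so they become the column names.
wide = df.set_index(0).T
wide.columns.name = None            # drop the leftover "0" label on the columns
wide = wide.reset_index(drop=True)  # renumber the single row from 0

print(wide)
```

This avoids building a second DataFrame by hand, and the result has no stray 0 / 1 labels in the header or index.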

Solution: create a function to parse the table:

def parse_table(table):
    """Return the text of every cell (th and td), one list per table row."""
    return [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]

Then use the function to create a new table, and convert that table into a pandas dataframe:

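import pandas as pd

new_table = parse_table(table)  # list of [heading, value] rows
df = pd.DataFrame(new_table)
df = df.T                       # headings become the first row
df.columns = df.iloc[0]         # promote that row to the column names
df = df[1:]                     # keep only the row of values
df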
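
parse_table works because it calls get_text() on each cell and walks every row. The attempt in the question looped over the th tags of the first tr only, and called get_text() on the whole table, which is why it returned one long string. A minimal sketch of the per-cell version follows, assuming a shortened two-row stand-in for the table. The names headings, values and record are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (two rows only).
html = """
<table id="productDetails_techSpec_section_1">
  <tr><th> Part Number </th><td> 3885SD </td></tr>
  <tr><th> Item Weight </th><td> 1.83 pounds </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "productDetails_techSpec_section_1"})

# Call get_text() on each cell, not on the table, and search the whole table
# rather than only the first <tr>.
headings = [th.get_text(strip=True) for th in table.find_all("th")]
values = [td.get_text(strip=True) for td in table.find_all("td")]

# Pair each heading with its value.
record = dict(zip(headings, values))
print(record)
```

A dictionary like record can be passed to pandas as a single row, for example pd.DataFrame([record]).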

You can use zip() to transpose the values from the table:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser') # data is your table from question

rows = []
for tr in soup.select('tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('th, td')])

rows = [*zip(*rows)]    # transpose values

for row in rows:
    print(''.join(r'{: <25}'.format(d) for d in row))

Prints:

Part Number              Item Weight              Product Dimensions       Item model number        Item Package Quantity    Number of Handles        Batteries Included?      Batteries Required?      
3885SD                   1.83 pounds              9 x 6 x 3.5 inches       3885SD                   1                        1                        No                       No                       
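
If a dataframe is wanted rather than printed text, the transposed rows can be handed to pandas directly. This is a rough sketch, assuming the rows list has already been scraped as above; here it is hard-coded with three sample rows from the question, and the names header and body are illustrative.

```python
import pandas as pd

# Hard-coded stand-in for the scraped [heading, value] rows.
rows = [
    ["Part Number", "3885SD"],
    ["Item Weight", "1.83 pounds"],
    ["Batteries Included?", "No"],
]

# zip(*rows) transposes: the first tuple holds the headings,
# the remaining tuple(s) hold the values.
header, *body = zip(*rows)

df = pd.DataFrame(body, columns=list(header))
print(df)
```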

