How to extract multiple tables from HTML in Python

I want to extract all the data from the security bulletin tables at https://helpx.adobe.com/security/products/dreamweaver/apsb21-13.html. With my code, I can only extract the tables one at a time; it cannot extract the data from all the tables on the page at once.

This is my code:

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumption: the page is fetched with requests; the original post did not show this step.
url = "https://helpx.adobe.com/security/products/dreamweaver/apsb21-13.html"
html_content = requests.get(url).text

soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify())
gdp = soup.find_all("table")

table = gdp[0]  # only the first table on the page is processed
body = table.find_all("tr")
head = body[0]
body_rows = body[1:]

headings = []
for item in head.find_all("td"):
    item = (item.text).rstrip("\n")
    headings.append(item)

all_rows = []  # will be a list of lists, one per row
for row_num in range(len(body_rows)):  # a row at a time
    row = []  # this will hold the entries for one row
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        row.append(aa)
    all_rows.append(row)

df = pd.DataFrame(data=all_rows, columns=headings)
df.to_csv('C:/Users//AdobeAir-APSB16-23 Security Update Available for Adobe AIR.csv')
df.head()

The output of the code is:

Bulletin ID Date Published  Priority
0   APSB21-13   February 09 2021    3

For this code, I imported the libraries BeautifulSoup, requests, pandas and re. I hope someone can show me how to extract the data from all the tables at once and convert it to CSV format. Thank you.

You can make pandas do the heavy lifting for you with read_html:

import pandas as pd

url = 'https://helpx.adobe.com/security/products/dreamweaver/apsb21-13.html'
dfs = pd.read_html(url, header=0)  # header=0: use the first row of each table as column names
dfs[1]

Output:

             Product  Affected Versions           Platform
0  Adobe Dreamweaver               20.2  Windows and macOS
1  Adobe Dreamweaver               21.0  Windows and macOS

P.S. read_html returns a list of all the tables found in the HTML. For example, dfs[0] is the first table:

  Bulletin ID     Date Published  Priority
0   APSB21-13  February 09, 2021         3
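
Since the question also asks for CSV output, here is a minimal sketch of one way to write every table returned by read_html to its own file. The helper name tables_to_csv, its out_dir and prefix parameters, and the inline SAMPLE_HTML are hypothetical and only there for illustration; the sample mimics the two tables shown above so the snippet runs without network access. It assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the real page: two tables whose header cells
# are plain <td> elements, like the tables in the outputs above.
SAMPLE_HTML = """
<table>
  <tr><td>Bulletin ID</td><td>Date Published</td><td>Priority</td></tr>
  <tr><td>APSB21-13</td><td>February 09, 2021</td><td>3</td></tr>
</table>
<table>
  <tr><td>Product</td><td>Affected Versions</td><td>Platform</td></tr>
  <tr><td>Adobe Dreamweaver</td><td>20.2</td><td>Windows and macOS</td></tr>
  <tr><td>Adobe Dreamweaver</td><td>21.0</td><td>Windows and macOS</td></tr>
</table>
"""


def tables_to_csv(html, out_dir, prefix="table"):
    """Parse every <table> in `html` and write each one to its own CSV file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # header=0 promotes the first row of each table to column names.
    dfs = pd.read_html(StringIO(html), header=0)

    paths = []
    for i, df in enumerate(dfs):
        path = out_dir / f"{prefix}_{i}.csv"
        df.to_csv(path, index=False)  # index=False drops the 0, 1, ... row labels
        paths.append(path)
    return dfs, paths
```

Calling tables_to_csv(SAMPLE_HTML, "adobe_tables") would write table_0.csv and table_1.csv into an adobe_tables folder. For the live page, the same loop should work on the list returned by pd.read_html(url, header=0).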
