Directly Scraping HTML table using beautifulsoup?
Is there any direct way to scrape an HTML table? It would be great if I could just give the class of the HTML table and get the results back.
For example, I need to get the table from this URL.
I can use this procedure, but I need a cleaner or more direct solution.
Well, then try this:
import requests
import pandas as pd

url = "https://buchholz-stadtwerke.de/wasseranalyse.html"

# read_html returns a list of DataFrames, one per <table> on the page
df = pd.read_html(requests.get(url).text, flavor="bs4")

# stack all tables into a single DataFrame and save it
df = pd.concat(df)
df.to_csv("data.csv", index=False)
print(df)
Output:
[ Parameter Einheit Grenzwert Messwert, Februar 2020
0 Wassertemperatur °C NaN 98
1 Leitfähigkeit (25°) µS/cm 2790 302
2 Sauerstoff (elektrochem.) mg/l NaN 109
3 pH-Wert NaN 6,5 bis 9,5 806
4 Sättigungsindex NaN NaN 001
5 Karbonathärte (dH°) °dH NaN 454
6 Gesamthärte (dH°) °dH NaN 645
7 Härtebereich NaN NaN weich
8 Calcitlösekapazität mg/l 5 -01
and so on...
Also, this writes out a .csv file with the data from the table.
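Since the question asks about selecting a table by its class, note that `pd.read_html` accepts an `attrs` parameter that filters which tables are parsed. A minimal sketch using an inline HTML sample — the class name `analysis` and the table contents here are made up for illustration; substitute the real class from the target page:

```python
import io
import pandas as pd

# Made-up HTML with two tables; only the one with class="analysis" is wanted
html = """
<table class="analysis">
  <tr><th>Parameter</th><th>Messwert</th></tr>
  <tr><td>pH-Wert</td><td>7.5</td></tr>
</table>
<table class="other">
  <tr><th>Ignored</th></tr>
  <tr><td>1</td></tr>
</table>
"""

# attrs={"class": ...} restricts read_html to tables matching that attribute
tables = pd.read_html(io.StringIO(html), attrs={"class": "analysis"})
print(len(tables))
print(tables[0])
```

With a live page, you would pass `requests.get(url).text` (wrapped in `io.StringIO`) instead of the inline sample.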
EDIT:
This sort of feels like a hack, but it works. Based on the comment and the URL, you can loop over the tables from the df and split them up into separate files.
import requests
import pandas as pd

url = "https://www.swd-ag.de/energie-wasser/wasser/trinkwasseranalyse/"

# read_html returns one DataFrame per <table> on the page
df = pd.read_html(io=requests.get(url).text, flavor="bs4")

# write each table to its own numbered .csv file
for index, table in enumerate(df, start=1):
    table.to_csv(f"table_{index}.csv", index=False)
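If you want to use BeautifulSoup directly (as the question title asks) rather than going through pandas, a minimal sketch of pulling a table by class looks like this — the class name `analysis` and the inline HTML are again made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for requests.get(url).text
html = """
<table class="analysis">
  <tr><th>Parameter</th><th>Messwert</th></tr>
  <tr><td>pH-Wert</td><td>7.5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find the first table with the given class, then collect its cell text row by row
table = soup.find("table", class_="analysis")
rows = []
for tr in table.find_all("tr"):
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])
print(rows)
```

This gives you plain lists of strings, which is handy when you need more control over the parsing than `read_html` offers.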