Need to clean web scraped data using Python
I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3. The code I have written is below. It runs and returns the data I am after.
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)
print(df)
However, the result also contains some unwanted data, and I want only the data in the table. Can someone please help me with this?
I have added an image of the output with the unwanted data circled in red.
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)  # returns a list of DataFrames, one per HTML table
print(df[3])
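If the index of the wanted table is not known in advance, one option is to look at the size of each DataFrame in the list. The following is a minimal sketch, assuming the price table is the one with the most rows; the helper name `largest_table` and the sample frames are hypothetical and stand in for the list that `pd.read_html(url)` returns.

```python
import pandas as pd


def largest_table(tables):
    """Return (index, DataFrame) for the table with the most rows."""
    # enumerate keeps the position, so it can be reused as pd.read_html(url)[index]
    index, table = max(enumerate(tables), key=lambda pair: len(pair[1]))
    return index, table


# Stand-in for the list returned by pd.read_html(url); the values are made up.
tables = [
    pd.DataFrame({"a": [1]}),
    pd.DataFrame({"Date": ["d1", "d2", "d3"], "Price": [10.0, 11.0, 12.0]}),
    pd.DataFrame({"b": [1, 2]}),
]

index, table = largest_table(tables)
print(index, table.shape)
```

On the real page the heuristic may need adjusting, for example by checking column names instead of row counts.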
Use BeautifulSoup for this; the code below works:
import requests
from bs4 import BeautifulSoup
url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")
data = s.find_all("td")
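data = data[11:]  # drop the first 11 cells, which are not part of the price table
for i in range(0, len(data) - 1, 2):
    # cells alternate, so print them in pairs
    print(data[i].text.strip(), " ", data[i+1].text.strip())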
Another advantage of using BeautifulSoup is that it is much faster than your code.
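The fixed slice `data[11:]` depends on how many cells happen to precede the table. A possibly more robust variant walks the rows and keeps only those with exactly two data cells. This is a sketch under assumptions: the inline HTML is made up and stands in for `r.text`, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for r.text; the real page layout may differ.
html = """
<table><tr><td>menu</td></tr></table>
<table>
  <tr><th>Date</th><th>Price</th></tr>
  <tr><td>01-01-2020</td><td>100.5</td></tr>
  <tr><td>02-01-2020</td><td>101.0</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only rows with exactly two data cells (assumed: date, price).
    if len(cells) == 2:
        rows.append(tuple(cells))

for date, price in rows:
    print(date, " ", price)
```

Header rows built from `th` cells and one-cell layout rows are skipped without counting cells by hand.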
Used this way, .read_html returns a list of all the tables on the page. Your table is at index 3:
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)[3]
print(df)
.read_html makes a call to the URL and parses the response with an HTML parsing library under the hood. You can change the parser, select tables with match or attrs, and pass options such as header as you would in .read_csv. Check the .read_html documentation for more details.
For speed, you can use lxml, e.g. pd.read_html(url, flavor='lxml')[3]. When no flavor is given, pandas tries lxml first and, if that fails, falls back to BeautifulSoup with html5lib, which is slower. html.parser is a parser option of BeautifulSoup itself, as in BeautifulSoup(r.text, "html.parser"), not a flavor accepted by .read_html.
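As an illustration of selecting a table by its content rather than by position, here is a small sketch using the `match` and `header` parameters of `pd.read_html`. The inline HTML and the header text "Price" are assumptions made for the example, not taken from the page, and one of lxml or BeautifulSoup with html5lib must be installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for the downloaded page.
html = """
<table><tr><td>menu</td></tr></table>
<table>
  <tr><th>Date</th><th>Price</th></tr>
  <tr><td>01-01-2020</td><td>100.5</td></tr>
  <tr><td>02-01-2020</td><td>101.0</td></tr>
</table>
"""

# match keeps only tables whose text matches the given string or regex;
# header=0 uses the first row as column names.
# flavor="lxml" could be added here if lxml is installed.
tables = pd.read_html(StringIO(html), match="Price", header=0)
df = tables[0]
print(df)
```

With `match`, the code keeps working if the number of tables before the target changes, as long as the matched text stays on the page.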