
Need to clean web scraped data using python

I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3 . The code I have written is below. It works and gives me my intended results.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)

print(df)

But the result includes some unwanted data, and I want only the data in the table. Can someone please help me with this?

Here I have added an image of the output with the unwanted data circled in red.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)  # this will give you a list of DataFrames parsed from the HTML

print(df[3])  # the gold-price table is the fourth table on the page

Use BeautifulSoup for this; the code below works perfectly:

import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")
data = s.find_all("td")
data = data[11:]  # skip the leading <td> cells that come before the price table
for i in range(0, len(data), 2):  # cells alternate: date, price
    print(data[i].text.strip(), "      ", data[i + 1].text.strip())

Another advantage of using BeautifulSoup is that it is much faster than your original code.
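If you want the scraped (date, price) pairs in a DataFrame rather than printed, you can pair the cells up the same way the loop above does. A minimal sketch, using an inline HTML snippet in place of the live page (the cell values and layout here are made up for illustration; the real page may differ):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline stand-in for the page's price table; real values will differ.
html = """
<table>
  <tr><td>Jan 2019</td><td>230,000</td></tr>
  <tr><td>Feb 2019</td><td>235,500</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = [td.text.strip() for td in soup.find_all("td")]

# Pair up cells two at a time (date, price), as in the loop above.
rows = [(cells[i], cells[i + 1]) for i in range(0, len(cells), 2)]
df = pd.DataFrame(rows, columns=["Month", "Price (LKR)"])
print(df)
```

For the live page you would replace `html` with `requests.get(url).text` and keep the `data[11:]` slice from the answer above.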

The way you used .read_html returns a list of all tables on the page. Your table is at index 3:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)[3]

print(df)
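If you are not sure which index holds the table you want, you can inspect each DataFrame's shape before picking one. A sketch on an inline HTML snippet (the tables here are invented stand-ins for the page's navigation and data tables):

```python
from io import StringIO
import pandas as pd

# Two stand-in tables: a small "navigation" table and the data table.
html = """
<table><tr><td>nav</td></tr></table>
<table>
  <tr><td>Jan 2019</td><td>230000</td></tr>
  <tr><td>Feb 2019</td><td>235500</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))
for i, t in enumerate(tables):
    print(i, t.shape)  # the largest table is usually the data you want
```

On the live page, the same loop over `pd.read_html(url)` shows why index 3 is the right one.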

.read_html makes a call to the URL and uses BeautifulSoup to parse the response under the hood. You can change the parser, match on the table's name, and pass header just as you would with .read_csv. Check the .read_html documentation for more details.
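For instance, .read_html also accepts parameters such as match (a regex that a table's text must contain) and header (which row to use as column names). A sketch on an inline HTML string (the table contents are made up; StringIO is used because recent pandas versions deprecate passing a literal HTML string):

```python
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>Date</th><th>Price</th></tr>
  <tr><td>2019-01</td><td>230000</td></tr>
</table>
<table>
  <tr><th>Other</th></tr><tr><td>noise</td></tr>
</table>
"""

# match keeps only tables whose text matches the regex;
# header=0 uses the first row as column names.
tables = pd.read_html(StringIO(html), match="Price", header=0)
df = tables[0]
print(df.columns.tolist())  # ['Date', 'Price']
```

On the live page, a well-chosen match string can select the price table directly instead of hard-coding index 3.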

For speed, you can use lxml, eg pd.read_html(url, flavor='lxml')[3] . When no flavor is given, pandas tries lxml first and falls back to the slower bs4/html5lib combination. (The html.parser used in the BeautifulSoup answer above is slower still.)
