
Need to clean web scraped data using python

I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3 . The code I have written is below. It works and gives me my intended results.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)

print(df)

But the result includes some unwanted data, and I want only the data in the table. Can someone please help me with this?

Here I have added an image of the output with the unwanted data circled in red.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)  # this will give you a list of DataFrames parsed from the HTML

print(df[3])  # the gold-price table is the fourth table on the page

Use BeautifulSoup for this; the code below works perfectly:

import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")
data = s.find_all("td")
data = data[11:]  # skip the leading <td> cells that come before the price table
for i in range(0, len(data), 2):  # cells alternate: date, price
    print(data[i].text.strip(), "      ", data[i + 1].text.strip())

Another advantage of using BeautifulSoup is that it is much faster than your original code.
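If you want the scraped (date, price) pairs in a DataFrame rather than printed, you can pair the cells up the same way the loop above does. A minimal sketch, using an inline HTML snippet in place of the live page (the cell values and layout here are made up for illustration; the real page may differ):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline stand-in for the page's price table; real values will differ.
html = """
<table>
  <tr><td>Jan 2019</td><td>230,000</td></tr>
  <tr><td>Feb 2019</td><td>235,500</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = [td.text.strip() for td in soup.find_all("td")]

# Pair up cells two at a time (date, price), as in the loop above.
rows = [(cells[i], cells[i + 1]) for i in range(0, len(cells), 2)]
df = pd.DataFrame(rows, columns=["Month", "Price (LKR)"])
print(df)
```

For the live page you would replace `html` with `requests.get(url).text` and keep the `data[11:]` slice from the answer above.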

The way you used .read_html returns a list of all tables on the page. Your table is at index 3:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)[3]

print(df)
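If you are not sure which index holds the table you want, you can inspect each DataFrame's shape before picking one. A sketch on an inline HTML snippet (the tables here are invented stand-ins for the page's navigation and data tables):

```python
from io import StringIO
import pandas as pd

# Two stand-in tables: a small "navigation" table and the data table.
html = """
<table><tr><td>nav</td></tr></table>
<table>
  <tr><td>Jan 2019</td><td>230000</td></tr>
  <tr><td>Feb 2019</td><td>235500</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))
for i, t in enumerate(tables):
    print(i, t.shape)  # the largest table is usually the data you want
```

On the live page, the same loop over `pd.read_html(url)` shows why index 3 is the right one.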

.read_html makes a call to the URL and uses BeautifulSoup to parse the response under the hood. You can change the parser, match on the table's name, and pass header just as you would with .read_csv. Check the .read_html documentation for more details.
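For instance, .read_html also accepts parameters such as match (a regex that a table's text must contain) and header (which row to use as column names). A sketch on an inline HTML string (the table contents are made up; StringIO is used because recent pandas versions deprecate passing a literal HTML string):

```python
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>Date</th><th>Price</th></tr>
  <tr><td>2019-01</td><td>230000</td></tr>
</table>
<table>
  <tr><th>Other</th></tr><tr><td>noise</td></tr>
</table>
"""

# match keeps only tables whose text matches the regex;
# header=0 uses the first row as column names.
tables = pd.read_html(StringIO(html), match="Price", header=0)
df = tables[0]
print(df.columns.tolist())  # ['Date', 'Price']
```

On the live page, a well-chosen match string can select the price table directly instead of hard-coding index 3.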

For speed, you can use lxml, eg pd.read_html(url, flavor='lxml')[3] . When no flavor is given, pandas tries lxml first and falls back to the slower bs4/html5lib combination. (The html.parser used in the BeautifulSoup answer above is slower still.)
