pandas read_html ValueError: No tables found

I am trying to scrape historical weather data from the Weather Underground page "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html". I have the following code:

import pandas as pd 

page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)

I get the following error:

Traceback (most recent call last):
 File "weather_station_scrapping.py", line 11, in <module>
  result = pd.read_html(page_link)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
  raise_with_traceback(retained)
 File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
  raise exc.with_traceback(traceback)
ValueError: No tables found

This page clearly has a table, but read_html does not pick it up. I have tried using Selenium so that the page can be loaded before I read it.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get(page_link)  # page_link as defined in the first snippet
elem = driver.find_element_by_id("history_table")

head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')

list_rows = []

for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)

Now the problem is that it cannot find "tr". I would appreciate any suggestions.

You can use requests and avoid opening a browser.

You can get the current conditions by using:

https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

Strip 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right, then parse the remaining JSON string.
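The unwrapping step can be sketched with the standard library alone. The helper name `strip_jsonp` and the sample payload are made up for illustration; only the callback name is taken from the URL above. The sketch keeps whatever sits between the first `(` and the last `)`, so it does not depend on the exact callback name or on a trailing semicolon.

```python
import json


def strip_jsonp(text):
    """Return the JSON text inside a JSONP wrapper such as callback({...});"""
    start = text.index("(") + 1   # first character after the opening parenthesis
    end = text.rindex(")")        # position of the closing parenthesis
    return text[start:end]


# Made-up sample shaped like a JSONP response; this is not real station data.
sample = (
    'jQuery1720724027235122559_1542743885014('
    '{"stations": {"KMAHADLE7": {"temperature": null}}})'
)
payload = json.loads(strip_jsonp(sample))
print(payload["stations"]["KMAHADLE7"])
```

With a live response, `sample` would be replaced by `requests.get(url).text`.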

You can get the summary and history by calling the API with the following URL:

https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end. You then have a JSON string you can parse.
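If you want other dates or stations, the URL can be assembled from its parts. This is only a sketch that copies the pattern of the URL above: the helper name `history_url` is hypothetical, the date segment is assumed to be YYYYMMDD (it matches the 20170201 in the page address), the literal `null` after the date is kept verbatim because its meaning is not known, and the `_` parameter is left out on the assumption that it is an optional cache-busting timestamp. Whether the key in the URL still works is also unknown.

```python
from datetime import date

# Values copied from the URL quoted above.
API_KEY = "606f3f6977348613"
BASE = "https://api-ak.wunderground.com/api"


def history_url(station_id, day, callback="cb"):
    """Build a history URL following the pattern of the one quoted above."""
    stamp = day.strftime("%Y%m%d")  # assumed YYYYMMDD, matching 20170201
    return (
        f"{BASE}/{API_KEY}/history_{stamp}null/units:both/v:2.0"
        f"/q/pws:{station_id}.json?callback={callback}"
    )


print(history_url("KMAHADLE7", date(2017, 2, 1)))
```

The response would still need its JSONP wrapper removed before parsing.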

Sample of JSON: [image not included]

You can find these URLs by opening the F12 dev tools in your browser and inspecting the Network tab for the traffic created during page load.

An example for the current conditions. There seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":

import json

import pandas as pd
import requests

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)

# The response is JSONP: callback({...}). Keep only the text between the
# first '(' and the last ')'. str.strip() removes a set of characters,
# not a prefix, so slicing is safer here.
text = res.text
s = text[text.index('(') + 1:text.rindex(')')]

# Replace nulls with a placeholder, as described above.
s = s.replace('null', '"placeholder"')

data = json.loads(s)
df = pd.json_normalize(data)  # flatten the nested JSON into columns
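To make the parsing and flattening steps less opaque, here is a rough pure-Python sketch of what they do to the data. `json.loads` maps JSON `null` to Python `None`, and flattening turns nested objects into dotted names, which is roughly what `json_normalize` does for nested dictionaries with its default separator. The `flatten` helper and the sample fields are hypothetical and do not reflect the real response schema.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested objects
        else:
            flat[name] = value                 # scalars and lists are kept as-is
    return flat


# Made-up sample; the field names are illustrative only.
sample = '{"station": {"id": "KMAHADLE7", "temp": {"c": 25.5, "f": null}}}'
row = flatten(json.loads(sample))
print(row)
```

Since `null` parses to `None` here, the placeholder substitution may turn out to be unnecessary once the wrapper is removed cleanly, though that has not been checked against the live response.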

Here's a solution using Selenium for browser automation:

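from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome(chromedriver)  # chromedriver: path to the ChromeDriver executable
driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]
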
   Time      Temperature  Dew Point  Humidity  Wind  Speed  Gust   Pressure   Precip. Rate.  Precip. Accum.  UV  Solar
0  12:02 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²
1  12:07 AM  25.5 °C      19 °C      76 %      East  0 kph  0 kph  29.31 hPa  0 mm           0 mm            0   0 w/m²
2  12:12 AM  25.5 °C      19 °C      76 %      East  0 kph  0 kph  29.31 hPa  0 mm           0 mm            0   0 w/m²
3  12:17 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²
4  12:22 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²

Editing to add a breakdown of exactly what's happening, since the above one-liner is not very self-documenting:

After setting up the driver, we select the table by its ID (thankfully, this site uses reasonable, descriptive IDs).


Then, from that element, we get the HTML rather than the WebDriver element object.


We use pandas to parse the HTML.


From the docs:

"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"

So we index into that list to get the only table we have, at index zero.
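The steps above can be sketched as separate functions. This is not the answer's original code, just one way to lay it out under a few assumptions: it uses the `find_element("id", ...)` form rather than `find_element_by_id`, which newer Selenium releases no longer provide; it adds an explicit wait on the assumption that the table is rendered by JavaScript after the initial page load; and it calls `webdriver.Chrome()` without arguments, which assumes Selenium can locate a driver on its own. The function names are made up, and `history_table` is the ID used in the question.

```python
import io


def table_outer_html(driver, table_id):
    """Locate the table by its ID and return its HTML as a string."""
    element = driver.find_element("id", table_id)  # "id" is the value of selenium's By.ID
    return element.get_attribute("outerHTML")


def first_table(html):
    """Parse the HTML with pandas and return the only table in it."""
    import pandas as pd  # imported here so the helper above has no pandas dependency

    # read_html always returns a list, so take element zero.
    return pd.read_html(io.StringIO(html))[0]


def scrape_history(url, table_id="history_table", timeout=30):
    """Open the page, wait for the table to appear, and return a DataFrame."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until the table element exists in the DOM; this does not
        # guarantee that every row has been filled in yet.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, table_id))
        )
        return first_table(table_outer_html(driver, table_id))
    finally:
        driver.quit()
```

Calling `scrape_history(page_link)` with the dashboard URL from the question would then return the DataFrame, provided a browser and driver are available.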

