pandas read_html ValueError: No tables found

I am trying to scrape the historical weather data from the Weather Underground page "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html". I have the following code:

import pandas as pd 

page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)

I get the following traceback:

Traceback (most recent call last):
 File "weather_station_scrapping.py", line 11, in <module>
  result = pd.read_html(page_link)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
  displayed_only=displayed_only)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse raise_with_traceback(retained)
 File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
  raise exc.with_traceback(traceback)
ValueError: No tables found

Although this page clearly has a table, it is not being picked up by read_html. I have tried using Selenium so that the page can load fully before I read it:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")

head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')

list_rows = []

for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()

Now the problem is that it cannot find "tr". I would appreciate any suggestions.

You can use requests and avoid opening a browser.

You can get the current conditions by using:

https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

and stripping 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right, then handling the JSON string.

You can get the summary and history by calling the API with the following:

https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end. You then have a JSON string you can parse; a sketch for this endpoint follows the current-conditions example below.


You can find these URLs by using the F12 dev tools in your browser and inspecting the network tab for the traffic created during page load.

An example for current, noting there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# Strip the JSONP callback wrapper to leave a bare JSON string
s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
s = s.replace('null', '"placeholder"')  # crude workaround for the bare nulls in the response
data = json.loads(s)
data = json_normalize(data)  # flatten the nested JSON into tabular form
df = pd.DataFrame(data)
print(df)
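Following the same pattern, here is a minimal sketch for the history endpoint. The API key and callback token baked into the URL are simply the ones captured from the page's network traffic above, so treat them as assumptions that may expire; stripping around the first '(' and last ')' is a slightly more general alternative to hard-coding the callback name:

import requests
import json

# History endpoint captured from the network tab; the embedded API key and
# callback token come from that captured request and may stop working.
url = ('https://api-ak.wunderground.com/api/606f3f6977348613/'
       'history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json'
       '?callback=jQuery1720724027235122559_1542743885015&_=1542743886276')
res = requests.get(url)

# Strip the JSONP wrapper: everything before the first '(' and after the last ')'
body = res.text
body = body[body.index('(') + 1:body.rindex(')')]
body = body.replace('null', '"placeholder"')  # same crude null workaround as above

data = json.loads(body)
print(list(data.keys()))  # inspect the top-level structure before drilling in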

Here's a solution using Selenium for browser automation:

from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(chromedriver)  # chromedriver: path to your ChromeDriver executable
driver.implicitly_wait(30)  # give the JS-rendered table time to appear

driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]

Time    Temperature Dew Point   Humidity    Wind    Speed   Gust    Pressure  Precip. Rate. Precip. Accum.  UV  Solar
0   12:02 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m²
1   12:07 AM    25.5 °C 19 °C   76 %    East    0 kph   0 kph   29.31 hPa   0 mm    0 mm    0   0 w/m²
2   12:12 AM    25.5 °C 19 °C   76 %    East    0 kph   0 kph   29.31 hPa   0 mm    0 mm    0   0 w/m²
3   12:17 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m²
4   12:22 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m²

Editing with a breakdown of exactly what's happening, since the one-liner above is not very good self-documenting code:

After setting up the driver, we select the table by its ID value (thankfully this site actually uses reasonable and descriptive IDs):

tab = driver.find_element_by_id("history_table")

Then, from that element, we get the HTML instead of the web driver element object:

tab_html = tab.get_attribute('outerHTML')

We use pandas to parse the HTML:

tab_dfs = pd.read_html(tab_html)

From the docs:

"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content" “即使HTML内容中仅包含一个表,read_html也会返回DataFrame对象的列表”

So we index into that list with the only table we have, at index zero:

df = tab_dfs[0]
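Putting the pieces together, here is a minimal end-to-end sketch of the same approach (it assumes chromedriver is on your PATH, which is why webdriver.Chrome() is called with no arguments, and it closes the browser when done):

from selenium import webdriver
import pandas as pd

url = ('https://www.wunderground.com/personal-weather-station/dashboard'
       '?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.implicitly_wait(30)   # give the JS-rendered table time to appear
try:
    driver.get(url)
    tab = driver.find_element_by_id("history_table")  # locate the table element
    tab_html = tab.get_attribute('outerHTML')         # raw HTML of that table
    df = pd.read_html(tab_html)[0]                    # parse; take the only table
finally:
    driver.quit()  # always release the browser, even on failure

print(df.head())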
