
How to modify pandas's read_html user-agent?

I'm trying to scrape English football stats from various html tables on the Transfermarkt website using the pandas.read_html() function.

Example:

import pandas as pd
url = r'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
df = pd.read_html(url)

However, this code raises a "ValueError: Invalid URL" error.

I then attempted to fetch the same page using the urllib2.urlopen() function. This time I got "HTTPError: HTTP Error 404: Not Found". After the usual trial-and-error fault finding, it turns out that urllib2's default header presents a Python-like user agent to the web server, which I presume it doesn't recognize.

Now, if I modify urllib2's user agent and read the page's contents with BeautifulSoup, I can read the table without a problem.

Example:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # present a browser-like user agent
url = 'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
response = opener.open(url)
html = response.read()
soup = BeautifulSoup(html)
table = soup.find("table")
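(For what it's worth, on Python 3, where urllib2 no longer exists, the same user-agent trick can be sketched with the stdlib's urllib.request; this doesn't change pandas itself, only the fetch step:)

```python
from urllib.request import Request, urlopen

# Python 3 equivalent of the urllib2 snippet above: the build_opener /
# addheaders dance becomes a Request object carrying a headers dict.
url = 'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# urlopen(req).read() would then fetch the page with the spoofed agent
```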

How do I modify the header pandas passes to urllib2 so that Python can scrape this website?

Thanks

Currently you cannot. Relevant piece of code:

if _is_url(io): # io is the url
    try:
        with urlopen(io) as url:
            raw_text = url.read()
    except urllib2.URLError:
        raise ValueError('Invalid URL: "{0}"'.format(io))

As you can see, it just passes the url to urlopen and reads the data. You can file an issue requesting this feature, but I assume you don't have time to wait for it to be solved, so I would suggest fetching the html yourself with a custom user agent and then loading it into a DataFrame.

import pandas as pd
import urllib2

url = 'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
tables = pd.read_html(response.read(), attrs={"class": "tabelle_grafik"})[0]

Or, if you can use requests:

import requests
import pandas as pd

url = 'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
tables = pd.read_html(requests.get(url,
                                   headers={'User-agent': 'Mozilla/5.0'}).text,
                      attrs={"class": "tabelle_grafik"})[0]
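Either way, once the html is in hand as a string, read_html parses it directly; here is a minimal offline sketch (the table below is invented stand-in data, and newer pandas versions want literal html wrapped in StringIO):

```python
from io import StringIO
import pandas as pd

# Stand-in for the fetched page body; in practice this string would come
# from the urllib2/requests response shown above.
html = """
<table class="tabelle_grafik">
  <tr><th>Club</th><th>Goals conceded</th></tr>
  <tr><td>Arsenal</td><td>37</td></tr>
  <tr><td>Chelsea</td><td>39</td></tr>
</table>
"""
# attrs narrows the search to tables with the given class attribute
df = pd.read_html(StringIO(html), attrs={"class": "tabelle_grafik"})[0]
```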
