Getting HTML source code from a website using Python and lxml

Question

I am a Python beginner trying to write a program in Python 2.7 that retrieves the betting odds from the following websites.

English version: http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=24-09-2015&venue=hv&raceno=1&lang=en

Chinese version: bet.hkjc.com/racing/pages/odds_wp.aspx?date=24-09-2015&venue=hv&raceno=1

The data I want to retrieve is marked in the following image: https://na.cx/i/Bz873x.jpg

The program works well on other websites (e.g. reddit or lxml.de/parsing.html), but I don't know why it retrieves HTML that differs from what I get using Chrome.

from urllib2 import urlopen
from lxml import etree

# Print the source code of the website.
# This works properly on other websites (e.g. reddit.com or lxml.de/parsing.html)
# but has a problem on the betting website.
url = 'http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=24-09-2015&venue=hv&raceno=1'
tree = etree.HTML(urlopen(url).read())
print(etree.tostring(tree, pretty_print=True))

# Print the first horse name on the Chinese version of the website (doesn't work)
horse_name = tree.xpath('//*[@id="detailWPTable"]/table/tbody/tr[2]/td[3]/a/span/text()')
print(horse_name)

After running the program above, I found that the HTML retrieved by Python differs from the HTML I get in Chrome via [View Source] or [Open Developer Tools].

My question is:

How can I get the correct HTML (the same as Chrome's View Source) using Python?

Thanks :)

Answer 1

It is probably because your User-Agent is set differently and because some scripts on the page are not executed. You can set the User-Agent in the HTTP request headers, but most importantly you need to render the web page using a headless browser.

A good example of such a framework that works with Python is Selenium.
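The two suggestions above could be sketched roughly as follows. This is a minimal sketch, not a tested solution for this particular site: it is written for Python 3 (`urllib.request` instead of `urllib2`), the User-Agent string and the helper names `build_request`, `fetch_raw_html`, and `fetch_rendered_html` are made up for illustration, and the Selenium part assumes the `selenium` package plus Chrome and a matching driver are installed. The element id `detailWPTable` is taken from the XPath in the question.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string; any browser-like value could be substituted.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a urllib Request that sends a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_raw_html(url):
    """Fetch the HTML exactly as the server sends it (no JavaScript is run)."""
    with urlopen(build_request(url), timeout=30) as response:
        return response.read()


def fetch_rendered_html(url, wait_for_id="detailWPTable", timeout=15):
    """Load the page in headless Chrome and return the HTML after scripts ran.

    Assumes the selenium package and a Chrome/chromedriver install are available.
    """
    # Imported here so the urllib helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element from the question's XPath exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, wait_for_id))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The string returned by `fetch_rendered_html(url)` can then be passed to `etree.HTML(...)` and queried with XPath as in the question. If the odds table is filled in by JavaScript, `fetch_raw_html` alone would likely still return the incomplete markup, which is why the headless browser is the more important of the two steps.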
