
HTML Parsing with Python (HTML vs. complete website)

Posted 2017-04-09 01:40:26. Tags: python, html, parsing, urllib

I'm trying to parse HTML from a website that contains information about train tickets and their prices (source below). However, when I use urllib to request the page, I don't get back all of the HTML.

What I need is the price per ticket, which doesn't appear in the HTML I get back from urllib. After some investigation, I found that if I save the page in Chrome and select "HTML only", I don't get the price, but if I select "Complete Webpage", I do. Is there any way to get the HTML I see when I download the "Complete Webpage" and use that in Python? Or is there a way to automate downloading the complete webpage and then parse the downloaded files in Python?

Thanks, George

https://www.raileurope.com/en/us/point_to_point/ptp_results.htm?execution=e3s1&resultId=147840746&cobrand=public&saleCountry=us&resultId=147840746&cobrand=public&saleCountry=us&itemId=-1&fn=fsRequest&cobrand=public&c=USD&roundtrip=0&isAtocRequest=0&georequest=1&lang=en&route-type=0&from0=paris&to0=amsterdam&deptDate0=06%2F07%2F2017&time0=8&pass-question-radio=1&nCountries=&selCountry1=&selCountry2=&selCountry3=&selCountry4=&selCountry5=&familyId=&p=0&additionalTraveler0=adult&additionalTravelerAge0=&paxIds=&nA=1&nY=0&nC=0&nS=0

1 answer

Take a look at Selenium.
Since the website is rendered by JavaScript, you will have to use a WebDriver to load the page in a real browser and simulate the "click".
You will need a crawler instead of a simple scraper.
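
For illustration, here is a minimal sketch of that approach, not a tested solution for this particular site. It assumes Selenium and a matching Chrome driver are installed. The CSS class `price`, the helper names `fetch_rendered_html` and `extract_prices`, and the sample HTML are all hypothetical; inspect the rendered page in the browser's developer tools to find the real selector. The idea is to let a headless browser run the page's scripts, take `driver.page_source` (the DOM after rendering, which is roughly what the question gets from "Complete Webpage"), and parse that instead of the raw urllib response.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, css_selector=".price", wait_seconds=15):
    """Load `url` in headless Chrome and return the HTML after scripts have run.

    `css_selector` is an assumed selector for the price elements.
    """
    # Imported here so the parsing helper below also works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one matching element.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


class PriceParser(HTMLParser):
    """Collect the text of elements whose class list contains `css_class`.

    Assumes each price element holds plain text with no nested tags.
    """

    def __init__(self, css_class="price"):
        super().__init__()
        self.css_class = css_class
        self.prices = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._capturing = self.css_class in classes

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.prices.append(data.strip())


def extract_prices(html, css_class="price"):
    """Return the text of every element carrying `css_class`."""
    parser = PriceParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.prices


if __name__ == "__main__":
    # Made-up markup standing in for the rendered page.
    sample = (
        '<div class="fare"><span class="price">$135.00</span>'
        '<span class="label">Adult</span>'
        '<span class="price big">$98.50</span></div>'
    )
    print(extract_prices(sample))  # ['$135.00', '$98.50']
```

Against the live site you would call `extract_prices(fetch_rendered_html(url))`. Note that the URL in the question carries what look like session-specific parameters (`execution`, `resultId`), so it may need to be regenerated by driving the search form first.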