HTML Parsing with Python (HTML vs. complete website)

I'm trying to parse HTML from a website that lists train tickets and their prices (URL below), but when I request the page with urllib I don't get back all of the HTML.

What I need is the price per ticket, which doesn't appear in the HTML that urllib returns. After some investigation, I found that if I save the page in Chrome as "HTML only" the price is missing, but if I save it as "Complete Webpage" the price is there. Is there any way to get the HTML that the "Complete Webpage" option saves and use it in Python? Or is there a way to automate downloading the complete webpage and then parse the downloaded files in Python?

Thanks, George

https://www.raileurope.com/en/us/point_to_point/ptp_results.htm?execution=e3s1&resultId=147840746&cobrand=public&saleCountry=us&resultId=147840746&cobrand=public&saleCountry=us&itemId=-1&fn=fsRequest&cobrand=public&c=USD&roundtrip=0&isAtocRequest=0&georequest=1&lang=en&route-type=0&from0=paris&to0=amsterdam&deptDate0=06%2F07%2F2017&time0=8&pass-question-radio=1&nCountries=&selCountry1=&selCountry2=&selCountry3=&selCountry4=&selCountry5=&familyId=&p=0&additionalTraveler0=adult&additionalTravelerAge0=&paxIds=&nA=1&nY=0&nC=0&nS=0

Take a look at Selenium.
Since the website is rendered by JS, the prices are not in the HTML that urllib receives; you will have to use a WebDriver to load the page in a real browser and simulate the "click".
You will need a crawler instead of a simple scraper.
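
To make the suggestion concrete, here is a minimal sketch of that approach, not a tested solution for this particular site. It assumes Selenium 4 and a local Chrome install. The class name `price` (`PRICE_CLASS`) and the helper names `fetch_rendered_html` and `extract_prices` are made up for illustration; inspect the rendered page in the browser's developer tools to find the real selector.

```python
from html.parser import HTMLParser

PRICE_CLASS = "price"  # hypothetical class name, check the real page
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


def fetch_rendered_html(url, wait_seconds=10):
    """Load `url` in Chrome and return the DOM after JavaScript has run."""
    # Imported here so the parsing code below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one (assumed) price element has been rendered.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "." + PRICE_CLASS))
        )
        return driver.page_source
    finally:
        driver.quit()


class _ClassTextParser(HTMLParser):
    """Collect the text of every element whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.texts = []
        self._depth = 0  # nesting depth inside a matching element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.texts.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_prices(html, css_class=PRICE_CLASS):
    """Return the text content of each element carrying `css_class`."""
    parser = _ClassTextParser(css_class)
    parser.feed(html)
    return parser.texts
```

With those pieces, `extract_prices(fetch_rendered_html(url))` would return the price strings, provided the selector matches. `driver.page_source` is roughly the HTML that Chrome's "Complete Webpage" option saves, which is why the prices show up there and not in the urllib response. The standard-library parser is used only to keep the sketch dependency-free; BeautifulSoup or lxml would do the same job with less code.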

