Using Nokogiri to parse JavaScript-hidden HTML

I'm trying to use Nokogiri to parse this ASCAP website to retrieve some song/artist information. Here's an example of what I'd want to query:

https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z

I can't seem to access the DOM properly because the source seems to be hidden behind some kind of JavaScript. I'm pretty new to web scraping, so it has been difficult to find a way to do this. I tried using Charles to see if the data was being pulled from another site, and I have been using XHelper to generate accurate XPath queries.

This returns nil, where it should return "1, 2 YA'LL":

require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(URI.open('https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z'))

puts page.xpath('/html/body/div[@id="desktopSearch"]/div[@id="ace"]/div[@id="aceMain"]/div[@id="aceResults"]/ul[@id="ace_list"]/li[@class="nav"][1]/div[@class="workTitle"]').text

Step #1 when spidering/scraping is to turn off JavaScript in your browser, then look at the page. What you see at that point is what Nokogiri sees. If the data you want is visible, then odds are really good you can get at it with a parser.

At that point, do NOT rely on the XPath or CSS selector a browser shows you when you inspect an element to find the path to the node(s) you want. Browsers do a lot of fix-ups when displaying a page, and the inspector view usually reflects those, including data retrieved dynamically. In other words, the browser is lying to you about what it originally retrieved from the server. To work around that, use wget, curl or nokogiri http://some_URL at the command line to retrieve the original page, then locate the node you want.
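The same check can be done from Ruby. The following is only a minimal sketch using the standard library: fetch_raw_html and raw_html_has_id? are hypothetical helper names, and the two HTML strings are made-up stand-ins for the raw and the browser-rendered page, not the real ASCAP markup. It also shows one detail of the URL in the question: everything after the # is a fragment, which HTTP clients keep to themselves and never send to the server.

```ruby
require 'net/http'
require 'uri'

# Download the page exactly as the server sends it, with no JavaScript run.
# Not called below, so the sketch runs without network access.
def fetch_raw_html(url)
  Net::HTTP.get(URI.parse(url))
end

# True if the raw markup mentions an element with the given id.
# A plain substring check is enough to answer "is it there at all?".
def raw_html_has_id?(html, id)
  html.include?(%(id="#{id}")) || html.include?(%(id='#{id}'))
end

url = 'https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z'
uri = URI.parse(url)

# Everything after "#" is the fragment. It stays on the client for the
# page's JavaScript to read, so the server never sees which writer was asked for.
puts uri.request_uri # the path that is actually requested
puts uri.fragment    # the part that is never sent

# Hypothetical stand-ins for the raw page and the browser-rendered page.
raw_shell = '<html><body><div id="ace"></div><script src="app.js"></script></body></html>'
rendered  = '<html><body><div id="ace"><ul id="ace_list"><li class="nav"></li></ul></div></body></html>'

puts raw_html_has_id?(raw_shell, 'ace_list')
puts raw_html_has_id?(rendered, 'ace_list')
```

Against the real site you would call raw_html_has_id?(fetch_raw_html(url), 'ace_list'). If that comes back false, no XPath will find the node in what Nokogiri receives.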

If you don't see the node you want, then you're going to need other tools, such as something from the Watir suite, which lets you drive a browser that understands JavaScript. The browser can retrieve the page, interpret the JavaScript, and load any dynamic page content. Then you should be able to get at the markup and pass it to Nokogiri.
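As a rough sketch of that hand-off, assuming the watir and nokogiri gems and a Chrome driver are installed: rendered_work_titles is a hypothetical helper name, and the ace_list id and workTitle class are taken from the XPath in the question, so they may not match the live page.

```ruby
# Sketch only: needs the watir and nokogiri gems plus a Chrome driver.
# The gems are required inside the method so that merely loading this
# file does not start a browser.
def rendered_work_titles(url)
  require 'watir'
  require 'nokogiri'

  browser = Watir::Browser.new(:chrome)
  begin
    browser.goto(url)
    # Wait until the page's JavaScript has added the results list.
    browser.ul(id: 'ace_list').wait_until(&:present?)
    # Hand the rendered markup to Nokogiri and query it as usual.
    Nokogiri::HTML(browser.html).css('ul#ace_list div.workTitle').map(&:text)
  ensure
    browser.close
  end
end
```

Calling rendered_work_titles with the URL from the question opens a browser window, waits for the list, and returns the titles as an array of strings.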

I used Google Chrome's inspector tools to log the XMLHttpRequests and was easily able to figure out where the data was actually being loaded from. Thanks to @NickVeys!
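Once the request is known, the data can usually be fetched directly and parsed without a browser or Nokogiri. The sketch below is illustrative only: the post does not say what the real endpoint or payload looks like, so ENDPOINT, fetch_json, work_titles and the JSON shape are all assumptions to be replaced with whatever the Network tab shows.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical endpoint: substitute the real URL shown in the Network tab.
ENDPOINT = 'https://example.com/api/writer/1628840'

# Fetch the response body. Not called below, so the sketch runs offline.
def fetch_json(url)
  Net::HTTP.get(URI.parse(url))
end

# Pull the titles out of an assumed payload shape:
#   {"works": [{"workTitle": "..."}, ...]}
def work_titles(json_text)
  JSON.parse(json_text).fetch('works', []).map { |work| work['workTitle'] }
end

# Made-up sample response, used in place of a live request.
sample = <<~JSON
  {"works": [{"workTitle": "1, 2 YA'LL"}, {"workTitle": "ANOTHER TITLE"}]}
JSON

puts work_titles(sample)
```

With the real URL in place, work_titles(fetch_json(ENDPOINT)) would do the same against live data, provided the response really is JSON of that shape.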

