
Using Nokogiri to parse JavaScript hidden HTML

I'm trying to use Nokogiri to parse the ASCAP website to retrieve some song/artist information. Here's an example of what I'd want to query:

https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z

I can't seem to access the DOM properly because the source seems to be hidden behind some kind of JavaScript. I'm pretty new to web scraping so it has been pretty difficult trying to find a way to do this. I tried using Charles to see if data was being drawn from another site, and have been using XHelper to generate accurate XPath queries.

This prints an empty string, where it should print "1, 2 YA'LL":

require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(URI.open('https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z'))

puts page.xpath('/html/body/div[@id="desktopSearch"]/div[@id="ace"]/div[@id="aceMain"]/div[@id="aceResults"]/ul[@id="ace_list"]/li[@class="nav"][1]/div[@class="workTitle"]').text

Step #1 when spidering or scraping is to turn off JavaScript in your browser, then look at the page. What you see at that point is what Nokogiri sees. If the data you want is visible, then odds are really good you can get at it with a parser.

At that point, do NOT rely on the XPath or CSS selector a browser shows when you inspect an element to give you the path to the node(s) you want. Browsers do a lot of fix-ups when displaying a page, and the inspector's view usually reflects those, including data that was retrieved dynamically. In other words, the browser is lying to you about what it originally retrieved from a page. To work around that, use wget, curl or nokogiri http://some_URL at the command line to retrieve the original page, then locate the node you want.
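One thing worth checking when you fetch the original page is what actually gets requested. As a small sketch using only Ruby's standard uri library and the URL from the question: everything after the # is a fragment, which the client keeps to itself and does not send in the HTTP request. The variable names here are only for illustration.

```ruby
require 'uri'

# URL taken from the question; everything after '#' is the fragment.
url = URI.parse('https://mobile.ascap.com/aceclient/AceClient/#ace/writer/1628840/JAY%20Z')

# The fragment stays in the client; it is not part of the HTTP request.
fragment = url.fragment

# request_uri is the path (plus query, if any) that an HTTP client sends.
requested = url.request_uri

puts "Sent to server:  #{requested}"
puts "Kept in browser: #{fragment}"
```

So a plain fetch of this URL asks only for the base page, and the writer ID in the fragment is presumably handled by the page's JavaScript afterwards, which would be consistent with Nokogiri never seeing the results list.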

If you don't see the node you want, then you're going to need to use other tools, such as something from the Watir suite, which lets you drive a browser which understands JavaScript. A browser can retrieve a page, interpret the JavaScript, and retrieve any dynamic page content. Then you should be able to get at the markup and pass it to Nokogiri.

I used the Google Chrome inspector tools to log the XMLHttpRequests and was easily able to figure out where the data was actually being loaded from. Thanks to @NickVeys!
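For anyone following the same route, here is a rough sketch of calling such an endpoint directly once you have found it in the network panel. The thread does not give the actual endpoint or its response format, so this assumes a JSON response, and the "works" and "title" keys, the method names, and the sample payload are all hypothetical.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetch the body of the request found in the browser's network panel.
def fetch_body(endpoint)
  Net::HTTP.get(URI(endpoint))
end

# Extract titles from a JSON payload. The "works"/"title" keys are hypothetical.
def work_titles(json_text)
  JSON.parse(json_text).fetch('works', []).map { |work| work['title'] }
end

# Hypothetical payload standing in for a real response body.
sample = <<~JSON
  {"works": [{"title": "1, 2 YA'LL"}, {"title": "ANOTHER TITLE"}]}
JSON

puts work_titles(sample)
```

With a real endpoint, work_titles(fetch_body(endpoint)) would replace the sample, after adjusting the keys to match whatever the response actually contains.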
