Parsing HTML with a weird encoding with Nokogiri

I can't use XPath on this document because its encoding comes out weird. I hope you can help me out.

require "nokogiri"
require "open-uri"
link = "http://www.arla.dk/Services/SearchService.asmx/RecipeResult?q=allRecipe&paging=6&include=&exclude=&area=recipeSearch&languageBranch=da"
# URI.open is the open-uri entry point; a bare open(url) no longer fetches URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open(link))
doc.xpath("//h2")

The xpath method returns an empty array. It looks like the document has not been parsed correctly. I think this is because the file being parsed contains encoded characters:

&lt;strong&gt;Frokost til 8&lt;/strong&gt;
&lt;ul&gt;&lt;li class='ingHeading'&gt;&lt;strong&gt;&lt;b&gt;Flade

The response is XML, so first parse it with Nokogiri::XML:

xml = Nokogiri::XML(URI.open(link))

Then the first string element contains some HTML, so parse that with Nokogiri::HTML:

doc = Nokogiri::HTML xml.at('string').text

Now you can do your search:

doc.xpath '//h2'

As stated above, the issue is that the HTML is encoded, which is why you are seeing escape sequences, for example &lt; instead of <. To get around it, unescape the HTML.

"How do I encode/decode HTML entities in Ruby?" basically suggests using htmlentities.
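
As a rough illustration of the unescaping step, here is a sketch that uses CGI.unescapeHTML from Ruby's standard library instead of the htmlentities gem, so it runs without installing anything. This is a swapped-in alternative, not the approach the linked answer recommends; with the gem, the equivalent call would be HTMLEntities.new.decode. The sample string is an assumption modelled on the snippet in the question.

```ruby
require "cgi"

# Escaped markup, as it appears inside the XML response.
escaped = "&lt;strong&gt;Frokost til 8&lt;/strong&gt;"

# CGI.unescapeHTML turns &lt; &gt; &amp; &quot; and numeric references back
# into the characters they stand for.
unescaped = CGI.unescapeHTML(escaped)

puts unescaped
```

CGI.unescapeHTML only knows a handful of named entities, so for markup full of names such as &aelig; or &oslash; the htmlentities gem is likely the safer choice.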
