Parsing HTML with a weird encoding with Nokogiri

Question

I can't use XPath because the encoding gets weird. I hoped you could help me out of this trouble.

require "nokogiri"
require "open-uri"
link = "http://www.arla.dk/Services/SearchService.asmx/RecipeResult?q=allRecipe&paging=6&include=&exclude=&area=recipeSearch&languageBranch=da"
doc = Nokogiri::HTML(URI.open(link))
doc.xpath("//h2")

The xpath method returns an empty array. It looks like the document has not been parsed correctly. I think this is because the file being parsed contains encoded characters:

&lt;strong&gt;Frokost til 8&lt;/strong&gt;
&lt;ul&gt;&lt;li class='ingHeading'&gt;&lt;strong&gt;&lt;b&gt;Flade

Answer 1

The response is XML, so first parse it with Nokogiri::XML:

xml = Nokogiri::XML(URI.open(link))

The first string element contains HTML, so parse its text with Nokogiri::HTML:

doc = Nokogiri::HTML xml.at('string').text

Now you can do your search:

doc.xpath '//h2'

Answer 2

As stated above, the issue is that the HTML is entity-encoded, which is why you are seeing escape sequences, for example &lt; instead of <. To get around it, unescape the HTML.

The question "How do I encode/decode HTML entities in Ruby?" basically suggests using the htmlentities gem.

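A minimal sketch of the unescaping step, applied to the fragment from the question. It uses CGI.unescapeHTML from Ruby's standard library rather than the htmlentities gem, which is a substitution on my part. CGI.unescapeHTML only decodes a handful of named entities plus numeric references, so htmlentities may be the better fit if the markup contains others.

```ruby
require "cgi"

# Escaped fragment copied from the question.
escaped = "&lt;strong&gt;Frokost til 8&lt;/strong&gt;"

# Turn &lt; and &gt; back into real angle brackets.
unescaped = CGI.unescapeHTML(escaped)
puts unescaped
```

The unescaped string can then be passed to Nokogiri::HTML and queried with xpath.
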
2 answers

solution1
1 ACCEPTED 2012-10-30 10:09:49

solution2
0 2012-10-30 10:00:46
