
Using Nokogiri to parse HTML with xhtml:link tag?

I am using the Nokogiri gem to parse HTML data.

$ gem list nokogiri

*** LOCAL GEMS ***

nokogiri (1.6.2.1)

Sample HTML is:

<html>
  <body>
    <xhtml:link>
      <div>
    Some content.
      </div>
    </xhtml:link>
  </body>
</html>

When I run an XPath query against it, I get this error:

>>  doc.xpath('/html/body/xhtml:link/div')
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: /html/body/xhtml:link/div
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:159:in `evaluate'
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:159:in `block in xpath'
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:150:in `map'
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:150:in `xpath'
    from (irb):95
    from /usr/bin/irb:12:in `<main>'

A full sample live HTML page can be found here

How can I avoid this error?

You need to declare the XML namespace (xhtml in your example) on the root element so that Nokogiri can resolve the prefix. Without that declaration the prefix used in the XPath expression is undefined, and Nokogiri raises this error.

You can do it this way:

<html xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <body>
        <xhtml:link>
            <div>Some content.</div>
        </xhtml:link>
    </body>
</html>

See also this answer and this answer.

UPDATE based on comment

I've reviewed the Nokogiri docs and found two workarounds. One is to pass the namespace mapping to the query:

doc.xpath('/html/body/xhtml:link/div', 'xhtml' => 'http://www.w3.org/1999/xhtml')

The other is to manually add the namespace to the document's root element:

doc.root.add_namespace 'xhtml', 'http://www.w3.org/1999/xhtml'
doc.xpath('/html/body/xhtml:link/div')

Both approaches silence the error, but in both cases the query returns an empty array for me. That differs from what happens when the xmlns attribute is present in the original document.

You can ignore namespaces if you are sure there are no unprefixed elements with the same name in the same context. Namespaces affect element and attribute names. If you select nodes using node() or *, you can test their local-name() in a predicate without having to deal with namespaces.

In your example, you can select the xhtml:link element by selecting all elements in the context of body, and then restricting the result set to only those which have a local-name equal to link:

doc.xpath('/html/body/*[local-name()="link"]/div')

You might select unwanted HTML <link> elements if they occur in the body (they should never be there, but HTML parsers don't care if they are). But if they occur, they should be empty elements. There will never be one with a <div> inside, so you're safe.
