
Why is XPath returning text outside the HTML tags?

I am working with a document that has some text outside the <html> tag. When I read the data inside the body, the result also includes the text that is not even inside the html tag.

page_text = Nokogiri::HTML(open(file_path)).xpath("//body").text
p page_text

Output:

"WARC/1.0\nWARC-Type: response\nWARC-Date: 2012-02-11T04:48:01Z\nWARC-TREC-ID: clueweb12-0000tw-13-04988\nWARC-IP-Address: 184.85.26.15\nWARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR\nWARC-Target-URI: http://www.allchocolate.com/health/basics/\nWARC-Record-ID: \nContent-Type: application/http; msgtype=response\nContent-Length: 14577\n\n\n\n\n sample document\n\n\n hello world\n\n"

Document:

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>sample document</title>
</head>
<body>
    hello world
</body>
</html>

Nokogiri is trying to parse the file contents as an HTML document, but it isn't a valid one. It is a text document that just happens to contain an HTML document. Of course Nokogiri doesn't know this, and it can't pick out the HTML part by itself, so it tries to parse the whole thing. Since the whole file is not valid HTML, this produces errors.

As it parses, Nokogiri attempts to fix these errors as best it can, but that doesn't work in this case, and results in the strange-looking output you see here.

In particular, when Nokogiri sees the text before the HTML, it assumes that the text should be part of the HTML document body. So it creates html and body elements and injects them into the document, then adds the text as a child of that body.

Later it sees the actual <body> tag, but since it already has a body element, and there can only be one such element, it ignores the tag.

You need to make sure that you only provide valid HTML (or as close to valid as you can; the error correction can fix small things). You will probably need to pre-process your files in some way to remove the extra text at the beginning.

Clearly the leading text is a problem, but the trailing text is not. XML is a highly structured language, and applying an XML parser to HTML means, at the very least, that you have to have valid HTML. If you don't have valid HTML, then you get whatever Nokogiri spits out.

It looks to me like Nokogiri wraps the whole thing in a default root node, then returns all the text nodes in it, essentially ignoring the //body xpath. Interestingly, if you wrap your text in a div and search for the xpath //div, there are no problems, so that might suggest a solution.

It seems like Nokogiri considers //body to be equal to the root node. Ah! Maybe Nokogiri uses <body> for the root node. Nope: the xpath /body//body doesn't work.

Response to comment:

You could use a regex to search for the <body> tag and then insert a div tag. But searching HTML with a simple regex is a fragile solution, and it won't work in all cases.

By the way, you can see how Nokogiri handles text outside of tags by parsing a document that contains only the text hello world, then printing out all the nodes that Nokogiri finds:

require 'nokogiri'

nodes = Nokogiri::HTML(open('html.html')).xpath('//*')

nodes.each do |node|
  puts node.name
end

--output:--
html
body
p

So Nokogiri wraps the text in three tags.

Or, better yet, you can parse your document and print it out as HTML:

require 'nokogiri'

doc = Nokogiri::HTML(open('./html.html'))
puts doc.to_html

--output:--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><body><p>WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577




    <title>sample document</title>


    hello world


</uuid:ff32c863-5066-4f51-802a-f31d4af074d5></p></body></html>

That means you can get hello world like this:

require 'nokogiri'

doc = Nokogiri::HTML(open('./html.html'))
title = doc.at_xpath('//title')
puts title.next.text.strip

--output:--
hello world

Another approach is to get rid of the non-HTML content before parsing with Nokogiri:

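require 'nokogiri'

# Skip the header block (everything up to the first blank line),
# then read the rest of the file as HTML.
html = File.open('html.html') do |infile|
  infile.gets("\n\n")  # discard the non-HTML header block
  infile.read          # slurp the rest of the file
end

doc = Nokogiri::HTML(html)
puts doc.at_xpath('//body').text.strip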

--output:--
hello world

That assumes there's always a blank line separating the non-HTML content from the HTML content.
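If the files might use CRLF line endings instead of LF, the blank-line split can be made a little more tolerant. The following is only a sketch under that assumption; split_record is a hypothetical helper name, not part of Nokogiri or of the original answer, and it still relies on a blank line being present:

```ruby
# Hypothetical helper: split a raw record into [header, payload] at the
# first blank line, accepting either LF or CRLF line endings.
# payload is nil when the input contains no blank line.
def split_record(raw)
  header, payload = raw.split(/\r?\n\r?\n/, 2)
  [header, payload]
end

raw = "WARC/1.0\r\nWARC-Type: response\r\n\r\n<html><body>hello world</body></html>\n"
header, html = split_record(raw)
puts header.lines.first.strip  # prints: WARC/1.0
puts html                      # prints: <html><body>hello world</body></html>
```

The payload string can then be handed to Nokogiri::HTML in place of the slurped file contents.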

First of all, @7stud's answer is spot on: you can split the file on \n\n. But in my document collection there isn't always a \n\n before the actual HTML code.

So, using the same idea, I came up with another workaround: remove all the text before the opening html tag with a regex, then pass the result to Nokogiri to parse.

require 'nokogiri'

file = File.read(file_path)
# Drop everything before the first <html tag (this also drops the DOCTYPE).
file = file.sub(/.*?(?=<html)/im, "")
page = Nokogiri::HTML(file)

Now it is working fine.

It's simple to preprocess the content before passing it to Nokogiri:

require 'nokogiri'

text = '
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>sample document</title>
</head>
<body>
    hello world
</body>
</html>
'

doc = Nokogiri::HTML(text[/<!DOCTYPE.+/m])
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n    <title>sample document</title>\n</head>\n<body>\n    hello world\n</body>\n</html>\n"

The trick is:

text[/<!DOCTYPE.+/m]

which tells Ruby to look through the text and return everything from <!DOCTYPE to the end of the string, which is valid HTML.
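One thing to keep in mind is that String#[] with a regex returns nil when nothing matches, and the pattern above is case-sensitive. A minimal sketch of a more forgiving variant follows; html_part is a hypothetical helper name, and falling back to the <html tag when there is no DOCTYPE is an assumption about the input, not something the original answer does:

```ruby
# Hypothetical helper: return the text from the first <!DOCTYPE or <html
# (matched case-insensitively) to the end of the string, or nil when
# neither marker is present.
def html_part(text)
  text[/(?:<!DOCTYPE|<html).*/im]
end

puts html_part("WARC/1.0\n\n<!doctype html>\n<html></html>")
# prints:
# <!doctype html>
# <html></html>

puts html_part("WARC/1.0\n\n<HTML><body>x</body></HTML>")
# prints: <HTML><body>x</body></HTML>

p html_part("no markup here")
# prints: nil
```

Checking for nil before calling Nokogiri::HTML makes it explicit when a file had no recognizable HTML start.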
