How do I parse this HTML code with Nokogiri?

Question

I have the following HTML:

<h3><strong>Adresse:</strong></h3>
    <p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br>
<b>64295 Darmstadt</b><p>
<h3>Kommunikationsdaten: </h3> 
<p>

But the <p> and <br> tags are not closed.

How do I extract the address information:

Hochschule Darmstadt
TechnologieTransferCentrum
D19, Raum 221, 222
Schöfferstraße 10
64295 Darmstadt

Answer 1

Starting from this basis:

# encoding: UTF-8
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<h3><strong>Adresse:</strong></h3>
    <p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br>
<b>64295 Darmstadt</b><p>
<h3>Kommunikationsdaten: </h3> 
<p>
EOT

puts doc.errors
puts doc.to_html

I get this when I run the code:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<h3><strong>Adresse:</strong></h3>
    <p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br><b>64295 Darmstadt</b></p>
<p>
</p>
<h3>Kommunikationsdaten: </h3>
<p></p>
</body></html>

Notice that Nokogiri has added the <html> and <body> tags. It has also closed the <p> tags by adding </p>. We can tell it to parse the HTML as a fragment, without adding the document wrapper, by using this instead:

Nokogiri::HTML::DocumentFragment.parse

Which generates:

<h3><strong>Adresse:</strong></h3>
    <p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br><b>64295 Darmstadt</b></p><p>
</p><h3>Kommunikationsdaten: </h3>
<p></p>

Nokogiri still fixes up the HTML, but the result is essentially the HTML that was passed in. Either way, the resulting HTML is technically correct.

On to finding the text in question. If there is only one <p> tag, or it's the first one:

doc.at('p').text
=> "\nHochschule Darmstadt\nTechnologieTransferCentrum\nD19, Raum 221, 222\nSchöfferstraße 1064295 Darmstadt"

Or:

doc.at('h3').next_sibling.next_sibling.text
=> "\nHochschule Darmstadt\nTechnologieTransferCentrum\nD19, Raum 221, 222\nSchöfferstraße 1064295 Darmstadt"

Two next_sibling calls are needed. The first finds the text node immediately following the end of the <h3> node:

doc.at('h3').next_sibling
=> #<Nokogiri::XML::Text:0x3fef59dedfb8 "\n    ">

Answer 2

Assuming you have parsed the document into doc, this:

puts doc.at('//h3[contains(strong, "Adresse:")]/following-sibling::p').text

will give you the following output:

Hochschule Darmstadt
TechnologieTransferCentrum
D19, Raum 221, 222
Schöfferstraße 10
64295 Darmstadt
