I have the following HTML:
<h3><strong>Adresse:</strong></h3>
<p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br>
<b>64295 Darmstadt</b><p>
<h3>Kommunikationsdaten: </h3>
<p>
But the <p> and <br> tags are not closed.
How do I extract the address information:
Hochschule Darmstadt
TechnologieTransferCentrum
D19, Raum 221, 222
Schöfferstraße 10
64295 Darmstadt
Starting from this basis:
# encoding: UTF-8
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h3><strong>Adresse:</strong></h3>
<p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br>
<b>64295 Darmstadt</b><p>
<h3>Kommunikationsdaten: </h3>
<p>
EOT
puts doc.errors
puts doc.to_html
I get this when I run the code:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<h3><strong>Adresse:</strong></h3>
<p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br><b>64295 Darmstadt</b></p>
<p>
</p>
<h3>Kommunikationsdaten: </h3>
<p></p>
</body></html>
Notice that Nokogiri has added the DOCTYPE and the <html> and <body> tags. It has also closed the <p> tags by adding </p>. To parse the HTML as a fragment instead, without those added wrapper tags, use:
Nokogiri::HTML::DocumentFragment.parse
Which generates:
<h3><strong>Adresse:</strong></h3>
<p>
Hochschule Darmstadt<br>
TechnologieTransferCentrum<br>
D19, Raum 221, 222<br>
Schöfferstraße 10<br><b>64295 Darmstadt</b></p><p>
</p><h3>Kommunikationsdaten: </h3>
<p></p>
Some fixup of the HTML is still happening, but the result is essentially the HTML that was passed in. Either way, the resulting HTML is technically correct.
On to finding the text in question. If there is only one <p> tag, or it's the first one:
doc.at('p').text
=> "\nHochschule Darmstadt\nTechnologieTransferCentrum\nD19, Raum 221, 222\nSchöfferstraße 1064295 Darmstadt"
Or:
doc.at('h3').next_sibling.next_sibling.text
=> "\nHochschule Darmstadt\nTechnologieTransferCentrum\nD19, Raum 221, 222\nSchöfferstraße 1064295 Darmstadt"
Two next_sibling calls are needed. The second one reaches the <p> element, because the first returns only the whitespace text node immediately following the end of the <h3> node:
doc.at('h3').next_sibling
=> #<Nokogiri::XML::Text:0x3fef59dedfb8 "\n ">
Assuming you have parsed the document into doc, this:
puts doc.at('//h3[contains(strong, "Adresse:")]/following-sibling::p').text
will give you the following output:
Hochschule Darmstadt
TechnologieTransferCentrum
D19, Raum 221, 222
Schöfferstraße 10
64295 Darmstadt