Using Java 6 and Jsoup 1.7.3, how can I parse this HTML where sibling text is not inside an element?
Mainly, my question is how can I parse ...
<p>some text<br />
<br />
<strong>categorized: </strong>like this<br />
<br /></p>
... where I am ultimately interested in obtaining key/value pairs like "categorized", "like this" using Java and Jsoup? I am treating the <strong> tag as a kind of delimiter that marks the key; the text that follows it, which is inconveniently not enclosed in a tag, is the value I need to grab.
I think the challenge for me is that the "like this" part is not in an element. It is a sibling node, but it is not selectable with CSS, so I can't find it with Jsoup's selectors. I am not clear on how the Node and Element relationship works in Jsoup, or how I can get both the element text "categorized" and its sibling "like this" in a single call.
In more detail, I do not have control over the HTML structure, since I am trying to collect data from many Consumer Product Safety Commission web pages. The pages are formatted in a few different ways, but one format in particular is causing me problems when I parse out data using Java and Jsoup.
<div class="archived">
<p style="text-align: center;"><strong><span style="color: #ff0000;">Note: The hotline number and ...</span></strong></p>
<h2 style="text-align: left;">CPSC, Elkay Manufacturing Co. Announces ...</h2>
<p>WASHINGTON, D.C. - The U.S. Consumer Product Safety Commission ...<br />
<br />
<strong>Name of product:</strong> Elkay hot/cold bottled water coolers <br />
<br />
<br />
<strong>Units:</strong> 145,000<br />
<br />
<strong>Description:</strong> These 115 volt hot/cold bottled water coolers ... <br />
<p><img title="Picture of Recalled Water Cooler" src="/PageFiles/73998/04175.jpg" alt="Picture of Recalled Water Cooler" width="110" height="434" /></p>
</div>
That particular section of HTML is shortened, but it originates from http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/
String url = "http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/";
Document doc = Jsoup.connect(url).get();
Elements archived = doc.select("div.archived > *");
for (Element ele : archived) {
    // what goes here to get those key/value pairs?
}
This isn't a complete answer, but it'll get you 95% of the way there.
String url = "http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/";
Document doc = Jsoup.connect(url).get();
// Select every <strong> inside the archived block; each one acts as a key.
Elements archived = doc.select("div.archived strong");
for (Element element : archived) {
    System.out.println("KEY: " + element.text());
    // nextSibling() returns a Node, so it can be a text node or an element such as <br />.
    System.out.println("VALUE: " + element.nextSibling());
}
Output:
KEY: Firm's Hotline: (800) 303-5507
VALUE: <br />
KEY: Name of product:
VALUE: Wall Plug Ethernet Bridge
KEY: Units:
VALUE: About 53,500 units
KEY: Manufacturer:
VALUE: NETGEAR Inc., of Santa Clara, Calif.
KEY: Hazard:
VALUE: The plastic housing on these units can detach, posing a shock hazard.
and so on...
As you can see, it will take a little work to discard the unwanted entries, such as the first KEY/VALUE pair, whose value is a <br /> element rather than text, but otherwise it should work! Good luck.
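One possible way to do that filtering is sketched below. It keeps a pair only when the node right after the <strong> is a TextNode with non-blank text, which skips cases like the <br /> value above. The class name StrongPairs and the method extract are made up for illustration, and it parses an HTML string with Jsoup.parse instead of fetching the page, so treat it as a starting point under those assumptions rather than a finished solution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class StrongPairs {

    // Hypothetical helper: maps each <strong> label to the text node directly after it.
    public static Map<String, String> extract(String html) {
        // LinkedHashMap keeps the pairs in document order.
        Map<String, String> pairs = new LinkedHashMap<String, String>();
        Document doc = Jsoup.parse(html);
        for (Element strong : doc.select("strong")) {
            Node next = strong.nextSibling();
            // Skip <strong> tags followed by an element (e.g. <br />) or by nothing at all.
            if (!(next instanceof TextNode)) {
                continue;
            }
            String value = ((TextNode) next).text().trim();
            if (value.length() == 0) {
                continue;
            }
            // Drop the trailing colon from labels such as "Units:".
            String key = strong.text().trim();
            if (key.endsWith(":")) {
                key = key.substring(0, key.length() - 1).trim();
            }
            pairs.put(key, value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The small snippet from the question, inlined as a string.
        String html = "<p>some text<br /><br />"
                + "<strong>categorized: </strong>like this<br /><br /></p>";
        for (Map.Entry<String, String> e : extract(html).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

For the snippet from the question this should print categorized = like this. To run it against a live page, you could pass Jsoup.connect(url).get().html() to extract, or change the method to accept a Document. Note that if two <strong> labels on a page share the same text, the later one overwrites the earlier one in the map.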