JSoup: Extracting one word from within a class tag

I've been using JSoup for the last few weeks to successfully scrape data from a web page; however, I've hit a dead end trying to figure out a way to extract just a single word from within a class tag, instead of the whole text.

Here is the Java code I'm using:

// store all the search results in the elmAllSearchResults element
Element elmAllSearchResults = doc.getElementById("SearchResults"); 
// extract the elements with class "desc" from elmAllSearchResults
Elements elmSize = elmAllSearchResults.getElementsByClass("desc");

I use it to extract multiple lines similar to this:

<font class="desc">Date 11-04; 09:21, Size 8100.00 MB, User <a class="desc" href="/member/aUser/" title="Browse">
<font class="desc">Date 12-04; 09:21, Size 62 MB, User <a class="desc" href="/member/bUser/" title="Browse">

But now all I want to do is extract the size (8100.00 MB and 62 MB in this case) from this string of text. As the size isn't wrapped in any tag of its own, I can't seem to find a way to get it.

Is it possible?

Thank you.

Jsoup only takes you as far as individual HTML elements. If you want to parse their text bodies, which are essentially Strings, you need to use String methods instead, such as substring(), indexOf(), replaceAll(), etc.

For example, if you can guarantee that the desired information is always between ", Size " and ", User", then you can take the substring between those two markers:

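String before = ", Size ";
String after = ", User";

// elements: the Elements selected by class "desc" (elmSize in the question)
for (Element element : elements) {
    String text = element.text();
    int start = text.indexOf(before);
    int end = start < 0 ? -1 : text.indexOf(after, start + before.length());
    // skip elements without both markers, e.g. the <a class="desc"> links
    if (end < 0) continue;
    String size = text.substring(start + before.length(), end);
    // ...
}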
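
If the surrounding markers are less reliable than the size format itself, a regular expression is another option. The following is only a sketch, not part of the original answer: the class name SizeExtractor, the method extractSize, and the list of units (KB, MB, GB, TB) are assumptions made for illustration, and the sample strings merely imitate what element.text() might return for the rows in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SizeExtractor {
    // Matches "Size <number> <unit>", e.g. "Size 8100.00 MB".
    // The unit list is an assumption; extend it if the page uses others.
    private static final Pattern SIZE =
            Pattern.compile("Size\\s+(\\d+(?:\\.\\d+)?\\s*(?:KB|MB|GB|TB))");

    // Returns the size portion of the text, or empty if no size is found.
    public static Optional<String> extractSize(String text) {
        Matcher m = SIZE.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample strings shaped like the rows in the question.
        String[] rows = {
            "Date 11-04; 09:21, Size 8100.00 MB, User",
            "Date 12-04; 09:21, Size 62 MB, User",
            "no size here"
        };
        for (String row : rows) {
            System.out.println(extractSize(row).orElse("(none)"));
        }
    }
}
```

In the loop above, extractSize(element.text()) would take the place of the substring() call, and elements without a size simply yield an empty Optional instead of throwing.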
