
Java StringTokenizer troubles - Newbie

I know I'm probably being incredibly stupid here, but can anybody shed any light on my problem? I'm trying to extract the title from a string containing HTML...

 public static void main(String args[]) {
  System.out.println(getTitle("<title>this is it</title>"));
 }

 public static String getTitle(String a) {
  StringTokenizer token = new StringTokenizer(a, "<title>", false);
  return token.nextToken("</title>");
 }

It keeps returning "h" and I can't work out why! Am I being naive?

Cheers

I think your problem lies here (quote from the API doc, text bolded by me):

"The set of delimiters (the characters that separate tokens) may be specified either at creation time or on a per-token basis."

That is, the delimiter is not a string, but a set of characters. When you pass "<title>" as the second parameter, you tell your tokenizer that the delimiters are any of the characters <, t, i, l, e or >. The call nextToken("</title>") then replaces that set with <, /, t, i, l, e and >. Thus the tokenizer dutifully skips all the characters in the first tag and then the t of "this", and returns h because that is not in the set of delimiters you gave it, but the next character (i) is.
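The behaviour described above can be reproduced with a small sketch. The class name DelimiterSetDemo and the helper methods firstToken and allTokens are made up for illustration; only the StringTokenizer calls come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterSetDemo {

    // Same two calls as in the question.
    static String firstToken(String html) {
        // Delimiters are the single characters '<', 't', 'i', 'l', 'e', '>'.
        StringTokenizer token = new StringTokenizer(html, "<title>", false);
        // nextToken(String) switches the delimiter set to the characters of "</title>".
        return token.nextToken("</title>");
    }

    // Collects every token produced with the characters of "<title>" as delimiters.
    static List<String> allTokens(String html) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(html, "<title>");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<title>this is it</title>";
        System.out.println(firstToken(html)); // prints "h"
        System.out.println(allTokens(html));  // prints "[h, s , s , /]"
    }
}
```

Every t, i, l and e inside the text is treated as a separator too, which is why the title gets chopped into fragments instead of being returned whole.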

So StringTokenizer is not quite what you need here. Note also this remark from the API docs:

"StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."

Or use a third-party library, as has been noted by others.
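For the java.util.regex route mentioned in the API note, a minimal sketch might look like the following. The class name RegexTitle, the case-insensitive matching and the choice to return null when no title is found are assumptions made for this example, and a regex like this only suits simple, well-formed input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTitle {

    // Matches the literal strings "<title>" and "</title>" and captures
    // whatever lies between them (reluctantly, so it stops at the first close tag).
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the title text, or null if no <title>...</title> pair is found.
    static String getTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints "this is it"
    }
}
```

Unlike the StringTokenizer delimiter argument, the pattern treats "<title>" as a whole string rather than as a set of characters.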

I am not sure if StringTokenizer is the best class to use in your scenario. Maybe you can solve your task by using String.substring(int, int). As BearsWillEatYou indicated, if you want to do more sophisticated HTML parsing, use a third-party library.

public static void main(String args[]) {
    System.out.println(getTitle("<title>this is it</title>"));
}

public static String getTitle(String a) {
    return a.substring(a.indexOf("<title>") + "<title>".length(), a.indexOf("</title>"));
}

The delimiter you specified is "", which is the empty string. There is an empty string between the "t" and "h" at the start of your string, thus nextToken returns "t". It is normal, and works as specified. See http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

You cannot use StringTokenizer this way. See the javadoc: http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

The delims argument contains the set of characters that are considered as delimiters in the string. Thus here, you have "<", "t", "i", ... as delimiters.

For that kind of work, you really should consider using a dedicated HTML or XML library. You could also use "<>" as delimiters and implement a minimalist HTML parser suiting your needs, but this will probably lead to bugs, headaches, and more bugs once your needs grow beyond the minimum.
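To illustrate the "<>" idea, here is a rough sketch of such a minimalist parser. The class name AngleBracketTitle, the use of returnDelims = true to tell tags from text, and returning null when there is no title are choices made for this example, not something the answer prescribes. It ignores attributes, comments, entities and malformed markup, which is exactly where the bugs mentioned above would start.

```java
import java.util.StringTokenizer;

public class AngleBracketTitle {

    // Returns the text between <title> and </title>, or null if none is found.
    static String getTitle(String html) {
        // '<' and '>' are the delimiters; returnDelims = true hands them back
        // as tokens so we can track whether we are inside a tag.
        StringTokenizer st = new StringTokenizer(html, "<>", true);
        boolean inTag = false;    // true between '<' and '>'
        boolean inTitle = false;  // true after <title>, before </title>
        StringBuilder title = new StringBuilder();

        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals("<")) {
                inTag = true;
            } else if (tok.equals(">")) {
                inTag = false;
            } else if (inTag) {
                // Tag name; anything with attributes will not match here.
                if (tok.equalsIgnoreCase("title")) {
                    inTitle = true;
                } else if (tok.equalsIgnoreCase("/title")) {
                    return title.toString();
                }
            } else if (inTitle) {
                title.append(tok); // plain text inside the title element
            }
        }
        return null; // no closing </title> seen
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints "this is it"
    }
}
```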

If you are parsing HTML, the best way might be HTML Cleaner, according to this SO post.

I would recommend using this domain-specific library, as it will also give you an easy way to extend the functionality of your app when required, or help you with another app if that one also parses HTML.
