Java: I have a big string of html and need to extract the href="..." text

I have a string containing a large chunk of html and am trying to extract the link from the href="..." portion of the string. The href could be in one of the following forms:

<a href="..." />
<a class="..." href="..." />

I don't really have a problem with regex, but for some reason it goes wrong when I use the following code:

String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}

Can someone tell me what is wrong with my code? I did this kind of thing in PHP, but in Java I am somehow doing something wrong... What happens is that it prints the whole html string whenever I try to print the match...

EDIT: Just so that everyone knows what kind of string I am dealing with:

<a class="Wrap" href="item.php?id=43241"><input type="button">
    <span class="chevron"></span>
  </a>
  <div class="menu"></div>

Every time I run the code, it prints the whole string... That's the problem...

And about using jTidy... I'm on it, but it would be interesting to know what went wrong in this case as well...

.* 

This is a greedy operation that will consume any character, including the quotes.

Try something like:

"href=\"([^\"]*)\""

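To make the difference concrete, here is a minimal sketch that runs both patterns against the sample string from the question. The class name `HrefGreedyDemo` and the helper `firstCapture` are made up for this illustration; only `java.util.regex` is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefGreedyDemo {
    // Pattern from the question: greedy .* with DOTALL runs on to the LAST quote in the input.
    static final Pattern GREEDY = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
    // Negated character class: the capture cannot cross a quote, so it stops at the first one.
    static final Pattern NEGATED = Pattern.compile("href=\"([^\"]*)\"");

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstCapture(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
                + "    <span class=\"chevron\"></span>\n"
                + "  </a>\n"
                + "  <div class=\"menu\"></div>";

        // Captures everything from the link up to the quote after "menu".
        System.out.println("greedy:  " + firstCapture(GREEDY, html));
        // Prints: negated: item.php?id=43241
        System.out.println("negated: " + firstCapture(NEGATED, html));
    }
}
```

With the greedy pattern the capture swallows the rest of the markup up to the final quote, which is why the original code appeared to print the whole string.
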
There are two problems with the code you've posted:

Firstly, the .* in your regular expression is greedy. This causes it to match all characters up to the last " character that can be found. You can make the match non-greedy by changing it to .*? .

Secondly, to pick up all the matches, you need to keep iterating with Matcher.find rather than looping over groups. Groups give you access to each parenthesized section of the regex. You, however, are looking for every place where the whole regular expression matches.

Putting these together gives you the following code, which should do what you need:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);

while (m.find()) 
{
    System.out.println(m.group(1));
}
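
As a self-contained illustration of the group(0) versus group(1) distinction, the sketch below runs the same loop over an input with two links. The class name `HrefFindLoop`, the method `allHrefs`, and the two-link sample string are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefFindLoop {
    // Reluctant quantifier: each capture ends at the nearest closing quote.
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);

    // Collects capture group 1 from every match, in document order.
    static List<String> allHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"first.html\">one</a>\n"
                + "<a class=\"x\" href=\"second.html\">two</a>";
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // group(0) is the whole match, group(1) only the parenthesized part.
            System.out.println(m.group(0) + " -> " + m.group(1));
        }
        // Prints:
        // href="first.html" -> first.html
        // href="second.html" -> second.html
    }
}
```

Each call to find() advances to the next match, so the loop over find() visits every link, while the group index only selects a part of the current match.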

Regex is great, but it is not the right tool for this particular purpose. Normally you want to use a stack-based parser for this. Have a look at Java HTML parser APIs like jTidy.

Use a built-in parser. Something like:

    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    kit.read(reader, doc, 0);

    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);

    while (it.isValid())
    {
        SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
        String href = (String)s.getAttribute(HTML.Attribute.HREF);
        System.out.println( href );
        it.next();
    }

Or use the ParserCallback:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        if (tag.equals(HTML.Tag.A))
        {
            String href = (String)a.getAttribute(HTML.Attribute.HREF);
            System.out.println(href);
        }
    }

    public static void main(String[] args)
        throws Exception
    {
        Reader reader = getReader(args[0]);
        ParserCallbackText parser = new ParserCallbackText();
        new ParserDelegator().parse(reader, parser, true);
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}

The Reader could be a StringReader.
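
Since the question starts from a String rather than a URL or file, a minimal sketch of the StringReader variant might look like this. The class name `StringHrefParser` and the method `extractHrefs` are hypothetical; the parser classes are the same Swing ones used above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StringHrefParser {
    // Parses an in-memory HTML string and returns the href of every <a> start tag.
    static List<String> extractHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset directive found in the markup
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked to keep the signature simple.
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
                + "    <span class=\"chevron\"></span>\n"
                + "  </a>\n"
                + "  <div class=\"menu\"></div>";
        System.out.println(extractHrefs(html));
    }
}
```

Collecting the values into a list instead of printing them inside the callback keeps the parsing step reusable from other code.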

Another easy and reliable way to do it is by using Jsoup:

Document doc = Jsoup.connect("http://example.com/").get();
Elements links = doc.select("a[href]");
for (Element link : links){
  System.out.println(link.attr("abs:href"));
}

You may use an html parser library. jtidy, for example, gives you a DOM model of the html, from which you can extract all "a" elements and read their "href" attribute.

"href=\"(.*?)\"" should also work, but I think Kugel's answer will be faster.
