Extract some content from a URL using regular expressions in Java

I want to extract content from the page at http://www.xyz.com/default.aspx using a regular expression. This is the code I have so far:

String expr = "What Regular Expression should I use here";

Pattern patt = Pattern.compile(expr, Pattern.DOTALL | Pattern.UNIX_LINES);

try {
    URL url4 = new URL("http://www.xyz.com/default.aspx");
    System.out.println("Text " + url4);

    Matcher m = patt.matcher(getURLContent(url4));
    System.out.println("Match " + m);

    // Print the first capture group of every match.
    while (m.find()) {
        String stateURL = m.group(1);
        System.out.println("Some Data " + stateURL);
    }
} catch (IOException e) {
    // MalformedURLException is a subclass of IOException, so this covers both.
    e.printStackTrace();
}

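public static CharSequence getURLContent(URL url8) throws IOException {
    URLConnection conn = url8.openConnection();
    // getContentEncoding() returns the Content-Encoding header (e.g. "gzip"),
    // not the character set, so read the charset from Content-Type instead.
    String encoding = "ISO-8859-1";
    String contentType = conn.getContentType();
    if (contentType != null) {
        for (String param : contentType.split(";")) {
            param = param.trim();
            if (param.toLowerCase().startsWith("charset=")) {
                encoding = param.substring("charset=".length()).replace("\"", "");
            }
        }
    }
    StringBuilder sb = new StringBuilder(16384);
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), encoding))) {
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line);
            System.out.println(line);
            sb.append('\n');
        }
    }
    return sb;
}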

As @bkent314 has mentioned, jsoup is a better and cleaner approach than using regular expressions.

If you inspect the source code of that website, you basically want the content from this snippet:

<div class="smallHd_contentTd">
    <div class="breadcrumb">...</div>
    <h2>Services</h2>
    <p>...</p>
    <p>...</p>
    <p>...</p>
</div>

Using jsoup, your code may look something like this:

Document doc = Jsoup.connect("http://www.ferotech.com/Services/default.aspx").get();

Element content = doc.select("div.smallHd_contentTd").first();

String header = content.select("h2").first().text();

System.out.println(header);

for (Element pTag : content.select("p")) {
    System.out.println(pTag.text());
}
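
For comparison, if a regular expression is still wanted, the following is a minimal sketch that only uses java.util.regex. It assumes the markup is as regular as the snippet above (no nested h2 or p tags) and that it is run on that fragment rather than on the whole page. The class name RegexExtract, the helper extractAll, and the sample paragraph text are made up for illustration. jsoup remains the more robust choice, since patterns like these break easily on real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class RegexExtract {
    // Matches <h2 ...>text</h2>; group(1) is the text between the tags.
    public static final Pattern H2 =
            Pattern.compile("<h2\\b[^>]*>(.*?)</h2>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Matches <p ...>text</p>; \b keeps it from matching <pre> or <param>.
    public static final Pattern P =
            Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the trimmed first capture group of every match, in document order.
    public static List<String> extractAll(Pattern pattern, CharSequence html) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample input shaped like the snippet above; the paragraph text is invented.
        String html = "<div class=\"smallHd_contentTd\">\n"
                + "    <div class=\"breadcrumb\">Home</div>\n"
                + "    <h2>Services</h2>\n"
                + "    <p>First paragraph</p>\n"
                + "    <p>Second paragraph</p>\n"
                + "</div>";

        for (String header : extractAll(H2, html)) {
            System.out.println(header);
        }
        for (String paragraph : extractAll(P, html)) {
            System.out.println(paragraph);
        }
    }
}
```

In the question's code, the CharSequence returned by getURLContent could be passed to extractAll in place of the sample string, with the caveat that every h2 and p on the page would then be returned, not just those inside the target div.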

Hope this helps.
