Java regex to retrieve link from text

I have an input String:

String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";

I want to convert this text to:

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it

So here:

1) I want to replace the link tag with a plain link. If the tag contains a label, it should go in parentheses after the URL.

2) If the URL is relative, I want to prefix the base URL (http://www.google.com).

3) I want to append a parameter to the URL (&myParam=pqr).

I am having issues retrieving the tag with its URL and label, and replacing it.

I wrote something like:

public static void main(String[] args) {
    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
    text = text.replaceAll("&lt;", "<");
    text = text.replaceAll("&gt;", ">");
    text = text.replaceAll("&amp;", "&");

    // this is not working
    Pattern p = Pattern.compile("href=\"(.*?)\"");
    Matcher m = p.matcher(text);
    String url = null;
    if (m.find()) {
        url = m.group(1);

    }
}

// helper method to append new query params once I have the url
public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
    URI oldUri = new URI(uriToUpdate);
    String newQueryParams = oldUri.getQuery();
    if (newQueryParams == null) {
        newQueryParams = queryParamsToAppend;
    } else {
        newQueryParams += "&" + queryParamsToAppend;  
    }
    URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
            oldUri.getPath(), newQueryParams, oldUri.getFragment());
    return newUri;
}

Edit1:

Pattern p = Pattern.compile("HREF=\"(.*?)\"");

This works. But I want it to be capitalization-agnostic: Href, HRef, href, hrEF, etc. should all work.

Also, how do I handle text that has several URLs?

Edit2:

Some progress.

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
  url = m.group(1);
  System.out.println(url);
}

This handles the case of multiple URLs.

The last pending issue: how do I get hold of the label and replace the anchor tags in the original text with the URL and label?

Edit3:

By multiple URL cases, I mean there are multiple URLs present in the given text.

String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
    url = m.group(1); // this variable should contain the link URL
    url = appendBaseURI(url); // prefixes the base URL (helper not shown here)
    url = appendQueryParams(url, "license=ABCXYZ").toString(); // the helper above returns a URI
    System.out.println(url);
}

You can use Apache Commons Text StringEscapeUtils to decode the HTML entities and then replaceAll, i.e.:

import org.apache.commons.text.StringEscapeUtils;

String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+\"(.*?)\">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
System.out.print(output);
// Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it

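import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.text.StringEscapeUtils;

public static void main(String[] args) throws URISyntaxException {
    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
    // Decode the HTML entities so the tags become matchable.
    text = StringEscapeUtils.unescapeHtml4(text);
    // Group 1: href value, group 2: label.
    Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        // The matcher keeps scanning the original string; each matched tag is replaced in the updated copy.
        text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
    }
    System.out.println(text);
}

// appendQueryParams (the helper from the question) throws URISyntaxException, so it is declared here too.
private static String cleanUrlPart(String url, String label) throws URISyntaxException {
    // Prefix the base URL when the link is relative.
    if (!url.startsWith("http") && !url.startsWith("www")) {
        if (url.startsWith("/")) {
            url = "http://www.google.com" + url;
        } else {
            url = "http://www.google.com/" + url;
        }
    }
    url = appendQueryParams(url, "myParam=pqr").toString();
    // Append the label in parentheses only when there is one.
    if (label != null && !label.isEmpty()) url += " (" + label + ")";
    return url;
}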

Output:

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text
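
As an alternative to the startsWith checks in cleanUrlPart, java.net.URI.resolve can do the base-URL prefixing. The following is only a sketch under assumptions: the class name BaseUrlResolver and the method toAbsolute are made up for illustration, and every href value is assumed to be a syntactically valid URI (resolve throws IllegalArgumentException otherwise).

```java
import java.net.URI;

class BaseUrlResolver {
    // Base URL taken from the question; the trailing slash gives it a root path.
    private static final URI BASE = URI.create("http://www.google.com/");

    // Relative references are resolved against BASE; absolute URLs are returned unchanged.
    static String toAbsolute(String href) {
        return BASE.resolve(href).toString();
    }

    public static void main(String[] args) {
        // prints http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz
        System.out.println(toAbsolute("/relative-path/fruit.cgi?param1=abc&param2=xyz"));
        // prints http://www.google.com/fruit.cgi
        System.out.println(toAbsolute("fruit.cgi"));
        // prints https://example.org/x?y=1
        System.out.println(toAbsolute("https://example.org/x?y=1"));
    }
}
```

One difference from cleanUrlPart: a value such as www.example.com/page has no scheme, so resolve treats it as a relative path instead of leaving it alone.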

// this is not working

Because your regex is case-sensitive.

Try:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
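
A quick way to check the flag is to run the same pattern against differently capitalized attributes. This is an illustrative sketch only; the class name HrefCaseDemo, the method firstHref and the sample strings are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefCaseDemo {
    // Same pattern as above; the flag lets "href" match in any capitalization.
    private static final Pattern HREF =
            Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

    // Returns the first href value found in the input, or null if there is none.
    static String firstHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<A HREF=\"/a.cgi\">x</A>",
            "<a href=\"/a.cgi\">x</a>",
            "<a hrEF=\"/a.cgi\">x</a>"
        };
        for (String s : samples) {
            System.out.println(firstHref(s)); // each line prints /a.cgi
        }
    }
}
```

Writing the pattern as "(?i)href=\"(.*?)\"" should behave the same way, since (?i) is the inline form of the flag.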

Edit1:
To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and m.group(0).
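
With more than one link in the text, note that the lookbehind (?<=>) is also satisfied immediately after a closing </A>, so a later match may include the text between two links. One possible alternative is to capture the label in a group instead. The sketch below assumes the HTML entities are already decoded; LabelExtractor and labels are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelExtractor {
    // Group 1 is the text between an opening <a ...> tag and its closing </a>.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE);

    // Collects the label of every anchor tag, in order of appearance.
    static List<String> labels(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "link <A HREF=\"/fruit.cgi\">URL Label</A> and another "
                + "<A HREF=\"/vegetables.cgi\">URL2 Label</A> end";
        System.out.println(labels(text)); // prints [URL Label, URL2 Label]
    }
}
```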

Edit2:
To replace the whole tag (including the label) with your final string, use:

text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
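
If the replacement only needs the URL and label rearranged, the captured groups can be referenced as $1 and $2 in the replacement string. A minimal sketch, assuming the entities are already decoded and every href is relative with a leading slash; TagReplaceDemo and flatten are hypothetical names.

```java
class TagReplaceDemo {
    // Group 1 is the href value, group 2 the label; $1 and $2 refer to them.
    static String flatten(String html) {
        return html.replaceAll("(?i)<a href=\"(.*?)\">(.*?)</a>",
                "http://www.google.com$1 ($2)");
    }

    public static void main(String[] args) {
        String text = "as <A HREF=\"/fruit.cgi?a=1\">URL Label</A> and "
                + "<a href=\"/veg.cgi\">Veg</a>.";
        // prints as http://www.google.com/fruit.cgi?a=1 (URL Label) and http://www.google.com/veg.cgi (Veg).
        System.out.println(flatten(text));
    }
}
```

Appending a query parameter or skipping the prefix for absolute URLs depends on each matched URL, so those steps need per-match logic rather than a fixed replacement string.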

Almost there:

public static void main(String[] args) throws URISyntaxException {
        String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
        text = StringEscapeUtils.unescapeHtml4(text);
        System.out.println(text);
        System.out.println("**************************************");
        Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
        Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher matcherTag = patternTag.matcher(text);

        while (matcherTag.find()) {
            String href = matcherTag.group(1); // href
            String linkText = matcherTag.group(2); // link text
            System.out.println("Href: " + href);
            System.out.println("Label: " + linkText);
            Matcher matcherLink = patternLink.matcher(href);
            String finalText = null;
            while (matcherLink.find()) {
                String link = matcherLink.group(1);
                System.out.println("Link: " + link);
                finalText = getFinalText(link, linkText);
                break;
            }
            System.out.println("***************************************");
            // replacing logic goes here
        }
        System.out.println(text);
    }

    public static String getFinalText(String link, String label) throws URISyntaxException {
        link = appendBaseURI(link);
        link = appendQueryParams(link, "myParam=ABCXYZ");
        return link + " (" + label + ")";
    }

    public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
        URI oldUri = new URI(uriToUpdate);
        String newQueryParams = oldUri.getQuery();
        if (newQueryParams == null) {
            newQueryParams = queryParamsToAppend;
        } else {
            newQueryParams += "&" + queryParamsToAppend;  
        }
        URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
                oldUri.getPath(), newQueryParams, oldUri.getFragment());
        return newUri.toString();
    }

    public static String appendBaseURI(String url) {
        String baseURI = "http://www.google.com/";
        if (url.startsWith("/")) {
            url = url.substring(1, url.length());
        }
        if (url.startsWith(baseURI)) {
            return url;
        } else {
            return baseURI + url;
        }
    }
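
One possible way to fill in the "replacing logic goes here" step is Matcher.appendReplacement together with appendTail, which rebuilds the text while the matcher walks through it. This is a sketch under stated assumptions, not the author's solution: the names LinkFlattener and flatten are hypothetical, href attributes are assumed to be double-quoted, URI.resolve stands in for appendBaseURI, and the entity decoding is reduced to three replace calls (covering only &lt;, &gt; and &amp;) so the example has no dependency on Commons Text.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFlattener {
    private static final URI BASE = URI.create("http://www.google.com/");
    // Group 1: href value, group 2: label.
    private static final Pattern TAG = Pattern.compile(
            "<a\\b[^>]*?href=\"(.*?)\"[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE);

    static String flatten(String escaped, String extraParam) throws URISyntaxException {
        // Decode only the three entities used in the sample; &amp; goes last.
        String text = escaped.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
        Matcher m = TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Make the link absolute, then append the extra query parameter.
            URI uri = BASE.resolve(m.group(1));
            String query = uri.getQuery() == null
                    ? extraParam : uri.getQuery() + "&" + extraParam;
            URI updated = new URI(uri.getScheme(), uri.getAuthority(),
                    uri.getPath(), query, uri.getFragment());
            String label = m.group(2);
            String replacement = label.isEmpty()
                    ? updated.toString() : updated + " (" + label + ")";
            // quoteReplacement keeps any '$' or '\' in the URL or label literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last link
        return out.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
        System.out.println(flatten(text, "myParam=pqr"));
    }
}
```

Because the output is built in a separate buffer, the original text is never modified during matching, and two links with identical markup are each replaced exactly once.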
