
Detect and extract URL from a string?

This is an easy question, but I just don't get it. I want to detect URLs in a string and replace them with shortened ones.

I found this expression on Stack Overflow, but the result is just http:

Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
boolean result = m.find();
while (result) {
    for (int i = 1; i <= m.groupCount(); i++) {
        String url=m.group(i);
        str = str.replace(url, shorten(url));
    }
    result = m.find();
}
return html;

Is there any better idea?

Let me preface this by saying that I'm not a huge advocate of regex for complex cases. Trying to write the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and test cases to handle the issues we've found. It's definitely not trivial:

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
    int matchStart = matcher.start(1);
    int matchEnd = matcher.end();
    // now you have the offsets of a URL match
}
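
As a minimal sketch of turning those offsets into the matched text: the pattern may consume one leading non-word character, so the URL itself begins at start(1) rather than start(). The class name UrlOffsetsSketch and the method findUrls are hypothetical names used only for illustration; the pattern is copied from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlOffsetsSketch {
    // Same expression as urlPattern above
    private static final Pattern urlPattern = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: collects every URL found in the text
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = urlPattern.matcher(text);
        while (matcher.find()) {
            // start(1) skips the optional leading non-word character
            urls.add(text.substring(matcher.start(1), matcher.end()));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrls("foo bar http://example.com baz"));
    }
}
```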

/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
    List<String> containedUrls = new ArrayList<String>();
    String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
    Matcher urlMatcher = pattern.matcher(text);

    while (urlMatcher.find())
    {
        containedUrls.add(text.substring(urlMatcher.start(0),
                urlMatcher.end(0)));
    }

    return containedUrls;
}

Example:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
    System.out.println(url);
}

Prints:

https://stackoverflow.com/
http://www.google.com/

m.group(1) gives you the first matching group, that is to say the contents of the first capturing parentheses. Here it's (https?|ftp|file).

You should check what m.group(0) contains, or surround your whole pattern with parentheses and use m.group(1) again.

You need to call find again to match the next URL and then use the new set of groups.
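
One way to apply this to the code in the question is sketched below, using m.group() (the same as m.group(0)) for the whole match and Matcher.appendReplacement to build the result in a single pass. The shorten method here is only a hypothetical stand-in for the asker's own shortener, and the class name UrlShortenSketch is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlShortenSketch {
    // Pattern taken unchanged from the question
    private static final Pattern URL = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical placeholder: a real implementation would call a URL shortener
    static String shorten(String url) {
        return "http://sho.rt/" + Integer.toHexString(url.hashCode());
    }

    static String replaceUrls(String str) {
        Matcher m = URL.matcher(str);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(); // whole match, not just the scheme
            // quoteReplacement keeps '$' and '\' in the short URL literal
            m.appendReplacement(out, Matcher.quoteReplacement(shorten(url)));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceUrls("see http://example.com/a?b=1 now"));
    }
}
```

Building the output through the matcher also avoids calling str.replace on a string the matcher is no longer looking at.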

Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then this could be fine. Your problem here is that you have a capturing group, the (), and it only surrounds the first part, http...

I would make this part a non-capturing group using (?:) and put parentheses around the whole thing.

"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

With some extra parentheses around the whole thing (except the word boundary at the start) it should match the whole domain name:

"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

I don't think that regex matches the whole URL, though.
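
To check what the parenthesised pattern above actually captures, a small sketch along these lines can help. The class name GroupCheckSketch and the method groups are hypothetical; only the pattern comes from this answer. With the extra outer parentheses, group 1 is the whole URL and group 2 is the scheme.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCheckSketch {
    // Pattern from the answer: outer group = whole URL, inner group = scheme
    private static final Pattern URL = Pattern.compile(
            "\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Returns {group(1), group(2)} for the first match, or null if nothing matches
    static String[] groups(String text) {
        Matcher m = URL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groups("see https://example.com/path. Thanks");
        System.out.println("whole URL: " + g[0]);
        System.out.println("scheme:    " + g[1]);
    }
}
```

Note that the final character class excludes '.', so sentence punctuation directly after a URL is left out of the match.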

https://github.com/linkedin/URL-Detector

<dependency>
    <groupId>io.github.url-detector</groupId>
    <artifactId>url-detector</artifactId>
    <version>0.1.23</version>
</dependency>

I tried all the examples here for extracting different URLs like these, and none of them works perfectly for all of them:

http://example.com
https://example.com.ua
www.example.ua
https://stackoverflow.com/question/5713558/detect-and-extract-url-from-a-string
https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8

So I wrote my own regex and a method that works on text containing multiple links:

private static final String LINK_REGEX = "((http:\\/\\/|https:\\/\\/)?(www.)?(([a-zA-Z0-9-]){2,2083}\\.){1,4}([a-zA-Z]){2,6}(\\/(([a-zA-Z-_\\/\\.0-9#:?=&;,]){0,2083})?){0,2083}?[^ \\n]*)";
private static final String TEXT_WITH_LINKS_EXAMPLE = "link1:http://example.com link2: https://example.com.ua link3 www.example.ua\n" +
        "link4- https://stackoverflow.com/questions/5713558/detect-and-extract-url-from-a-string\n" +
        "link5 https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8";

And a method which returns an ArrayList with the links:

private ArrayList<String> getAllLinksFromTheText(String text) {
    ArrayList<String> links = new ArrayList<>();
    Pattern p = Pattern.compile(LINK_REGEX, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        links.add(m.group());
    }
    return links;
}

That's all. Call this method with TEXT_WITH_LINKS_EXAMPLE as the parameter and you will receive five links from the text.

This little code snippet / function will effectively extract URL strings from a string in Java. I found the basic regex for doing it here, and used it in a Java function.

I expanded on the basic regex a bit with the part “|www[.]” in order to catch links not starting with “http://”.

Enough talk (it is cheap), here's the code:

// Pull all links from the body for easy retrieval
private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<>();

    // The optional "(" lets a URL wrapped in parentheses be detected and unwrapped below
    String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip the surrounding parentheses from matches like "(http://example.com)"
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}
