Extracting URLs using regular expressions?

Question

I have the following strings:

1. !abc.com
2. abc.com!
3. Hey there this is .abc.com!. This is amazing

I am trying to find a way to identify special characters directly before or after a URL in the string, and to add a space only when the special character touches the beginning or end of the URL, e.g.

!abc.com -> ! abc.com
abc.com! -> abc.com !
Hey there this is .abc.com!. This is amazing -> Hey there this is . abc.com !.This is amazing

What would be a good way to handle this scenario?

I tried the following regex: re.match('^.*$', w). But this seems very generic, as it does nothing to single out the URL. Any advice or suggestion would be greatly appreciated.

Answer 1

The trick is to:

Find all URLs in the string
Build a new (empty) string
For every URL match
- Add the text up until the match
- Look just before the URL, add whitespace if needed
- Look just after the URL, add whitespace if needed
- Repeat
Add the end of the original string

This should work:

import re
import string

# Your input texts + one extreme case with multiple URLs
texts = [
    "!abc.com",
    "abc.com!",
    "Hey there this is .abc.com!. This is amazing",
    "Hey there this is .abc.com!. This is amazing... Hey there this is .abc.com!. This is amazing",
]


# From (match any URL): https://www.regextester.com/93652
pattern = r"(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?"

# Loop the texts
for text in texts:

    # Start building the new text
    new_text = ""
    position = 0

    # Loop over the matches
    for match in re.finditer(pattern, text):

        # Extract the start and end positions of the match (URL)
        start, end = match.span()

        # Add until the start of this match
        new_text += text[position:start]

        # Check the character just before the match
        if start > 0:
            if text[start - 1] in string.punctuation:
                # Add a space
                new_text += " "

        # Add the actual match
        new_text += text[start:end]

        # Check the character after the match
        if end < len(text):
            if text[end] in string.punctuation:
                # Add a space
                new_text += " "

        # Move to the end of the match
        position = end

    # Add the end of the original string
    new_text += text[position:]

    # Show the new string
    print(new_text)

Output:

! abc.com
abc.com !
Hey there this is . abc.com !. This is amazing
Hey there this is . abc.com !. This is amazing... Hey there this is . abc.com !. This is amazing
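
If you would rather not track positions by hand, the same idea can probably be expressed with re.sub and a callback that decides, per match, whether to pad the URL. This is only a sketch under the same assumptions as the loop above: the names space_around_urls and URL_PATTERN are made up for illustration, and the pattern is the one quoted earlier (lowercase only, not a complete URL matcher).

```python
import re
import string

# Same URL pattern as in the answer (from https://www.regextester.com/93652),
# split over two lines for readability. URL_PATTERN is a hypothetical name.
URL_PATTERN = re.compile(
    r"(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?"
    r"[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?"
)


def space_around_urls(text):
    """Hypothetical helper: pad each URL with a space where it touches punctuation."""

    def pad(match):
        start, end = match.span()
        # Space before the URL if the preceding character is punctuation
        before = " " if start > 0 and text[start - 1] in string.punctuation else ""
        # Space after the URL if the following character is punctuation
        after = " " if end < len(text) and text[end] in string.punctuation else ""
        return before + match.group(0) + after

    # re.sub calls pad() once per non-overlapping match and keeps the rest of the text
    return URL_PATTERN.sub(pad, text)


print(space_around_urls("!abc.com"))
print(space_around_urls("abc.com!"))
print(space_around_urls("Hey there this is .abc.com!. This is amazing"))
```

For the three strings from the question this should print the same lines as the loop version. The behaviour is identical by construction; the callback simply moves the "look before, look after" checks into one place.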
