
Why does this regex return an empty list?

New programmer here. I am trying to get all of the hashtags and links from a string. The regular expressions return the desired result on their own; however, an empty list is returned when they are combined. How can I fix this?

import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
     'for Repeated API Calls - http://pybit.es/requests-cache.html '
     '#python #APIs')


# Get all hashtags and links from tweet
def get_hashtags_and_links(tweet=tweet):
    tweet_regex = re.compile(r'''(
                         \(#\w+\)
                         \(https://[^\s]+\)
                         )''', re.VERBOSE)

    tweet_object = tweet_regex.findall(tweet)
    print(tweet_object)

get_hashtags_and_links()

You are looking for #\w+ (enclosed in literal parentheses) immediately followed by https://[^\s]+ (also enclosed in literal parentheses), a sequence that appears nowhere in your text.

Instead, use | (the alternation bar, meaning "or"):

re.compile(r'''(
            \(#\w+\)|
            \(https://[^\s]+\)
                     )''', re.VERBOSE)

But, as pointed out, \( looks for an actual parenthesis character in the text (it does not create a group).
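
To make the difference concrete, here is a small sketch comparing escaped and unescaped parentheses. The sample string is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample text, chosen so that one hashtag sits inside parentheses.
sample = "(#python) and #APIs"

# \( and \) match literal parenthesis characters in the text.
literal = re.findall(r"\(#\w+\)", sample)

# ( and ) only form a capturing group; they match nothing by themselves.
grouped = re.findall(r"(#\w+)", sample)

print(literal)  # ['(#python)']
print(grouped)  # ['#python', '#APIs']
```

The tweet in the question contains no parentheses at all, so any pattern that requires literal ones cannot match it.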

So you probably just want

"(#\w+)|(https?://[^\s]+)"

You can also use non-capturing groups, written (?:...), if you prefer:

"((?:#\w+)|(?:https?://[^\s]+))"

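The two patterns above match the same text, but re.findall returns them differently. A minimal sketch using the tweet from the question (the variable names are only illustrative):

```python
import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

# Two capturing groups: findall returns one tuple per match,
# with an empty string for the group that did not take part.
two_groups = re.findall(r"(#\w+)|(https?://[^\s]+)", tweet)

# One outer capturing group around non-capturing alternatives:
# findall returns a flat list of strings.
one_group = re.findall(r"((?:#\w+)|(?:https?://[^\s]+))", tweet)

print(two_groups)
# [('', 'http://pybit.es/requests-cache.html'), ('#python', ''), ('#APIs', '')]
print(one_group)
# ['http://pybit.es/requests-cache.html', '#python', '#APIs']
```

If a flat list is what you need, the second form is usually the more convenient one.
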
You can use the regex as follows:

    http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")

#[A-Za-z0-9]+ --- This matches a # followed by one or more ASCII letters or digits, i.e. a hashtag.

(\w+:\/\/\S+) --- This matches URLs in the tweet: a scheme, then ://, then everything up to the next whitespace.
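
Because this pattern has two capturing groups, findall yields tuples. One possible way to get a flat list, sketched here with the tweet from the question, is to iterate with finditer and take the whole match (the names raw_matches and flat are only illustrative):

```python
import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")

# findall returns one (url, hashtag) tuple per match.
raw_matches = http_hash_search.findall(tweet)

# group(0) is the full text of each match, regardless of which group matched.
flat = [m.group(0) for m in http_hash_search.finditer(tweet)]

print(raw_matches)
# [('http://pybit.es/requests-cache.html', ''), ('', '#python'), ('', '#APIs')]
print(flat)
# ['http://pybit.es/requests-cache.html', '#python', '#APIs']
```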

Whatever you want to match with your regex, make sure you escape the # character, which is special when you compile the regex with the re.X / re.VERBOSE flag. This flag enables comments inside the pattern that start with an unescaped hash symbol and run to the end of the line.

As the Python re documentation puts it: when a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
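
The rule above can be checked with a short sketch. The sample strings and variable names here are assumptions made for illustration; only the last pattern is copied from the question.

```python
import re

text = "see #python"  # hypothetical sample text

# Unescaped '#': in verbose mode "#\w+" is a comment, so only "see \s*" remains.
unescaped = re.compile(r"""
    see \s*
    #\w+
    """, re.VERBOSE)

# Escaping '#' with a backslash keeps it part of the pattern.
escaped = re.compile(r"""
    see \s*
    \#\w+
    """, re.VERBOSE)

# Putting '#' in a character class also keeps it part of the pattern.
in_class = re.compile(r"""
    see \s*
    [#]\w+
    """, re.VERBOSE)

print(unescaped.findall(text))  # ['see ']
print(escaped.findall(text))    # ['see #python']
print(in_class.findall(text))   # ['see #python']

# The question's pattern: "#\w+\)" on the first line is swallowed as a comment,
# so what is left requires two literal "(" before https:// and a closing ")".
question = re.compile(r'''(
    \(#\w+\)
    \(https://[^\s]+\)
    )''', re.VERBOSE)
print(question.findall("((https://example.com)"))  # ['((https://example.com)']
```

This is why the combined pattern in the question finds nothing in the tweet: the hashtag part is silently dropped as a comment, and the remainder only matches a URL wrapped in literal parentheses.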

So, assuming you want to match either hashtags or URLs, you may use

tweet_regex = re.compile(r'''
                     \#\w+             # Hashtag pattern
                     |                 # or
                     https?://\S+      # URLs
                     ''', re.VERBOSE)

Calling tweet_regex.findall(tweet) on the tweet from the question outputs:

['http://pybit.es/requests-cache.html', '#python', '#APIs']

