Why does this regex return an empty list?
New programmer here. I am trying to get all of the hashtags and links from a string. Each regular expression returns the desired result on its own; however, an empty list is returned when they are combined. How can I fix this?
import re
tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')
# Get all hashtags and links from tweet
def get_hashtags_and_links(tweet=tweet):
    tweet_regex = re.compile(r'''(
        \(#\w+\)
        \(https://[^\s]+\)
        )''', re.VERBOSE)
    tweet_object = tweet_regex.findall(tweet)
    print(tweet_object)
get_hashtags_and_links()
You are looking for #\w+ (enclosed in literal parentheses) immediately followed by https://[^\s]+ (also enclosed in literal parentheses), which appears nowhere in your text. Instead, use the | (the "or" bar) so that either one can match:
re.compile(r'''(
\(#\w+\)|
\(https://[^\s]+\)
)''', re.VERBOSE)
But as pointed out, \( looks for an actual parenthesis (it is not grouping), so you probably just want:
"(#\w+)|(https?://[^\s]+)"
You can also use non-capturing groups, (?:...), if you want:
"((?:#\w+)|(?:https?://[^\s]+))"
You can use the regex as follows:
http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")
(#[A-Za-z0-9]+): matches a hashtag, that is, a # followed by one or more letters or digits.
(\w+:\/\/\S+): matches URLs in the tweet, that is, a scheme, then ://, then any run of non-whitespace characters.
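As a rough sketch of how this pattern behaves on the tweet from the question: because it contains two capturing groups, `findall` returns one tuple per match, with an empty string for the group that did not take part. Iterating with `finditer` and reading `group(0)` is one way to get plain strings instead. The names `as_tuples` and `as_strings` are only illustrative.

```python
import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")

# Two capturing groups: findall yields (link, hashtag) tuples.
as_tuples = http_hash_search.findall(tweet)
print(as_tuples)
# [('http://pybit.es/requests-cache.html', ''), ('', '#python'), ('', '#APIs')]

# group(0) is the whole match, whichever alternative produced it.
as_strings = [m.group(0) for m in http_hash_search.finditer(tweet)]
print(as_strings)
# ['http://pybit.es/requests-cache.html', '#python', '#APIs']
```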
Whatever you want to search for with your regex, you need to make sure you escape the # character, which is special when you compile the regex with the re.X / re.VERBOSE flag. This option enables comments inside the regex pattern: a comment starts with an unescaped hash symbol and runs to the end of the line.
When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
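To make the effect concrete, here is a small sketch using the pattern from the question. Under re.VERBOSE everything from the first `#` to the end of that line is discarded, so what remains is effectively `(\(\(https://[^\s]+\))`. The test string `((https://example.com)` and the names `question_regex`, `escaped` and `in_class` are made up for illustration.

```python
import re

# The pattern from the question. Under re.VERBOSE the text from '#' to the
# end of that line is dropped, so the hashtag part never reaches the regex.
question_regex = re.compile(r'''(
    \(#\w+\)
    \(https://[^\s]+\)
    )''', re.VERBOSE)

# What is left requires two literal opening parentheses, an https URL,
# and one literal closing parenthesis.
print(question_regex.findall('((https://example.com)'))
# ['((https://example.com)']

# Escaping '#' or placing it in a character class keeps it as a literal.
escaped = re.compile(r''' \#\w+ ''', re.VERBOSE)
in_class = re.compile(r''' [#]\w+ ''', re.VERBOSE)
print(escaped.findall('#python #APIs'))   # ['#python', '#APIs']
print(in_class.findall('#python #APIs'))  # ['#python', '#APIs']
```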
So, assuming you want to match either hashtags or specific URLs, you may use:
tweet_regex = re.compile(r'''
\#\w+ # Hashtag pattern
| # or
https?://\S+ # URLs
''', re.VERBOSE)
See the Python code demo. Output:
['http://pybit.es/requests-cache.html', '#python', '#APIs']
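Putting it together, one possible rewrite of the function from the question is sketched below. It assumes you would rather return the matches than print them, and the final split into `hashtags` and `links` is an extra illustration that is not part of the original question.

```python
import re

TWEET_REGEX = re.compile(r'''
    \#\w+          # hashtag (the hash sign is escaped for verbose mode)
    |              # or
    https?://\S+   # http or https URL
    ''', re.VERBOSE)


def get_hashtags_and_links(tweet):
    """Return every hashtag and link in tweet, in order of appearance."""
    return TWEET_REGEX.findall(tweet)


tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

result = get_hashtags_and_links(tweet)
print(result)

# Illustrative extra step: separate the two kinds of match.
hashtags = [item for item in result if item.startswith('#')]
links = [item for item in result if not item.startswith('#')]
print(hashtags)
print(links)
```

Since the pattern has no capturing groups, `findall` returns the full matches as plain strings.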