[英]Optional match for beginning of line
I am trying to create a regular expression in Python that matches #hashtags. 我正在尝试在Python中创建一个与#hashtags匹配的正则表达式。 My definition on a hashtag is:
我对主题标签的定义是:
#
#
开头的作品 [ ,\\.]
[ ,\\.]
之外的所有字符[ ,\\.]
So in this text 所以在本文中
#This string cont#ains #four, and #only four #hashtags.
The hashes here are This
, four
, only
and hashtags
. 这里的哈希是
This
, four
, only
和hashtags
。
The problem I have is the optional check for the beginning of line. 我的问题是行首的可选检查。
[ \\.,]+
won't do it since it won't match the optional beginning. [ \\.,]+
不会执行此操作,因为它与可选的开头不匹配。 [ \\.,]?
won't do it since it matches too much. Example with + +示例
In []: re.findall('[ \.,]+#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['four', 'only', 'hashtags']
Example with ? 以?为例
In []: re.findall('[ \.,]?#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['This', 'ains', 'four', 'only', 'hashtags']
How can optional match the beginning of the line? 可选内容如何匹配行首?
This seems to work: 这似乎可行:
>>> re.findall(r'\B#([^,\W]+)', '#This string cont#ains #four, and #only four #hashtags.')
['This', 'four', 'only', 'hashtags']
\\B
: Matches the empty string, but only when it is not at the beginning or end of a word. \\B
:匹配空字符串,但仅当它不在单词的开头或结尾时才匹配。 This means that r'py\\B'
matches 'python'
, 'py3'
, 'py2'
, but not 'py'
, 'py.'
r'py\\B'
匹配'python'
, 'py3'
, 'py2'
,但不匹配'py'
, 'py.'
, or 'py!'
'py!'
. \\B
is just the opposite of \\b
, so is also subject to the settings of LOCALE
and UNICODE
. \\B
与\\b
相反,因此也受LOCALE
和UNICODE
的设置的限制。 \\W
: When the LOCALE
and UNICODE
flags are not specified, matches any non-alphanumeric character; \\W
:未指定LOCALE
和UNICODE
标志时,匹配任何非字母数字字符;否则,不匹配。 this is equivalent to the set [^a-zA-Z0-9_]
. [^a-zA-Z0-9_]
。 With LOCALE, it will match any character not in the set [0-9_]
, and not defined as alphanumeric for the current locale. [0-9_]
且未定义为当前语言环境的字母数字的任何字符。 If UNICODE
is set, this will match anything other than [0-9_]
plus characters classied as not alphanumeric in the Unicode character properties database. UNICODE
,则它将匹配[0-9_]
以及Unicode字符属性数据库中归类为非字母数字字符之外的任何字符。 Before your regex you can just tell what you don't want. 在使用正则表达式之前,您只需说出不需要的内容即可。
(?<!\w)(#[^ \.,]+)
With negative lookbehind you can do that 有了负面的眼神,你可以做到这一点
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.