Optional match for beginning of line

I am trying to create a regular expression in Python that matches #hashtags. My definition of a hashtag is:

  • It is a word that starts with a #
  • It can contain all characters except [ ,\.]
  • It can be anywhere in the text

So in this text:

#This string cont#ains #four, and #only four #hashtags.

the hashtags are This, four, only and hashtags.

The problem I have is the optional check for the beginning of the line.

  • [ \.,]+ won't do it, since it requires at least one separator and so never matches at the beginning of the line.
  • [ \.,]? won't do it, since it matches too much: a # in the middle of a word is accepted as well.

Example with +

In []: re.findall(r'[ \.,]+#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['four', 'only', 'hashtags']

Example with ?

In []: re.findall(r'[ \.,]?#([^ \.,]+)', '#This string cont#ains #four, and #only four #hashtags.')
Out[]: ['This', 'ains', 'four', 'only', 'hashtags']
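
To see why the ? variant also picks up ains, it can help to print the whole match instead of only the captured group. The following is a small sketch using re.finditer; the variable names are illustrative and not part of the question.

```python
import re

text = '#This string cont#ains #four, and #only four #hashtags.'

# Optional separator: when no separator precedes '#', the '?' simply
# matches nothing, so a '#' in the middle of a word is accepted too.
optional_sep = re.compile(r'[ .,]?#([^ .,]+)')

full_matches = [m.group(0) for m in optional_sep.finditer(text)]
captured = [m.group(1) for m in optional_sep.finditer(text)]

print(full_matches)  # ['#This', '#ains', ' #four', ' #only', ' #hashtags']
print(captured)      # ['This', 'ains', 'four', 'only', 'hashtags']
```

The match '#ains' has no leading separator at all, which is exactly the case the pattern was supposed to reject.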

How can I optionally match the beginning of the line?

This seems to work:

>>> re.findall(r'\B#([^,\W]+)', '#This string cont#ains #four, and #only four #hashtags.')
['This', 'four', 'only', 'hashtags']
  • \B : Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so is also subject to the settings of LOCALE and UNICODE.
  • \W : When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.
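
The two escapes quoted above can be checked directly. This is a minimal sketch, assuming Python 3, where str patterns are Unicode-aware by default; the sample string '#foo-bar baz#qux' is made up for illustration. It also shows that [^,\W] only admits word characters, which is narrower than the definition in the question.

```python
import re

# \B only matches where there is no word boundary, so 'py' must be
# followed by another word character.
samples = ['python', 'py3', 'py2', 'py', 'py.', 'py!']
inside_word = [s for s in samples if re.search(r'py\B', s)]
print(inside_word)  # ['python', 'py3', 'py2']

# [^,\W] means "not a comma and not a non-word character", i.e. word
# characters only, so the tag stops at the hyphen.
word_only = re.findall(r'\B#([^,\W]+)', '#foo-bar baz#qux')
print(word_only)  # ['foo']

# The character class from the question keeps the hyphen.
question_class = re.findall(r'\B#([^ .,]+)', '#foo-bar baz#qux')
print(question_class)  # ['foo-bar']
```

If hashtags may contain punctuation such as hyphens, keeping \B but using the question's own character class is probably closer to the stated definition.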

In front of your pattern you can state what must not precede the match.

(?<!\w)(#[^ \.,]+)

A negative lookbehind does exactly that: the # is only accepted when it is not preceded by a word character.
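
In Python, the lookbehind idea could be applied as in the sketch below. Moving the group so that it excludes the # is an adjustment made here, not part of the pattern as posted, and the alternation with ^ is a second option shown only for comparison.

```python
import re

text = '#This string cont#ains #four, and #only four #hashtags.'

# Negative lookbehind: '#' must not be preceded by a word character.
# At the very start of the string there is nothing before '#', so the
# lookbehind succeeds there as well.
with_hash = re.findall(r'(?<!\w)(#[^ .,]+)', text)
print(with_hash)  # ['#This', '#four', '#only', '#hashtags']

# Same idea with the group placed after '#', so only the tag is captured.
tags = re.findall(r'(?<!\w)#([^ .,]+)', text)
print(tags)  # ['This', 'four', 'only', 'hashtags']

# Alternative: require either the start of the string or a separator.
alt_tags = re.findall(r'(?:^|[ .,])#([^ .,]+)', text)
print(alt_tags)  # ['This', 'four', 'only', 'hashtags']
```

For multi-line input, passing re.MULTILINE lets ^ match at the start of every line; the character class would then probably need \n added as well, since a newline is not excluded by [^ .,].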
