
Regular expression to extract @name symbols from tweet

I would like to use a regular expression to extract only @patrick @michelle from the following sentence:

@patrick  @michelle we having diner @home tonight do you want to join?

Note: @home should not be included in the result, because it is neither at the beginning of the sentence nor followed by another @name.

Any solutions, tips, or comments would be really appreciated.

/(?:(?:@\S+\s+)+|^)@\S+/g

It first matches either one or more "@" tokens (an "@" followed by non-space characters and then whitespace) or the start of the line, and then matches another "@" followed by non-space characters.
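To make the structure of that pattern easier to follow, here is a sketch of the same expression written with Python's re.VERBOSE flag, one piece per line. The name LEADING_MENTIONS and the extra test strings are illustrative assumptions, not part of the original answer.

```python
import re

# Same expression as above, spelled out with re.VERBOSE so that each
# piece can carry a comment. LEADING_MENTIONS is a made-up name.
LEADING_MENTIONS = re.compile(r"""
    (?:
        (?:@\S+\s+)+   # one or more @tokens, each followed by whitespace
      |                # ... or ...
        ^              # the start of the string
    )
    @\S+               # the last @token of the run
""", re.VERBOSE)

tweet = "@patrick  @michelle we having diner @home tonight do you want to join?"

print(LEADING_MENTIONS.findall(tweet))                            # ['@patrick  @michelle']
print(LEADING_MENTIONS.findall("@patrick see you tonight"))       # ['@patrick']
print(LEADING_MENTIONS.findall("we having diner @home tonight"))  # []
```

Note that the run of names comes back as one string, including the whitespace between them, so it still has to be split to get the individual names.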

Note that on Twitter it is common for @name to be preceded by RT, or to appear in the middle or at the end of the tweet, e.g. http://twitter.com/ceetee/statuses/9874073403 . Basically, you can't tell whether an @name is really a name using just a regex, or even a parser. The best bet is to check whether http://twitter.com/name returns a 404 or not.
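A rough sketch of that 404 check in Python might look like the following. The function names http_status and looks_like_account are hypothetical, and the assumption that an unknown name yields a plain 404 comes from the answer above; the site may respond differently today (redirects, login walls, rate limits), so treat this as an illustration only.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url (performs a real network request)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return error.code


def looks_like_account(name, fetch_status=http_status):
    """Assumption: a name is treated as real unless its profile URL gives 404."""
    url = "http://twitter.com/" + name.lstrip("@")
    return fetch_status(url) != 404
```

The fetch_status parameter exists only so the logic can be exercised without touching the network, for example looks_like_account("@home", fetch_status=lambda url: 404).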

Well, at first I thought this failed because I looked at the groups that are returned:

>>> import re
>>> tweet = "@patrick  @michelle we having diner @home tonight do you want to join?"
>>> tw = re.compile(r"^((@\w*)\s+)*")
>>> tw.findall(tweet)
[('@michelle ', '@michelle')]
>>> tw.match(tweet).groups()
('@michelle ', '@michelle')

Note that a repeated group only keeps the last value it matched. But if you just grab group(), you get the whole matched string:

>>> tw.match(tweet).group()
'@patrick  @michelle '
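Since a repeated group only remembers its last match, one way to get every name is to take the whole leading run with group() and then run a second findall over it. The sketch below assumes that approach; the helper name leading_mentions is made up, and the pattern is tightened slightly (\w+ instead of \w*, and a name may also be followed by the end of the string).

```python
import re

tweet = "@patrick  @michelle we having diner @home tonight do you want to join?"

# The leading run: zero or more @names, each followed by whitespace
# or by the end of the string.
leading_run = re.compile(r"^(?:@\w+(?:\s+|$))*")


def leading_mentions(text):
    # The pattern can match the empty string, so match() never returns None.
    run = leading_run.match(text).group()
    return re.findall(r"@\w+", run)


print(leading_mentions(tweet))                    # ['@patrick', '@michelle']
print(leading_mentions("diner @home tonight"))    # []
```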

For grins, I'll try pyparsing:

>>> from pyparsing import Word, printables, OneOrMore
>>> atName = Word("@",printables)
>>> OneOrMore(atName).parseString(tweet).asList()
['@patrick', '@michelle']

Try this regular expression:

/^\s*@(\w+)\s+@(\w+)/

\s denotes whitespace characters and \w word characters.
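As a quick illustration, here is how this pattern behaves in Python (an assumption for demonstration; the answer itself does not name a language). The two capture groups return the names without the leading @, and the pattern expects exactly two leading names.

```python
import re

tweet = "@patrick  @michelle we having diner @home tonight do you want to join?"

# The same pattern, without the /.../ delimiters.
two_mentions = re.compile(r"^\s*@(\w+)\s+@(\w+)")

match = two_mentions.match(tweet)
names = match.groups() if match else ()
print(names)  # ('patrick', 'michelle')

# A tweet with only one leading mention does not match at all.
print(two_mentions.match("@patrick see you tonight"))  # None
```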

As long as the tweet starts with an @ and continues with more of them, this will do it. I tested it in PowerShell, so bear in mind that some regex engines behave a bit differently. This should also catch n names at the beginning of the line:

"^((@\\w+)\\s)+"

Maybe something like this, although you would have to split whatever is in the match group on whitespace to extract multiple IDs:

/^\s*(@\w+\s+)*\s+.*$/

You have tagged your post c#, so I assume you can use the .NET Regex implementation. Using .NET, the following regex will do:

(?<![^@]\w+\s+)(@\w+)

This will match any word starting with @ that is not preceded by a word without an @. Note that "dinner @home @8pm" will still break it, though.

See here for more details.

For PHP:

/^\s*@(\w+)\s+@(\w+)/

Thanks, KennyM.

In Python:

import re

msg = '@patrick  @michelle we having diner @home tonight do you want to join?'
re.findall(r'(?:(?:@\S+\s+)+|^)@\S+', msg)

This works with 1 or n @names at the beginning of the sentence.

Thank you all for the quick replies.

In Perl, you can exploit the /g match-more-than-once modifier, combined with the \G zero-width where-we-left-off assertion and list context, thus:

my $str = '@patrick  @michelle we having diner @home tonight do you want to join?';
my @matches = ($str =~ m/\G(\@\w+)\s*/g);

print join(', ', @matches) . "\n";

This should be robust across any number of initial @-strings.
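Python's re module has no \G assertion, but a similar where-we-left-off effect can be sketched by calling a compiled pattern's match() with an explicit start position in a loop. The names MENTION and leading_mentions below are illustrative, not taken from the answer above.

```python
import re

# One @name followed by optional whitespace, as in the Perl pattern.
MENTION = re.compile(r"(@\w+)\s*")


def leading_mentions(text):
    names = []
    pos = 0
    while True:
        # match() only succeeds if the pattern matches exactly at pos,
        # which plays the role of \G here.
        m = MENTION.match(text, pos)
        if m is None:
            break
        names.append(m.group(1))
        pos = m.end()  # continue where the previous match left off
    return names


tweet = "@patrick  @michelle we having diner @home tonight do you want to join?"
print(leading_mentions(tweet))  # ['@patrick', '@michelle']
```

The loop stops at the first position where no @name starts, so @home later in the tweet is never reached.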

For Python, check out: http://github.com/BonsaiDen/AtarashiiFormat
It will also give you the links and the tags.

And beware of using a simple regex: you will end up with a big mess, as I did before I converted the Twitter Text Java Library.

For C# I would do as follows:

@([A-Za-z0-9-_&;]+)

