Regular expression to extract @name symbols from a tweet
I would like to use a regular expression to extract only @patrick @michelle from the following sentence:
@patrick @michelle we having diner @home tonight do you want to join?
Note: @home should not be included in the result, because it is neither at the beginning of the sentence nor followed by another @name.
Any solutions, tips, or comments would be really appreciated.
/(?:(?:@\S+\s+)+|^)@\S+/g
It first matches either one or more runs of "@" followed by non-space characters and then whitespace, or the start of the line, and then matches another "@" followed by non-space characters.
Note that on Twitter it is common for @name to be preceded by RT, or to appear in the middle or at the end of the tweet, e.g. http://twitter.com/ceetee/statuses/9874073403 . Basically, you can't tell whether an @name is really a name using only a regex, or even a parser. The best bet is to check whether http://twitter.com/name returns a 404 or not.
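A minimal sketch of that existence check in Python, under a few assumptions: the helper names http_status and existing_mentions are made up for illustration, the status lookup is passed in as a parameter so the filter can be tried without network access, and whether the site still answers 404 for unknown names is not guaranteed.

```python
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for url (performs a real network request)."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code


def existing_mentions(mentions, fetch_status=http_status):
    """Keep only the mentions whose profile URL does not come back as 404."""
    kept = []
    for mention in mentions:
        # Assumed URL layout, taken from the answer above: http://twitter.com/name
        url = "http://twitter.com/" + mention.lstrip("@")
        if fetch_status(url) != 404:
            kept.append(mention)
    return kept
```

Passing a stand-in for fetch_status (for example, a function that returns 404 only for URLs ending in /home) lets you exercise the filter offline before pointing it at the real site.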
Well, at first I thought this failed because I looked at the groups that are returned:
>>> import re
>>> tweet = '@patrick @michelle we having diner @home tonight do you want to join?'
>>> tw = re.compile(r"^((@\w*)\s+)*")
>>> tw.findall(tweet)
[('@michelle ', '@michelle')]
>>> tw.match(tweet).groups()
('@michelle ', '@michelle')
Note that when a group repeats, it only keeps the last value it matched. But if you just grab group(), then you get the whole matched string:
>>> tw.match(tweet).group()
'@patrick @michelle '
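Since group() returns the whole leading run, one way to get the individual names is to split that string on whitespace. A small sketch, assuming the same sample tweet as in the question (the variable name names is only for illustration):

```python
import re

tweet = '@patrick @michelle we having diner @home tonight do you want to join?'

# Same pattern as above: a run of "@word" tokens anchored at the start of the string.
tw = re.compile(r"^((@\w*)\s+)*")

# group() is the entire matched prefix; split() breaks it on whitespace
# and discards the trailing space.
names = tw.match(tweet).group().split()
print(names)
```

Note that the pattern always matches, possibly with an empty string, so a tweet with no leading mentions simply gives an empty list.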
For grins, I'll try pyparsing:
>>> from pyparsing import Word, printables, OneOrMore
>>> atName = Word("@",printables)
>>> OneOrMore(atName).parseString(tweet).asList()
['@patrick', '@michelle']
Try this regular expression:
/^\s*@(\w+)\s+@(\w+)/
\s denotes whitespace characters and \w word characters.
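To see what this pattern captures, here is a small Python sketch (the function name first_two_mentions is made up for illustration). Because the pattern has exactly two capture groups, it yields two names without the leading @, and it does not match at all when the tweet starts with fewer than two mentions.

```python
import re

# The pattern from the answer above, written as a Python raw string.
TWO_MENTIONS = re.compile(r'^\s*@(\w+)\s+@(\w+)')


def first_two_mentions(text):
    """Return the first two leading names as a tuple, or None if there are not two."""
    m = TWO_MENTIONS.match(text)
    return m.groups() if m else None


print(first_two_mentions('@patrick @michelle we having diner @home tonight'))
print(first_two_mentions('@patrick we having diner @home tonight'))
```

If the number of leading mentions varies, a pattern with a fixed number of groups like this one will not be enough on its own.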
As long as the line starts with an @name and continues with more of them, this will do it. I tested it in PowerShell, so other regex engines may behave a bit differently. This should also catch n names at the beginning of the line:
"^((@\w+)\s)+"
Maybe something like this, although you would have to split whatever is in the match group on whitespace to extract the multiple IDs.
/^\s*((?:@\w+\s+)*).*$/
You have tagged your post C#, so I assume you can use the .NET Regex implementation. Using .NET, the following regex will do:
(?<![^@]\w+\s+)(@\w+)
This will match any word starting with @ that is not preceded by a word without an @. Note that "dinner @home @8pm" will still break it, though.
For PHP:
/^\s*@(\w+)\s+@(\w+)/
Thanks, KennyM.
In Python:
msg = '@patrick @michelle we having diner @home tonight do you want to join?'
import re
re.findall(r'(?:(?:@\S+\s+)+|^)@\S+', msg)  # raw string so \S and \s reach the regex engine unchanged
This works with 1 or n @names at the beginning of the sentence.
Thank you all for the quick replies.
In Perl, you can exploit the /g match-more-than-once modifier combined with the \G zero-width where-we-left-off assertion and list context, thus:
my $str = '@patrick @michelle we having diner @home tonight do you want to join?';
my @matches = ($str =~ m/\G(\@\w+)\s*/g);
print join(', ', @matches) . "\n";
This should be robust across any number of initial @-strings.
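Python's re module has no \G, but a compiled pattern's match() accepts a start position and only matches at that position, which allows a similar where-we-left-off loop. A sketch of that idea, with leading_mentions as a hypothetical helper name:

```python
import re

# One "@word" token plus any whitespace that follows it.
TOKEN = re.compile(r'(@\w+)\s*')


def leading_mentions(text):
    """Collect consecutive @names from the start of text, stopping at the first non-mention."""
    names = []
    pos = 0
    while True:
        # match(text, pos) must succeed exactly at pos, playing the role of \G.
        m = TOKEN.match(text, pos)
        if m is None:
            break
        names.append(m.group(1))
        pos = m.end()  # continue where the previous match left off
    return names


print(leading_mentions('@patrick @michelle we having diner @home tonight do you want to join?'))
```

As with the Perl version, the loop stops at the first token that is not an @name, so a later @home is never reached.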
For Python, check out: http://github.com/BonsaiDen/AtarashiiFormat
It will also give you the links and the tags.
And beware of using a simple regex: you will end up with a big mess, as I did before I converted the Twitter Text Java Library.
For C# I would do as follows:
@([A-Za-z0-9-_&;]+)