Regex not match pattern followed by horizontal ellipsis in string
I am trying to extract Twitter hashtags from text using regex in R, using str_match_all from the "stringr" package.
The problem is that sometimes the hashtag gets truncated, with a horizontal ellipsis character appended to the end of the text string, as shown in this example:
str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]
I can successfully extract a list of hashtags using the above code, but I want to exclude hashtags that are truncated (i.e. that end in a horizontal ellipsis character).
This is frustrating, as I have looked everywhere for a solution. The above code is the best I can come up with, but it clearly does not work.
Any help is deeply appreciated.
I suggest using regmatches with gregexpr and the #[^#\\s]+(?!…)\\b Perl-style regex:
x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\w+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\S+(?!…)\\b', x, perl=TRUE)
regmatches(x, m)
See demo on CodingGround
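As a rough sketch, the same idea can be wrapped in a small helper and applied to several strings at once. The function name extract_complete_hashtags and the sample vector are hypothetical, introduced here only for illustration. The sketch uses the \\w+ variant of the pattern and writes the ellipsis as the escape \u2026 to sidestep source-encoding problems.

```r
# Hypothetical helper (not part of the original answer): returns, for each
# input string, the hashtags that are not cut off by a horizontal ellipsis.
extract_complete_hashtags <- function(text) {
  # "\u2026" is the horizontal ellipsis character
  pattern <- "#\\w+(?!\u2026)\\b"
  m <- gregexpr(pattern, text, perl = TRUE)
  regmatches(text, m)  # list with one character vector per input string
}

tweets <- c("hello #goodbye #au\u2026",
            "#hashtag1 notHashtag #hashtag2 notHashtag #has\u2026")
result <- extract_complete_hashtags(tweets)
print(result)
```

Because gregexpr and regmatches are vectorised over the text argument, the helper returns a list with one element per tweet, and the truncated hashtags are dropped from each.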
The regex means:
# - Literal #
[^#\\s]+ - 1 or more characters other than # or whitespace (or \\w+ to match alphanumerics and underscore only, or \\S+ to match 1 or more non-whitespace characters)
(?!…)\\b - Match a word boundary that is not followed by a …
Result of running the pattern on the string from the question ("hello #goodbye #au…"):
[1] "#goodbye"
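To see why the trailing \\b matters, the following minimal sketch compares the \\w+ variant of the pattern with and without it. The variable names are illustrative, and the ellipsis is written as \u2026. Without the word boundary, the regex engine can backtrack by one character so that the lookahead succeeds, which leaves a shortened fragment of the truncated hashtag in the result.

```r
# Sketch: effect of the trailing \\b on a truncated hashtag.
x <- "hello #goodbye #au\u2026"  # "\u2026" is the horizontal ellipsis

m_with    <- gregexpr("#\\w+(?!\u2026)\\b", x, perl = TRUE)
m_without <- gregexpr("#\\w+(?!\u2026)", x, perl = TRUE)

with_boundary    <- regmatches(x, m_with)[[1]]
without_boundary <- regmatches(x, m_without)[[1]]

print(with_boundary)     # "#goodbye"
print(without_boundary)  # "#goodbye" "#a"  (backtracked fragment of "#au")
```

The \\b forces the match to end at a real word edge, so the backtracked match "#a" (which ends between "a" and "u") is rejected and the truncated hashtag is skipped entirely.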