Regex to not match a pattern followed by a horizontal ellipsis in a string

I am trying to extract Twitter hashtags from text using regex in R, using str_match_all from the "stringr" package.

The problem is that sometimes a hashtag gets truncated, with a horizontal ellipsis character (…) appended to the end of the text string, as shown in this example:

str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]

I can extract a list of hashtags using the above code, but I want to exclude hashtags that are truncated (i.e. those followed by a horizontal ellipsis character).

This is frustrating: I have looked everywhere for a solution, and the above code is the best I can come up with, but it clearly does not work.

Any help is deeply appreciated.

I suggest using regmatches with gregexpr and the #[^#\\s]+(?!…)\\b Perl-style regex:

x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\w+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\S+(?!…)\\b', x, perl=TRUE)
regmatches(x, m)

See the demo on CodingGround.

The regex means:

  • # - a literal #
  • [^#\\s]+ - one or more characters other than # or whitespace (or \\w+ to match only alphanumerics and underscores, or \\S+ to match one or more non-whitespace characters)
  • (?!…)\\b - the match must end at a word boundary where the next character is not …
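
To see why the pattern needs both the lookahead and the word boundary, the sketch below drops each piece in turn. It uses base R only; the helper name extract is made up for this illustration, and "\u2026" is simply the escaped form of the … character.

```r
# "\u2026" is the horizontal ellipsis character.
x <- "hello #goodbye #au\u2026"

# Hypothetical helper: return all matches of a Perl-style pattern in one string.
extract <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Word boundary only: "#au" still matches, because the position between
# "u" and the ellipsis is itself a word boundary.
only_boundary <- extract("#\\w+\\b", x)
print(only_boundary)    # "#goodbye" "#au"

# Lookahead only: the engine backtracks to "#a", which is followed by "u"
# rather than by the ellipsis, so the lookahead is satisfied.
only_lookahead <- extract("#\\w+(?!\u2026)", x)
print(only_lookahead)   # "#goodbye" "#a"

# Both together: backtracking cannot land on a word boundary, so the
# truncated tag is rejected entirely.
both <- extract("#\\w+(?!\u2026)\\b", x)
print(both)             # "#goodbye"
```

The lookahead rejects the full "#au", and the word boundary rejects every shorter prefix the engine tries after backtracking.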

Applied to the string from the question, "hello #goodbye #au…", the same pattern returns: [1] "#goodbye"
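
For comparison, here is a short sketch of why the pattern from the question lets the truncated tag through. It runs the pattern through gregexpr with perl = TRUE instead of stringr, on the assumption (not verified here) that both engines treat this pattern the same way; extract is a hypothetical helper, not part of any package.

```r
x <- "hello #goodbye #au\u2026"

# Hypothetical helper: return all matches of a Perl-style pattern in one string.
extract <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# The pattern from the question. [^...] is a character class, so it must
# consume exactly one character that is not the ellipsis.
original <- extract("#[[:alnum:]_+]*[^\u2026]", x)
print(original)   # "#goodbye " "#au"

# The lookahead version asserts without consuming anything.
fixed <- extract("#\\w+(?!\u2026)\\b", x)
print(fixed)      # "#goodbye"
```

With the original pattern, the negated class swallows the space after "#goodbye", and for the truncated tag the engine backtracks one step so that the class consumes the "u", which is why "#au" is still reported.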
