Regex that does not match a pattern followed by a horizontal ellipsis in a string

Question

I am trying to extract Twitter hashtags from text using regex in R, using str_match_all from the "stringr" package.

The problem is that sometimes the hashtag gets truncated, with a horizontal ellipsis character appended to the end of the text string, as shown in this example:

str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]

I can successfully extract a list of hashtags using the above code, but I want to exclude hashtags that are truncated (i.e., that end in a horizontal ellipsis character).

This is frustrating: I have looked everywhere for a solution, and the code above is the best I can come up with, but it clearly does not work.

Any help is deeply appreciated.

Answer 1

I suggest using regmatches with gregexpr and the Perl-style regex #[^#\\s]+(?!…)\\b:

x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
# Find every hashtag that ends at a word boundary and is not cut off by "…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl = TRUE)
# or m <- gregexpr('#\\w+(?!…)\\b', x, perl = TRUE)
# or m <- gregexpr('#\\S+(?!…)\\b', x, perl = TRUE)
regmatches(x, m)

See demo on CodingGround

The regex means:

# - Literal #
[^#\\s]+ - One or more characters other than # or whitespace (or \\w+ to match only alphanumerics and underscores, or \\S+ to match one or more non-whitespace characters)
(?!…)\\b - Match a word boundary that is not followed by … (the negative lookahead (?!…) fails when the next character is the ellipsis)
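
To see why the lookahead and the word boundary are both needed, here is a small base R sketch on the string from the question. The ellipsis is written as the escape \u2026 so the snippet does not depend on the file's encoding, and the variable names are only illustrative.

```r
x <- "hello #goodbye #au\u2026"

# Lookahead alone: the engine backtracks from "#au" to "#a", which is
# followed by "u" rather than the ellipsis, so a fragment leaks through.
no_boundary <- regmatches(x, gregexpr("#\\w+(?!\u2026)", x, perl = TRUE))[[1]]
print(no_boundary)    # "#goodbye" "#a"

# Lookahead plus \\b: "#a" is rejected because there is no word boundary
# between "a" and "u", so the truncated tag is dropped entirely.
with_boundary <- regmatches(x, gregexpr("#\\w+(?!\u2026)\\b", x, perl = TRUE))[[1]]
print(with_boundary)  # "#goodbye"
```

The word boundary is what stops the regex engine from settling for a shorter prefix of the truncated hashtag.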

Result for the string from the question ("hello #goodbye #au…"): [1] "#goodbye". For the x defined above, the matches are "#hashtag1" and "#hashtag2".
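
Since the question uses stringr, the same idea should carry over to that package, because its ICU-based regex engine also supports negative lookahead and \b. The following is a minimal sketch, assuming the stringr package is installed; it uses str_extract_all instead of the question's str_match_all because only the full match is needed, not capture groups.

```r
library(stringr)

x <- "hello #goodbye #au\u2026"

# str_extract_all returns a list with one character vector per input string,
# so [[1]] picks out the hashtags found in x.
tags <- str_extract_all(x, "#\\w+(?!\u2026)\\b")[[1]]
print(tags)
```

str_match_all would also accept this pattern, but it returns a list of character matrices (one column per capture group), which is more than this task requires.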

Answer 1: accepted, 2015-06-11 09:17:06
