
How to remove sentences with conjunctions in R

I have text like the following example.

Input

c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")

The expected output is

,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this

I tried:

  x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])

But it removed the whole sentence. How do I remove only the phrase that contains "but" and keep the rest of the phrases in each sentence?

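Note that you mixed up "\n" and "/n", which I corrected.

My idea for a solution:

1) Match "but" together with all the non-newline characters ([^\n]) before and after it.

2) (Edit) To address the issue Wiktor found, we also have to check that the character directly before and directly after "but" is not a letter ([^a-zA-Z]).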

x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
       ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")

> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this" 
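To illustrate why the [^a-zA-Z] guards in step 2 matter, here is a minimal sketch on a made-up string (the sample text and the variable names y, unguarded, guarded and at_start are assumptions for illustration, not part of the question). It also shows one limitation: each guard has to consume a character, so a "but" at the very start of the string is not matched.

```r
# Hypothetical sample: "butter" contains "but" but is not the conjunction.
y <- "Pass the butter\nI like it but not much\nThe end"

# Without the guards, the "butter" line is removed as well.
unguarded <- gsub("[^\n]*but[^\n]*", "", y)
print(unguarded)
# [1] "\n\nThe end"

# With the guards, only the line holding the standalone word "but" is removed.
guarded <- gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", y)
print(guarded)
# [1] "Pass the butter\n\nThe end"

# Limitation: there is no character before a leading "but" for the guard
# to match, so this string comes back unchanged.
at_start <- gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", "but why")
print(at_start)
# [1] "but why"
```

If a leading or trailing "but" can occur in the data, a word-boundary pattern may be a better fit than the character-class guards.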

You can use

x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)

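See the R demo online.

The PCRE pattern matches:

  • .* - any 0+ chars other than line break chars, as many as possible
  • \\bbut\\b - the whole word but (\\b is a word boundary)
  • .* - any 0+ chars other than line break chars, as many as possible
  • [\r\n]* - 0 or more line break chars.

Note that the first gsub has the perl=TRUE argument, which makes R parse the pattern with the PCRE regex engine, where . does not match line break chars. The second gsub uses the TRE (default) regex engine, where the (?n) inline modifier is needed to stop . from matching line break chars.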
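The engine difference can be seen on a small made-up string. This is only a sketch: the sample text and the names z, pat, pcre and tre are assumptions for illustration. It runs the same pattern once with perl=TRUE and once with the default engine and no (?n) modifier.

```r
# Hypothetical three-line sample; only the middle line contains "but".
z <- "keep this\ndrop this but only this\nkeep that"
pat <- ".*\\bbut\\b.*[\r\n]*"

# PCRE: . stops at line breaks, so only the middle line (and the
# line break that follows it) is removed.
pcre <- gsub(pat, "", z, ignore.case = TRUE, perl = TRUE)
print(pcre)
# [1] "keep this\nkeep that"

# Default TRE engine without (?n): . also matches line breaks, so the
# greedy .* on both sides of "but" swallows the whole string.
tre <- gsub(pat, "", z, ignore.case = TRUE)
print(tre)
# [1] ""
```

This is the same effect the question ran into: without perl=TRUE or (?n), a pattern built around .* cannot be confined to a single line.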

