Remove text after a specific word except certain characters in R

First post: let me know if I'm posting in the wrong place. I'm looking to remove text from a lot of data in R. Each line (string) looks like this:

example_sentence <- "John Doe and Jane Doe (C)"

I would like to keep only the first name (the first person listed) in every sentence, plus the parentheses and what's in them. Every pair of parentheses contains one or two letters (upper or lower case).

What I've tried:

library(stringr)  # str_remove(); stringr also re-exports the %>% pipe

example_sentence %>% str_remove("and.*")

This obviously removes the parentheses as well. I'm just getting to know regular expressions. I'm looking for something like:

[^(*)]

Can't get it to work. Any thoughts?

EDIT: Here's some more input as requested. Maybe it will help others! (och = "and" in Swedish)

[1] "Anders Ahlgren och Anders Åkesson (C)"           
[2] "Karin Nilsson (C)"                               
[3] "Edward Riedl (M)"                                
[4] "Per-Ingvar Johnsson och Anders Åkesson (C)"      
[5] "Per-Ingvar Johnsson och Annika Qarlsson (C)"     
[6] "Annika Qarlsson och Ulrika Carlsson i Skövde (C)"

Expected output:

[1] "Anders Ahlgren (C)"           
[2] "Karin Nilsson (C)"                               
[3] "Edward Riedl (M)"                                
[4] "Per-Ingvar Johnsson (C)"      
[5] "Per-Ingvar Johnsson (C)"     
[6] "Annika Qarlsson (C)"

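The [^(*)] pattern is a negated character class: it matches any single character other than ( , * and ) . Passed to str_remove, it removes only the first such character in the string (str_remove_all would remove every one of them), so on its own it cannot express "everything after a word except the parentheses".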
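To see what that character class does by itself, here is a small base R sketch. It uses sub() and gsub() with an empty replacement, which roughly play the roles of str_remove and str_remove_all here; the variable names first_only and all_matches are made up for illustration.

```r
example_sentence <- "John Doe and Jane Doe (C)"

# sub() replaces only the first match, so just the leading "J" is dropped.
first_only <- sub("[^(*)]", "", example_sentence)

# gsub() replaces every match, so only the two parentheses survive.
all_matches <- gsub("[^(*)]", "", example_sentence)

print(first_only)   # "ohn Doe and Jane Doe (C)"
print(all_matches)  # "()"
```

Neither result is what the question asks for, which is why the class needs to be anchored to the word that starts the part to delete.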

If you plan to remove the whole word "and" together with any chars other than ( and ) after it, you may use

 example_sentence %>% str_remove("\\band\\b[^()]*")

Or, using base R:或者,使用基础 R:

sub("\\band\\b[^()]*", "", example_sentence)

The pattern matches:

  • \band\b - the whole word "and" ( \b is a word boundary)
  • [^()]* - any char other than ( and ) , 0 or more occurrences.
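The same pattern should carry over to the Swedish sample from the question once "and" is swapped for "och". A minimal base R sketch, assuming the rows are held in a character vector (the names speakers and cleaned are hypothetical); sub() is vectorised, so no loop is needed:

```r
# Sample rows taken from the question; `speakers` is a made-up variable name.
speakers <- c(
  "Anders Ahlgren och Anders Åkesson (C)",
  "Karin Nilsson (C)",
  "Edward Riedl (M)",
  "Per-Ingvar Johnsson och Annika Qarlsson (C)"
)

# Remove the whole word "och" and everything after it up to the first "(".
# Rows without "och" do not match and are returned unchanged.
cleaned <- sub("\\boch\\b[^()]*", "", speakers)
print(cleaned)
```

The space before "och" is left in place, which is what keeps a single space between the name and the opening parenthesis.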

See the regex demo and an R demo. See also the regex graph:

[Image: regex graph]

Try this:

example_sentence <- "John Doe and Jane Doe (C)"

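spliting <- function(x)
{
  y <- strsplit(x,split = ' ')   # split on single spaces; returns a list
  z <- y[[1]]                    # tokens of the first (only) string
  z <- z[c(1,length(z))]         # keep the first word and the last token
  return(z)
}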

spliting(example_sentence)

[1] "John" "(C)"
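Note that this returns only the first word and the last token, for a single string. A possible extension of the same splitting idea is sketched below in base R. It assumes that names are joined by one known word ("och" here) and that the parenthesised tag is always the last space-separated token; first_name_and_tag and its arguments are hypothetical names, not part of the answer above.

```r
# Hypothetical helper: keep the tokens before the joining word, plus the last token.
first_name_and_tag <- function(x, word = "och") {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    tag <- tokens[length(tokens)]   # last token, e.g. "(C)"
    cut <- match(word, tokens)      # position of the first joining word, NA if absent
    end <- if (is.na(cut)) length(tokens) - 1 else cut - 1
    paste(c(tokens[seq_len(end)], tag), collapse = " ")
  }, character(1))
}

rows <- c("Per-Ingvar Johnsson och Annika Qarlsson (C)", "Karin Nilsson (C)")
print(first_name_and_tag(rows))
```

Compared with the regex answers this is more verbose, but each step can be inspected token by token.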

You might be able to do this with capture groups. As Ronak says, a few more example inputs/outputs would be helpful, as I'm not sure we know 100% of the possible forms you have in your data.

Here is a start in any case:

gsub('and.*(\\([^)]*\\)).*', '\\1', example_sentence)
# [1] "John Doe (C)"
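One way the capture-group idea might be adapted to the Swedish sample is sketched below, assuming the joining word is "och" and the parenthesised tag sits at the very end of each string. The pattern and the variable names (speakers, pattern, result) are illustrative only; perl = TRUE is used for the lazy quantifier and \s.

```r
speakers <- c(
  "Anders Ahlgren och Anders Åkesson (C)",
  "Karin Nilsson (C)",
  "Annika Qarlsson och Ulrika Carlsson i Skövde (C)"
)

# Group 1: the text before the first " och"; group 2: the final "(...)".
pattern <- "^(.*?)\\s+och\\b.*(\\([^)]*\\))$"

# Rows without "och" do not match the pattern and are returned unchanged.
result <- sub(pattern, "\\1 \\2", speakers, perl = TRUE)
print(result)
```

Capturing both pieces makes the spacing between name and tag explicit in the replacement rather than relying on what the match happens to leave behind.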
