Removing strings from rows for specific words

My data looks like:

Weather                           
   <chr>                             
 1 Snow Low clouds                   
 2 Snow Cloudy                       
 3 Drizzle Fog                       
 4 Thundershowers Partly cloudy      
 5 Thunderstorms More clouds than sun
 6 Sprinkles Partly cloudy           
 7 Heavy rain Broken clouds          
 8 Light rain Partly cloudy     

I am trying to use mutate to remove some text. For example, I would like the above to look like:

Weather                           
   <chr>                             
 1 Snow                   
 2 Snow                       
 3 Drizzle                      
 4 Thundershowers      
 5 Thunderstorms More clouds than sun
 6 Sprinkles Partly cloudy           
 7 Heavy rain           
 8 Light rain 

So I would like to remove the text after some specific words. If I have a vector of the following:

c("Snow", "Drizzle", "Heavy rain", "Light rain") 

Remove the text after these. However, I do not want to grep words such as Cloudy or Fog, since they occur as their own rows in the data, but something like Snow Light fog can be cut down to Snow.

Data:

d <- structure(list(Weather = c("Snow Low clouds", "Snow Cloudy", 
"Drizzle Fog", "Thundershowers Partly cloudy", "Thunderstorms More clouds than sun", 
"Sprinkles Partly cloudy", "Heavy rain Broken clouds", "Light rain Partly cloudy", 
"Rain showers Passing clouds", "Thundershowers Scattered clouds", 
"Thundershowers Passing clouds", "Light snow Overcast", "Snow Light fog", 
"Drizzle Broken clouds", "Light rain Fog", "Cloudy", "Thunderstorms Partly cloudy", 
"Heavy rain More clouds than sun", "Partly cloudy", NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -20L))

A general approach you can take here is to build a regex alternation of all target terms. Then match any of those terms followed by everything up to the end of the input, and replace the whole match with just the term.

terms <- c("Snow", "Drizzle", "Heavy rain", "Light rain")
regex <- paste0("\\b(", paste(terms, collapse="|"), ")\\b")
sub(paste0(regex, "\\s.*"), "\\1", d$Weather)

 [1] "Snow"                               "Snow"                              
 [3] "Drizzle"                            "Thundershowers Partly cloudy"      
 [5] "Thunderstorms More clouds than sun" "Sprinkles Partly cloudy"           
 [7] "Heavy rain"                         "Light rain"                        
 [9] "Rain showers Passing clouds"        "Thundershowers Scattered clouds"   
[11] "Thundershowers Passing clouds"      "Light snow Overcast"               
[13] "Snow"                               "Drizzle"                           
[15] "Light rain"                         "Cloudy"                            
[17] "Thunderstorms Partly cloudy"        "Heavy rain"                        
[19] "Partly cloudy"                      NA

Note that my output does not line up exactly with your expected output, but then again you did not include all target words in the suggested vector.

The regex I used was:

\b(Snow|Drizzle|Heavy rain|Light rain)\b

The trick here is that the above alternation is also a capture group, which lets us easily replace the match with just the term you want. You may add more terms to the vector to get the desired output.
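As an illustration of adding terms, here is a minimal sketch that reuses the same regex on the first eight rows only. Adding "Thundershowers" is an assumption read off the expected output in the question, and the names weather and trimmed are just for this example.

```r
# First eight rows of the question's data, shortened for illustration
weather <- c("Snow Low clouds", "Snow Cloudy", "Drizzle Fog",
             "Thundershowers Partly cloudy",
             "Thunderstorms More clouds than sun",
             "Sprinkles Partly cloudy", "Heavy rain Broken clouds",
             "Light rain Partly cloudy")

# "Thundershowers" is an assumed extra term, based on the expected output
terms <- c("Snow", "Drizzle", "Heavy rain", "Light rain", "Thundershowers")

# Alternation as a capture group, followed by whitespace and the rest of the string
regex <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b\\s.*")

# Keep only the captured term; strings without a match are returned unchanged
trimmed <- sub(regex, "\\1", weather)
print(trimmed)
```

Inside a dplyr pipeline the same call should fit in mutate(), for example mutate(Weather = sub(regex, "\\1", Weather)), since sub() is vectorised over the column.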

  • Maybe you can use the code below
v <- c("Snow", "Drizzle", "Heavy rain", "Light rain") 
pat <- paste0(v,collapse = "|")
unlist(regmatches(d$Weather,gregexpr(pat,d$Weather)))

such that

> unlist(regmatches(d$Weather,gregexpr(pat,d$Weather)))
[1] "Snow"       "Snow"       "Drizzle"    "Heavy rain" "Light rain" "Snow"      
[7] "Drizzle"    "Light rain" "Heavy rain"
  • If you want to append the extracted values to d as a new column, the result has to stay aligned with the rows. unlist() drops the rows without a match, so the extracted vector is shorter than Weather and ifelse() would recycle it against the wrong rows. Taking the first match per row (or NA when there is none) avoids this:
d <- within(d, X <- sapply(regmatches(Weather, gregexpr(pat, Weather)), function(m) if (length(m)) m[1] else NA_character_))

such that

> d
# A tibble: 20 x 2
   Weather                            X         
   <chr>                              <chr>     
 1 Snow Low clouds                    Snow      
 2 Snow Cloudy                        Snow      
 3 Drizzle Fog                        Drizzle   
 4 Thundershowers Partly cloudy       NA        
 5 Thunderstorms More clouds than sun NA        
 6 Sprinkles Partly cloudy            NA        
 7 Heavy rain Broken clouds           Heavy rain
 8 Light rain Partly cloudy           Light rain
 9 Rain showers Passing clouds        NA        
10 Thundershowers Scattered clouds    NA        
11 Thundershowers Passing clouds      NA        
12 Light snow Overcast                NA        
13 Snow Light fog                     Snow      
14 Drizzle Broken clouds              Drizzle   
15 Light rain Fog                     Light rain
16 Cloudy                             NA        
17 Thunderstorms Partly cloudy        NA        
18 Heavy rain More clouds than sun    Heavy rain
19 Partly cloudy                      NA        
20 NA                                 NA        
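
One thing worth checking with this kind of extraction is that each extracted value ends up on the row it came from, because regmatches() only returns values for the elements that matched. Below is a minimal sketch of one way to keep the alignment explicit, using regexpr() and logical indexing on a shortened, hypothetical vector (weather, hit and x are names made up for this example).

```r
# Shortened, hypothetical input: two rows without a target term and one NA
weather <- c("Snow Low clouds", "Thundershowers Partly cloudy",
             "Heavy rain Broken clouds", "Snow Light fog", NA)
pat <- paste(c("Snow", "Drizzle", "Heavy rain", "Light rain"), collapse = "|")

# regexpr() gives the first match per element: -1 if none, NA for NA input
m <- regexpr(pat, weather)
hit <- !is.na(m) & m > -1L

# Start from all NA, then fill only the rows that matched
x <- rep(NA_character_, length(weather))
x[hit] <- regmatches(weather, m)  # one value per matching row, in row order
print(x)
```

Because the assignment goes through the logical index hit, the non-matching rows keep NA and nothing is recycled.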
