
How to remove 0 or more tokens (words), where they might come up sequentially or with data in between?

How can I extract only the country names from a variable such as the following?

tibble::tribble(
    ~country, 
    '{"United States"}', 
    '{NULL}', 
    '{NULL,NULL}', 
    '{"United States",NULL,Netherlands}', 
    '{Germany}', 
    '{Canada}', 
    '{NULL,NULL}', 
    '{Chile,"United States"}', 
    '{NULL,NULL,NULL}', 
    '{NULL,China, NULL}', 
)

  • NULL can appear consecutively or not, and up to 15 times in a single observation.

  • Countries with multiple words, such as "United States", come up quoted; all others are unquoted.

It is fairly easy to do in multiple passes, such as removing all NULLs, then removing the duplicated commas, and then the braces, but I was aiming for a more efficient way of getting to something like the following:

tibble::tribble(
    ~country, 
    'United States', 
    NA, 
    NA, 
    'United States,Netherlands', 
    'Germany', 
    'Canada', 
    NA, 
    'Chile,United States', 
    NA, 
    'China', 
)

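A bit brute-force with nested gsub calls, but it works. The inner call removes every NULL (with its trailing comma, if any) along with the quotes and braces; the outer call strips any comma left at the start or end.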

dat$out <- gsub("^,|,$", "",
                trimws(gsub('NULL,?|["{}]', '', dat$country)))
dat
# # A tibble: 10 x 2
#    country                                out                        
#    <chr>                                  <chr>                      
#  1 "{\"United States\"}"                  "United States"            
#  2 "{NULL}"                               ""                         
#  3 "{NULL,NULL}"                          ""                         
#  4 "{\"United States\",NULL,Netherlands}" "United States,Netherlands"
#  5 "{Germany}"                            "Germany"                  
#  6 "{Canada}"                             "Canada"                   
#  7 "{NULL,NULL}"                          ""                         
#  8 "{Chile,\"United States\"}"            "Chile,United States"      
#  9 "{NULL,NULL,NULL}"                     ""                         
# 10 "{NULL,China, NULL}"                   "China"                    

From here, you can replace the empty strings with NA:

dat$out[!nzchar(dat$out)] <- NA
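
As an alternative to the nested gsub calls, here is a minimal base R sketch that splits each entry into tokens, drops the NULLs, and rejoins what is left, returning NA directly when nothing remains. The helper name extract_countries is hypothetical (it is not part of the answer above), and the sketch assumes no country name contains a comma.

```r
# Hypothetical helper, not from the original answer.
# Assumes country names never contain a comma themselves.
extract_countries <- function(x) {
  # Strip braces and double quotes, then split each entry on commas.
  tokens <- strsplit(gsub('["{}]', "", x), ",", fixed = TRUE)
  vapply(tokens, function(tok) {
    tok <- trimws(tok)                        # handles entries like ", NULL"
    tok <- tok[nzchar(tok) & tok != "NULL"]   # drop empty and NULL tokens
    if (length(tok) == 0) NA_character_ else paste(tok, collapse = ",")
  }, character(1))
}

# A few rows from the question's data
country <- c('{"United States"}',
             '{NULL}',
             '{"United States",NULL,Netherlands}',
             '{NULL,China, NULL}')
out <- extract_countries(country)
print(out)
```

With the question's tibble this could be used as dat$out <- extract_countries(dat$country). It does more work per element than the regex version, so the gsub approach is likely the better fit for large vectors.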
