How to remove 0 or more tokens (words), where they might come up sequentially or with data in between?
How can I extract only the country names from a variable such as the following?
tibble::tribble(
~country,
'{"United States"}',
'{NULL}',
'{NULL,NULL}',
'{"United States",NULL,Netherlands}',
'{Germany}',
'{Canada}',
'{NULL,NULL}',
'{Chile,"United States"}',
'{NULL,NULL,NULL}',
'{NULL,China, NULL}',
)
NULL can come up sequentially or not, and up to 15 times in a single observation.
Countries with multiple words, such as "United States", come up quoted; all others are unquoted.
It is fairly easy to do in multiple passes, such as removing all NULLs, then removing the duplicated commas, and then the curly braces, but I was aiming for a more efficient way of achieving something like the following:
tibble::tribble(
~country,
'United States',
NA,
NA,
'United States,Netherlands',
'Germany',
'Canada',
NA,
'Chile,United States',
NA,
'China',
)
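For reference, the multi-pass approach mentioned above might look roughly like the sketch below. It uses base R only; the names country, x and expected are illustrative, and it assumes that no country name itself contains the text NULL.

```r
# Input values copied from the example above (plain character vector).
country <- c('{"United States"}', '{NULL}', '{NULL,NULL}',
             '{"United States",NULL,Netherlands}', '{Germany}', '{Canada}',
             '{NULL,NULL}', '{Chile,"United States"}', '{NULL,NULL,NULL}',
             '{NULL,China, NULL}')

x <- gsub("NULL", "", country, fixed = TRUE)  # pass 1: drop every NULL token
x <- gsub('[{}"]', "", x)                     # pass 2: drop braces and quotes
x <- trimws(gsub("\\s*,\\s*", ",", x))        # pass 3: tidy whitespace around commas
x <- gsub(",+", ",", x)                       # pass 4: collapse duplicated commas
x <- gsub("^,|,$", "", x)                     # pass 5: strip leading/trailing commas
x[!nzchar(x)] <- NA                           # pass 6: empty strings become NA

expected <- c("United States", NA, NA, "United States,Netherlands",
              "Germany", "Canada", NA, "Chile,United States", NA, "China")
```

Each pass is simple on its own, but the string is scanned six times, which is the inefficiency the question is trying to avoid.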
A bit brute-force with gsubs, but it works.
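# Inner gsub: drop every NULL (plus a trailing comma, if any), quotes and braces.
# Outer gsub: after trimming whitespace, strip a comma left at either end.
dat$out <- gsub("^,|,$", "",
                trimws(gsub('NULL,?|["{}]', '', dat$country)))
dat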
# # A tibble: 10 x 2
# country out
# <chr> <chr>
# 1 "{\"United States\"}" "United States"
# 2 "{NULL}" ""
# 3 "{NULL,NULL}" ""
# 4 "{\"United States\",NULL,Netherlands}" "United States,Netherlands"
# 5 "{Germany}" "Germany"
# 6 "{Canada}" "Canada"
# 7 "{NULL,NULL}" ""
# 8 "{Chile,\"United States\"}" "Chile,United States"
# 9 "{NULL,NULL,NULL}" ""
# 10 "{NULL,China, NULL}" "China"
From here, you can replace the empty strings ("") with NA:
dat$out[!nzchar(dat$out)] <- NA
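If it helps, the steps above can be bundled into one helper. This is only a sketch: the function name clean_countries and the vectors country and expected are made up for illustration, and the regular expressions are the same ones used above.

```r
# Hypothetical helper wrapping the two gsub() calls and the NA replacement.
clean_countries <- function(x) {
  # Drop NULL tokens (with an optional trailing comma), quotes and braces.
  out <- gsub('NULL,?|["{}]', "", x)
  # Trim surrounding whitespace, then strip a comma left at either end.
  out <- gsub("^,|,$", "", trimws(out))
  # Rows that contained only NULLs are now empty strings; mark them as NA.
  out[!nzchar(out)] <- NA_character_
  out
}

# Same values as the example data, as a plain character vector.
country <- c('{"United States"}', '{NULL}', '{NULL,NULL}',
             '{"United States",NULL,Netherlands}', '{Germany}', '{Canada}',
             '{NULL,NULL}', '{Chile,"United States"}', '{NULL,NULL,NULL}',
             '{NULL,China, NULL}')

expected <- c("United States", NA, NA, "United States,Netherlands",
              "Germany", "Canada", NA, "Chile,United States", NA, "China")

result <- clean_countries(country)
```

With the tibble from the question this would be called as dat$out <- clean_countries(dat$country). Note that whitespace in the middle of a value is left untouched, so an input with spaces after the commas between countries would keep those spaces in the result.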