How do I use num_range to select rows which all contain the same first 4 digits in one specific column? (hoping to use dplyr/tidyverse)
My question is best asked in 2 parts:
I am dealing with a dataset that looks at forest product usage across many countries. Each row represents a household from any one of these countries (about 30 total). Each country has a code (4 digits), but in the dataset there is no column for country code. The way you can deduce which households came from which country is by using the household ID ("ghousecode"). ghousecode is a 7-digit code, the first 4 digits being the country code. For example, if Bolivia were country code 3024, then a household in Bolivia could be 3024105 or 3024999...
I want code that selects all the entries for a specific country. I am using the tidyverse, so I thought of using select() and num_range(), but it hasn't worked. I don't get an error message, but when I look at my output I can tell it hasn't worked. Here is my current code:
#forest_use_tibble is a tibble with observations on forest usage from many countries
#I selected a subset of the original file's variables.
forest_use_simpler <- select(forest_use_tibble, ghousecode, year, product, income, amount, unit)
#take Bolivia, whose country ID is 3024. This means that each ghousecode that begins with
#3024 is from Bolivia.
#but each ghousecode is 3024xxx with three other numbers after it.
x = 3024
Bolivia <- select(forest_use_simpler, num_range("x", 001:999), everything())
#my goal: a new tibble/dataframe that has only the entries from Bolivia
#there is no separate column for country ID, unfortunately.
Any ideas?
Second part of the question: Is there a way to query just one of the columns (i.e. variables, in this case ghousecode) for the num_range? The way I have it above strikes me as though it would search all variables in forest_use_simpler, so there is a chance that it may include another country's household if the digits 3024 appeared somewhere other than ghousecode.
Thank you!
(Note: I have also tried putting in 3024 directly where x is, to no avail. Thanks again for all help.)
If the ghousecode is consistently formatted with 7 digits, how about something like this?
library(tidyverse)
df <-
  tibble(
    ghousecode = c(2039434, 3024105),
    year = c(2019, 2019)
  )
df %>%
  mutate(country_code = floor(ghousecode / 1000)) %>%
  filter(country_code == 3024)
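The same filter can also be written without the helper column. What follows is only a minimal sketch, assuming ghousecode is stored as a whole number with exactly 7 digits; the `households` tibble and its values are made up for illustration and stand in for forest_use_simpler. Both variants look at ghousecode only, so a 3024 appearing in any other column cannot affect the result.

```r
library(dplyr)

# Hypothetical example data; the real data would be forest_use_simpler.
households <- tibble(
  ghousecode = c(2039434, 3024105, 3024999),
  year       = c(2019, 2019, 2020)
)

# Integer division by 1000 drops the last three digits,
# leaving the 4-digit country code.
bolivia_num <- households %>%
  filter(ghousecode %/% 1000 == 3024)

# Text-based alternative: keep rows whose code starts with "3024".
# format() is used so that large round numbers are not turned into
# scientific notation when converted to text.
bolivia_chr <- households %>%
  filter(startsWith(format(ghousecode, scientific = FALSE, trim = TRUE), "3024"))
```

If ghousecode were stored as text rather than as a number, the `startsWith()` variant could be applied to the column directly.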
select chooses columns, while filter chooses rows.
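To make the difference concrete, here is a small sketch on made-up data (the `demo` tibble and its x1/x2 columns are hypothetical and not part of the original dataset). num_range("x", 1:2) matches column names such as x1 and x2, not values inside a column. That is probably why the call in the question ran without an error but returned the data unchanged: no column is named x1 through x999, so only everything() contributed to the selection.

```r
library(dplyr)

# Hypothetical data with numbered column names, only to show
# what num_range() actually matches.
demo <- tibble(
  ghousecode = c(2039434, 3024105),
  x1 = c(10, 20),
  x2 = c(30, 40)
)

# select() + num_range(): picks COLUMNS whose names are "x" plus a number.
cols <- demo %>% select(num_range("x", 1:2))

# filter(): picks ROWS whose ghousecode value satisfies the condition.
rows <- demo %>% filter(ghousecode %/% 1000 == 3024)
```

Here `cols` keeps both rows but only the x1 and x2 columns, while `rows` keeps all columns but only the household whose code begins with 3024.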