How to count observations matching the values of a vector of characters
I have a dataframe with numerous observations and different types of variables. Here's a sample of my dataframe:
mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product",
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet",
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate",
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2,
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket",
"Supermarket", "Supermarket", "Little Store", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Gas Station",
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA,
-16L), class = "data.frame")
# of observation | Product | Price in $ | Place |
---|---|---|---|
1 | Pizza | 2 | Supermarket |
2 | Cleaning Product | 3.5 | Supermarket |
3 | Chocolate | 1 | Supermarket |
4 | Fruit | 1 | Little Store |
5 | Red Meat | 2.5 | Supermarket |
6 | Cleaning Product | 3.5 | Supermarket |
7 | Bracelet | 3 | Little Store |
8 | Trucker Hat | 5 | Gas Station |
9 | Shirt | 15 | Supermarket |
10 | Shirt | 20 | Supermarket |
11 | Chicken Breast | 2.5 | Little Store |
12 | Chocolate | 1 | Gas Station |
13 | Cereal | 2 | Gas Station |
14 | Fruit | 1 | Little Store |
15 | Cleaning Product | 3.5 | Supermarket |
16 | Trucker Hat | 4 | Supermarket |
I also have a vector of characters:
non.food <- c("Cleaning", "Hat", "Shirt", "Bracelet")
I have to eliminate observations that match any of the words from the vector non.food. For this I use the following code:
library(dplyr)
library(stringr)
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = '|')
mydf <- mydf %>%
  filter(!str_detect(Product, non.food))
It works pretty well, but I have the impression that I lose more observations than I should. For instance, looking at the sample, I should lose 8 observations, but in reality I end up losing 10. (I can't show this with the sample: my real data has 8916 observations, and the sample only illustrates the kind of dataframe I'm working with.)
So I would like to first count the number of observations that match any of the words in the vector, to be sure that my code doesn't eliminate more observations than it should. I cannot use commands such as which(mydf$Product == non.food) or sum(mydf$Product == non.food). I could invert my code and keep only the observations that match my strings to verify, but that takes more time and creates more data that I don't want. Does anybody have an idea?
Also, if my code is in fact eliminating more observations than it should, does somebody have a solution?
Thank you in advance.
You could add a count variable that counts the number of deleted rows using case_when, e.g.:
library(tidyverse)
df <- tribble(
~"# of observation", ~Product, ~"Price in $", ~Place,
1, "Pizza", 2, "Supermarket",
2, "Cleaning Product", 3.5, "Supermarket",
3, "Chocolate", 1, "Supermarket",
4, "Fruit", 1, "Little Store",
5, "Red Meat", 2.5, "Supermarket",
6, "Cleaning Product", 3.5, "Supermarket",
7, "Bracelet", 3, "Little Store",
8, "Trucker Hat", 5, "Gas Station",
9, "Shirt", 15, "Supermarket",
10, "Shirt", 20, "Supermarket",
11, "Chicken Breast", 2.5, "Little Store",
12, "Chocolate", 1, "Gas Station",
13, "Cereal", 2, "Gas Station",
14, "Fruit", 1, "Little Store",
15, "Cleaning Product", 3.5, "Supermarket",
16, "Trucker Hat", 4, "Supermarket"
)
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")
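mydf <- df %>%
  # flag rows whose Product matches any non-food keyword
  mutate(count = case_when(
    str_detect(Product, non.food) ~ 1,
    TRUE ~ 0
  )) %>%
  # total number of flagged rows, stored before they are removed
  mutate(sum_deleted = sum(count)) %>%
  filter(!str_detect(Product, non.food))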
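If the goal is only to check the count before filtering, a shorter route is to sum the logical vector that str_detect returns, which avoids adding columns or creating a second dataframe. The sketch below is one way to do that, not the asker's method; the names product, keywords, n_matched and per_keyword are made up for illustration, and product is just the Product column of the sample typed out again.

```r
library(stringr)

# Stand-in for mydf$Product from the sample in the question
product <- c("Pizza", "Cleaning Product", "Chocolate", "Fruit", "Red Meat",
             "Cleaning Product", "Bracelet", "Trucker Hat", "Shirt", "Shirt",
             "Chicken Breast", "Chocolate", "Cereal", "Fruit",
             "Cleaning Product", "Trucker Hat")

keywords <- c("Cleaning", "Hat", "Shirt", "Bracelet")
pattern <- paste(keywords, collapse = "|")

# Number of rows the filter would drop (TRUE counts as 1)
n_matched <- sum(str_detect(product, pattern))

# Same count broken down by keyword, to see which word matches more than expected
per_keyword <- vapply(keywords,
                      function(k) sum(str_detect(product, k)),
                      integer(1))
```

On the sample, n_matched is 8. On the full data, comparing per_keyword with what you expect for each word should point to the keyword responsible for any extra matches.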
To count matching or non-matching elements, you can use
library(stringr)
num_foods <- nrow(mydf[!str_detect(mydf$Product, non.food),])     # rows that are kept
num_non_foods <- nrow(mydf[str_detect(mydf$Product, non.food),])  # rows that are dropped
You can see that num_foods == 8 and num_non_foods == 8, so your code seems to do what it should.
Data
mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product",
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet",
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate",
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2,
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket",
"Supermarket", "Supermarket", "Little Store", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Gas Station",
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA,
-16L), class = "data.frame")
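The sample gives the expected count, so if the full data loses more rows than expected, two behaviours of the original filter are worth checking. First, str_detect returns NA for a missing Product, and filter drops rows whose condition is NA. Second, the pattern matches substrings, so "Hat" also matches a product name that merely contains those letters. The sketch below uses hypothetical rows (the NA and "Hatch Chile" are invented, not taken from the question's data) and is only one possible fix, assuming whole-word matching is what is wanted.

```r
library(dplyr)
library(stringr)

# Hypothetical rows for illustration only
toy <- data.frame(Product = c("Pizza", "Shirt", NA, "Hatch Chile", "Trucker Hat"))
pattern <- "Cleaning|Hat|Shirt|Bracelet"

# Same filter as in the question: the NA row and "Hatch Chile" are dropped too
kept_original <- toy %>%
  filter(!str_detect(Product, pattern))

# \\b word boundaries stop "Hat" from matching inside "Hatch";
# coalesce() turns an NA result into FALSE so rows with a missing Product are kept
strict <- "\\b(Cleaning|Hat|Shirt|Bracelet)\\b"
kept_strict <- toy %>%
  filter(!coalesce(str_detect(Product, strict), FALSE))
```

Here kept_original retains only "Pizza", while kept_strict retains "Pizza", the NA row and "Hatch Chile". Whether missing products should be kept or dropped depends on your data, so treat the coalesce() step as optional.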