How to count observations that match the values of a character vector

Question

I have a data frame with numerous observations and different types of variables. Here is a sample of my data frame:

mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product", 
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet", 
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate", 
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2, 
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket", 
"Supermarket", "Supermarket", "Little Store", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Gas Station", 
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA, 
-16L), class = "data.frame")

# of observation	Product	Price in $	Place
1	Pizza	2	Supermarket
2	Cleaning Product	3.5	Supermarket
3	Chocolate	1	Supermarket
4	Fruit	1	Little Store
5	Red Meat	2.5	Supermarket
6	Cleaning Product	3.5	Supermarket
7	Bracelet	3	Little Store
8	Trucker Hat	5	Gas Station
9	Shirt	15	Supermarket
10	Shirt	20	Supermarket
11	Chicken Breast	2.5	Little Store
12	Chocolate	1	Gas Station
13	Cereal	2	Gas Station
14	Fruit	1	Little Store
15	Cleaning Product	3.5	Supermarket
16	Trucker Hat	4	Supermarket

I also have a character vector:

non.food <- c("Cleaning", "Hat", "Shirt", "Bracelet")

I have to eliminate the observations that match any of the words in the vector non.food. For this I use the following code:

library(dplyr)    # filter() and %>%
library(stringr)  # str_detect()
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")
mydf <- mydf %>%
  filter(!str_detect(Product, non.food))

It works pretty well, but I have the impression that I lose more observations than I should. For instance, looking at the sample, I should lose 8 observations. But in reality I end up losing 10 (I don't show it in the sample, since in reality I have 8916 observations, so the sample is just an example of the kind of data frame I face).

So, I would like to first count the number of observations that match any of the words in the vector, to be sure that my code doesn't eliminate more observations than it should. I cannot use commands such as which(mydf$Product == non.food) or sum(mydf$Product == non.food). I could do the inverse of my code and keep only the observations that match my character strings to verify, but it takes more time and creates more data that I don't want. Does anybody have an idea?

Also, if my code is in fact eliminating more observations than it should, does somebody have a solution?

Thank you in advance.

Answer 1

You could add a count variable that flags the rows to be deleted using case_when, and sum it before filtering, e.g.

library(tidyverse)

# Rebuild the sample data from the question
df <- tribble(
  ~"# of observation", ~Product, ~"Price in $", ~Place,
  1, "Pizza", 2, "Supermarket",
  2, "Cleaning Product", 3.5, "Supermarket",
  3, "Chocolate", 1, "Supermarket",
  4, "Fruit", 1, "Little Store",
  5, "Red Meat", 2.5, "Supermarket",
  6, "Cleaning Product", 3.5, "Supermarket",
  7, "Bracelet", 3, "Little Store",
  8, "Trucker Hat", 5, "Gas Station",
  9, "Shirt", 15, "Supermarket",
  10, "Shirt", 20, "Supermarket",
  11, "Chicken Breast", 2.5, "Little Store",
  12, "Chocolate", 1, "Gas Station",
  13, "Cereal", 2, "Gas Station",
  14, "Fruit", 1, "Little Store",
  15, "Cleaning Product", 3.5, "Supermarket",
  16, "Trucker Hat", 4, "Supermarket"
)

# Combine the words into one regular expression: "Cleaning|Hat|Shirt|Bracelet"
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")

mydf <- df %>%
  # Flag each row: 1 if Product matches any non-food word, 0 otherwise
  mutate(count = case_when(
    str_detect(Product, non.food) ~ 1,
    TRUE ~ 0
  )) %>%
  # Total number of flagged rows, computed before they are removed
  mutate(sum_deleted = sum(count)) %>%
  # Keep only the rows that do not match
  filter(!str_detect(Product, non.food))
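
To see which word is responsible for the deletions, one option is to count the matches per word before filtering. The following is a minimal base R sketch, not part of the original answer; the names products, words, per_word and total_matched are made up for illustration, and the Product values are copied from the sample in the question.

```r
# Product column of the sample data frame from the question
products <- c("Pizza", "Cleaning Product", "Chocolate", "Fruit", "Red Meat",
              "Cleaning Product", "Bracelet", "Trucker Hat", "Shirt", "Shirt",
              "Chicken Breast", "Chocolate", "Cereal", "Fruit",
              "Cleaning Product", "Trucker Hat")
words <- c("Cleaning", "Hat", "Shirt", "Bracelet")

# Number of rows matched by each word on its own
# (fixed = TRUE: literal, case-sensitive substring match)
per_word <- vapply(
  words,
  function(w) sum(grepl(w, products, fixed = TRUE)),
  integer(1)
)
print(per_word)

# Number of rows matched by at least one word,
# i.e. the number of rows the filter() step would drop
total_matched <- sum(grepl(paste(words, collapse = "|"), products))
print(total_matched)
```

On the sample, each matching row contains only one of the words, so the per-word counts add up to the total. On the full data, a word with a surprisingly high count would be the first place to look.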

Answer 2

To count matching or non-matching elements, you can use

num_foods <- nrow(mydf[!str_detect(mydf$Product, non.food),])
num_non_foods <- nrow(mydf[str_detect(mydf$Product, non.food),])

You can see that num_foods == 8 and num_non_foods == 8, so your code seems to do what it should.
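
If the full data still loses more rows than expected, it may help to look at which Product values matched, since the pattern also matches substrings inside longer words. The sketch below is an assumption about a possible cause, not something confirmed by the question; it uses base R, and the names is_non_food, strict_pattern and suspicious, as well as the product "Hatch Chile", are invented for illustration.

```r
# Product column of the sample data frame from the question
products <- c("Pizza", "Cleaning Product", "Chocolate", "Fruit", "Red Meat",
              "Cleaning Product", "Bracelet", "Trucker Hat", "Shirt", "Shirt",
              "Chicken Breast", "Chocolate", "Cereal", "Fruit",
              "Cleaning Product", "Trucker Hat")
pattern <- "Cleaning|Hat|Shirt|Bracelet"

# Logical vector of matches; summing it counts the TRUE values,
# so no filtered copy of the data is created
is_non_food <- grepl(pattern, products)
n_non_food <- sum(is_non_food)

# Frequency of each distinct Product value that matched
print(table(products[is_non_food]))

# Hypothetical stricter pattern: match the words only as whole words
strict_pattern <- "\\b(Cleaning|Hat|Shirt|Bracelet)\\b"

# Values matched by the loose pattern but not by the strict one
suspicious <- unique(
  products[is_non_food & !grepl(strict_pattern, products, perl = TRUE)]
)
print(suspicious)

# Invented example of a substring match: "Hat" inside "Hatch"
grepl("Hat", "Hatch Chile")                     # loose pattern matches
grepl("\\bHat\\b", "Hatch Chile", perl = TRUE)  # whole-word pattern does not
```

On the sample, suspicious is empty, which agrees with the count of 8. Any values it returns on the full data would be candidates for the extra rows being removed.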

Data

mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product", 
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet", 
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate", 
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2, 
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket", 
"Supermarket", "Supermarket", "Little Store", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Gas Station", 
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA, 
-16L), class = "data.frame")

Answer 1
1 2022-05-03 08:31:40

Answer 2
1 Accepted 2022-05-03 08:35:31
