
R - Nested for loops with multiple conditions using grepl and identical

I am trying to mark each fruit with a "1" if it is supplied by only one country, or a "0" otherwise.

I have two tables of data:

Table 1:

Fruit - Each row holds a different fruit, e.g. Apple, Banana, Peach, etc.

Country - Each row holds the fruit's main country of supply as a 2-letter ISO code, e.g. US, UK, NO, etc.

SourceUnique - This is the column I want to fill with "1" in rows where the fruit is supplied by only one country and "0" otherwise.

Table 2:

Country - Each row holds the supplier's country as a 2-letter ISO code, as in Table 1.

Supplies - Each row holds a list of the fruits that the supplier delivers, e.g. row 1 is "Apple, Banana", row 2 is "Pineapple, Peach, Pear, Apple", etc.

Both tables are imported from CSV files. My code is as follows:

Table1$SourceUnique=rep(1,length(Table1$Country))

for(i in 1:length(Table1$Country)){
  for(k in 1:length(Table2$Country)){
    if(grepl(Table1$Fruit[i], Table2$Supplies[k])==TRUE && identical(Table1$Country[i], Table2$Country[k])==FALSE){
      Table1$SourceUnique[i]=0
    }
  }
}

I get no errors, but the SourceUnique column does not fill correctly: I get 1s and 0s, some correct and others not. After a lot of searching and experimenting I have accepted that I am stuck and need help, so any advice or solutions would be fantastic.

Thanks.

Edit for more info:

Some fruits have many suppliers from the same country, and Table2$Supplies is messy, with other words mixed in.

Example data:

Table1$Country <- c("UK","US","NO")
Table1$Fruit <- c("Apple","Banana","Pear")

Table2$Country <- c("UK","US","UK")
Table2$Supplies <- c("Apple,Pear","Banana,Pear","Banana and Apple")

Edit again:

grepl and identical both work when I run them separately with fixed index numbers. I can't understand why they do not work in my loops. In theory, my code loops through "Supplies", checks the two criteria and sets a 0 when both are satisfied. It then moves on to the next i ("fruit") and repeats. Maybe the && is my problem? As far as I know, it is correct.

An Excel solution would also work for my purpose, but I am not experienced enough with Excel to know where to start.
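One possible cause of the behaviour described above, offered as a guess rather than a confirmed diagnosis: if the CSV columns were read in as factors (the default for read.csv() before R 4.0.0), identical() compares the level sets as well as the values, so the same country code in two tables can compare as different. The sketch below rebuilds the example data as factors to show this, then runs the same loop logic on character vectors. The helper names (fruit_names, country1, country2, supplies) are made up for illustration.

```r
# Assumption: the columns are factors, as read.csv() produced by default
# before R 4.0.0 (stringsAsFactors = TRUE).
Table1 <- data.frame(Country = c("UK", "US", "NO"),
                     Fruit = c("Apple", "Banana", "Pear"),
                     stringsAsFactors = TRUE)
Table2 <- data.frame(Country = c("UK", "US", "UK"),
                     Supplies = c("Apple,Pear", "Banana,Pear", "Banana and Apple"),
                     stringsAsFactors = TRUE)

# Both values are "UK", but the two factors have different level sets,
# so identical() reports them as different.
same_as_factor <- identical(Table1$Country[1], Table2$Country[1])
same_as_text <- identical(as.character(Table1$Country[1]),
                          as.character(Table2$Country[1]))

# Same loop logic as in the question, run on plain character vectors
# (hypothetical helper names).
fruit_names <- as.character(Table1$Fruit)
country1 <- as.character(Table1$Country)
country2 <- as.character(Table2$Country)
supplies <- as.character(Table2$Supplies)

SourceUnique <- rep(1, length(fruit_names))
for (i in seq_along(fruit_names)) {
  for (k in seq_along(supplies)) {
    # mark 0 when a supplier from a different country lists this fruit
    if (grepl(fruit_names[i], supplies[k], fixed = TRUE) &&
        country1[i] != country2[k]) {
      SourceUnique[i] <- 0
    }
  }
}
Table1$SourceUnique <- SourceUnique
```

If the real columns are already character vectors, this explanation does not apply and the cause lies elsewhere, for example in how the fruit names match the messy Supplies text.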

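Perhaps you can simplify the problem by counting, for each fruit, the distinct countries in Table 2 whose suppliers list it:

Table1$SourceUnique <- sapply(as.character(Table1$Fruit), function(i) {
  # rows of Table 2 whose Supplies text mentions this fruit
  hits <- grepl(i, Table2$Supplies, fixed = TRUE)
  # 1 if all matching suppliers share a single country, 0 otherwise
  as.integer(length(unique(as.character(Table2$Country[hits]))) == 1)
}, USE.NAMES = FALSE)

This gives you 1 for fruits that are supplied from exactly one country in Table 2 and 0 otherwise.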

Assuming it's possible to construct a regular expression that extracts the fruit names from the Supplies column in your real data, here's a data manipulation approach to the problem.

# prepare your sample data
fruit <- suppliers <- list()

fruit$Fruit <- c("Apple","Banana","Pear")
fruit$Country <- c("UK","US","NO")
fruit <- data.frame(fruit)

suppliers$Country <- c("UK","US","UK")
suppliers$Supplies <- c("Apple,Pear","Banana,Pear","Banana and Apple")
suppliers <- data.frame(suppliers)

library(dplyr)
library(tidyr)  # version 0.5.0 or later

# data manipulation for the desired result
suppliers %>%
    # split values of Supplies into a new row at each occurrence of sep
    separate_rows(Supplies, sep = "\\s*(and|,)\\s*") %>%
    group_by(Supplies) %>%
    # summarize which fruit are supplied from only one country
    summarize(SourceUnique = as.numeric(n_distinct(Country) == 1)) %>%
    left_join(fruit, ., by = c("Fruit" = "Supplies"))
#        Fruit Country SourceUnique
#     1  Apple      UK            1
#     2 Banana      US            0
#     3   Pear      NO            0

If speed is a concern, the same logic could likely be expressed with data.table, which performs very well on large data.
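As a rough sketch of what that data.table version might look like (not benchmarked, and reusing the sample data and separator pattern from above): split Supplies into one row per fruit, count the distinct countries per fruit with uniqueN(), and merge the result back onto the fruit table. The intermediate names long and counts are made up for illustration.

```r
library(data.table)

fruit <- data.table(Fruit = c("Apple", "Banana", "Pear"),
                    Country = c("UK", "US", "NO"))
suppliers <- data.table(Country = c("UK", "US", "UK"),
                        Supplies = c("Apple,Pear", "Banana,Pear", "Banana and Apple"))

# One row per (supplier country, fruit) pair; the separator pattern is the
# same assumption as in the tidyr version and may need adapting to real data.
long <- suppliers[, .(Fruit = unlist(strsplit(Supplies, "\\s*(and|,)\\s*", perl = TRUE))),
                  by = Country]

# 1 if a fruit is supplied from exactly one distinct country, 0 otherwise
counts <- long[, .(SourceUnique = as.integer(uniqueN(Country) == 1)), by = Fruit]

# Attach the flag to the fruit table; fruits with no supplier would get NA
result <- merge(fruit, counts, by = "Fruit", all.x = TRUE)
```

Note that merge() returns the rows sorted by Fruit, so the row order may differ from the input table on real data.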

