R - Nested for loops with multiple conditions using grepl and identical
I am trying to mark all fruit with a "1" if it is only supplied by one country or a "0" otherwise.
I have two tables of data:
Table 1:
Fruit - Each row has a different fruit in it, e.g. Apple, Banana, Peach, etc.
Country - Each row has the fruit's main country of supply in two-letter ISO format, e.g. US, UK, NO, etc.
SourceUnique - This is the column I want to fill with "1" in rows with fruit that are only supplied by one country and "0" otherwise.
Table 2:
Country - Each row has the supplier's country in two-letter ISO format, like the last table.
Supplies - Each row has a list of fruits that the supplier delivers, e.g. row 1 is "Apple, Banana", row 2 is "Pineapple, Peach, Pear, Apple", etc.
Both tables are imported from CSV files; my code is as follows:
Table1$SourceUnique=rep(1,length(Table1$Country))
for(i in 1:length(Table1$Country)){
  for(k in 1:length(Table2$Country)){
    if(grepl(Table1$Fruit[i], Table2$Supplies[k])==TRUE && identical(Table1$Country[i], Table2$Country[k])==FALSE){
      Table1$SourceUnique[i]=0
    }
  }
}
I get no errors but the SourceUnique column does not fill correctly. I get 1's and 0's with some correct and others not. After lots of searching and messing around I have accepted that I have no idea and need help, so any advice or solutions would be fantastic.
Thanks.
Edit for more info:
Some fruits have many suppliers from the same country, and Table2$Supplies is messy, with other words mixed in.
Example data:
Table1$Country <- c("UK","US","NO")
Table1$Fruit <- c("Apple","Banana","Pear")
Table2$Country <- c("UK","US","UK")
Table2$Supplies <- c("Apple,Pear","Banana,Pear","Banana and Apple")
Edit again:
grepl and identical work in my code when I run them separately with numbers. I can't understand why they do not work in my loops... In theory my code loops through "Supplies", checks the two criteria and sets a 0 when both criteria are satisfied. It then moves on to the next i ("fruit") and repeats. Maybe the && is my problem? It seems correct from my knowledge.
An Excel solution would also work for my purpose, but I am not experienced enough with Excel to know where to start with that.
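Two things may be worth checking in the loop above; this is a sketch of possible causes, not a confirmed diagnosis, since it depends on how the CSV files were read. First, grepl() matches substrings, so "Apple" is also found inside "Pineapple". Second, if the Country columns were imported as factors (an assumption; it was the read.csv default before R 4.0), identical() compares the level sets as well as the label, so two "UK" values from different tables can compare as not identical. The && itself is fine here, since both sides are single values. The values below are made up for illustration:

```r
# pitfall 1: grepl() matches substrings unless the pattern is anchored
substring_hit <- grepl("Apple", "Pineapple, Peach")                     # TRUE
word_hit      <- grepl("\\bApple\\b", "Pineapple, Peach", perl = TRUE)  # FALSE

# pitfall 2: identical() on factors also compares their levels
a <- factor("UK", levels = c("NO", "UK", "US"))  # as if taken from Table1$Country
b <- factor("UK", levels = c("UK", "US"))        # as if taken from Table2$Country
same_factor <- identical(a, b)                               # FALSE
same_text   <- identical(as.character(a), as.character(b))   # TRUE
```

If the columns do turn out to be factors, comparing as.character() values (or reading the files with stringsAsFactors = FALSE) would sidestep the second issue.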
Perhaps you can simplify the problem by counting the occurrences of each fruit in table 2 for each country:
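# a bare expression inside for() is neither printed nor stored, so collect results with sapply()
Table1$SourceUnique <- sapply(as.character(Table1$Fruit), function(i) {
  # cross-tabulate match (FALSE/TRUE) by country; element [2] of the row sums
  # is the number of supplier rows that mention fruit i
  as.integer(rowSums(table(grepl(i, Table2$Supplies), Table2$Country))[2] == 1)
}, USE.NAMES = FALSE)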
This gives you 1 for those fruits that occur in exactly one row of table 2 and 0 otherwise, so a fruit with several suppliers from the same country is also marked 0.
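Because the question's edit mentions that some fruits have many suppliers from the same country, a variant that counts distinct countries rather than matching rows may be closer to what is wanted. Here is a minimal base R sketch under two assumptions of mine: the data look like the example in the question, and fruit names appear as whole words in Supplies (the \\b word boundaries are there so that "Apple" is not matched inside "Pineapple").

```r
# sample data from the question, kept as character columns
Table1 <- data.frame(Fruit = c("Apple", "Banana", "Pear"),
                     Country = c("UK", "US", "NO"),
                     stringsAsFactors = FALSE)
Table2 <- data.frame(Country = c("UK", "US", "UK"),
                     Supplies = c("Apple,Pear", "Banana,Pear", "Banana and Apple"),
                     stringsAsFactors = FALSE)

Table1$SourceUnique <- sapply(Table1$Fruit, function(f) {
  # rows of Table2 whose Supplies mention the fruit as a whole word
  hit <- grepl(paste0("\\b", f, "\\b"), Table2$Supplies, perl = TRUE)
  # 1 if those rows come from exactly one distinct country, else 0
  as.integer(length(unique(Table2$Country[hit])) == 1)
}, USE.NAMES = FALSE)
```

With the example data this marks Apple as 1 (both of its suppliers are in the UK) and Banana and Pear as 0. A fruit with no supplier at all would also get 0 under this definition.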
Assuming it's possible to construct a regular expression to extract the values of "fruit" from the Supplies column in your real data, here's a data manipulation approach to the problem.
# prepare your sample data
fruit <- suppliers <- list()
fruit$Fruit <- c("Apple","Banana","Pear")
fruit$Country <- c("UK","US","NO")
fruit <- data.frame(fruit)
suppliers$Country <- c("UK","US","UK")
suppliers$Supplies <- c("Apple,Pear","Banana,Pear","Banana and Apple")
suppliers <- data.frame(suppliers)
library(dplyr)
library(tidyr) # version 0.5.0 or later
# data manipulation for the desired result
suppliers %>%
  # split values of Supplies into a new row at each occurrence of sep
  separate_rows(Supplies, sep = "\\s*(and|,)\\s*") %>%
  group_by(Supplies) %>%
  # summarize which fruit are supplied from only one country
  summarize(SourceUnique = as.numeric(n_distinct(Country) == 1)) %>%
  left_join(fruit, ., by = c("Fruit" = "Supplies"))
#    Fruit Country SourceUnique
# 1  Apple      UK            1
# 2 Banana      US            0
# 3   Pear      NO            0
If speed is desired, the same could likely be formulated using data.tables, which provide excellent performance for working with large data.
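For what it's worth, a data.table formulation might look roughly like the following. It is only a sketch under the same assumption as above, namely that splitting Supplies on "," or "and" recovers clean fruit names. The intermediate names long, flags and result are made up for illustration.

```r
library(data.table)

# sample data from the question, as data.tables (columns stay character)
fruit <- data.table(Fruit = c("Apple", "Banana", "Pear"),
                    Country = c("UK", "US", "NO"))
suppliers <- data.table(Country = c("UK", "US", "UK"),
                        Supplies = c("Apple,Pear", "Banana,Pear", "Banana and Apple"))

# one row per (Country, Fruit): split each Supplies string on "," or "and"
long <- suppliers[, .(Fruit = unlist(strsplit(Supplies, "\\s*(and|,)\\s*"))),
                  by = Country]

# 1 if the fruit appears under exactly one distinct country, else 0
flags <- long[, .(SourceUnique = as.integer(uniqueN(Country) == 1L)), by = Fruit]

# keep every row of fruit and attach the flag (NA if a fruit has no supplier)
result <- merge(fruit, flags, by = "Fruit", all.x = TRUE)
```

Note that the separator pattern would also split inside any word containing "and" (e.g. "Mandarin"), so real data may need a stricter pattern such as one requiring whitespace around "and".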