How can we detect observations having different outcomes in R?
I have a large dataset in this form, with many other columns. It lists people who worked in one country in 2011 and moved to another in 2012.
Name Work_{2011} Work_{2012} Wage_{2011} Wage_{2012}
Jack US UK 5387 35353
Bill US UK 43534 5343
Emma US FRANCE 34534 53455
Brand US FRANCE 64545 1343
Luigui US FRANCE 15343 3144
Ella US FRANCE 64545 1343
Lucie France SPAIN 84545 1343
Maria France SPAIN 984545 1343
Grec Italy US 4545 1343
For each departure country, I want to keep only the observations that went to the destination with the biggest share. I want:
Name Work_{2011} Work_{2012} Wage_{2011} Wage_{2012}
Emma US FRANCE 34534 53455
Brand US FRANCE 64545 1343
Luigui US FRANCE 15343 3144
Ella US FRANCE 64545 1343
Lucie France SPAIN 84545 1343
Maria France SPAIN 984545 1343
Grec Italy US 4545 1343
I'm not 100% sure this will meet your needs, but perhaps it will be helpful for you. It might help to know more details about your data, including how large your dataset is, how your columns are organized by year, etc.
In this example, you can use dplyr from tidyverse. First, you can group_by Work_2011 (I removed the braces from the column names) and filter to keep the groups where the number of distinct values of Work_2012 is greater than 1. This implies multiple destinations.
Second, you can group_by both Work_2011 and Work_2012 to count the number of people for each origin-destination pair. This count is then used in a second filter, which keeps the most common destination for each origin.
Again, please let me know if this is the direction you were interested in.
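library(dplyr)

df %>%
  # keep only origins with more than one distinct destination
  group_by(Work_2011) %>%
  filter(n_distinct(Work_2012) > 1) %>%
  # count the people in each origin-destination pair
  group_by(Work_2011, Work_2012) %>%
  mutate(numctry = n()) %>%
  # keep the most common destination(s) for each origin
  group_by(Work_2011) %>%
  filter(numctry == max(numctry))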
Output
Name Work_2011 Work_2012 numctry
<chr> <chr> <chr> <int>
1 Emma US FRANCE 4
2 Brand US FRANCE 4
3 Luigui US FRANCE 4
4 Ella US FRANCE 4
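To see what the second grouping step computes before any filtering, it can help to look at the pair counts on their own. The following is a minimal sketch, assuming the sample data from the question (wage columns left out for brevity); the name pair_counts is only illustrative.

```r
library(dplyr)

# Sample data from the question (wage columns omitted for brevity)
df <- data.frame(
  Name = c("Jack", "Bill", "Emma", "Brand", "Luigui",
           "Ella", "Lucie", "Maria", "Grec"),
  Work_2011 = c("US", "US", "US", "US", "US",
                "US", "France", "France", "Italy"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE", "FRANCE",
                "FRANCE", "SPAIN", "SPAIN", "US")
)

# One row per origin-destination pair; the column n holds the number of people
pair_counts <- df %>%
  count(Work_2011, Work_2012)

print(pair_counts)
```

Each row of pair_counts is one origin-destination pair, so the largest n within each Work_2011 value identifies the destination that the filter keeps.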
Edit (1/13/21): Based on the edited question, we can simplify the code further.
Start by calculating the number of people for each origin-destination pair; we'll call this dest_per_cntry. This will be a new column. For Jack and Bill, it will be 2. For Emma, Brand, Luigui, and Ella, it will be 4.
Then, you can group_by to consider the 2011 country only. For each country in the Work_2011 column, keep (or filter) only those rows where dest_per_cntry equals the maximum for that country. Note that if there are ties, all rows with the maximum count will still be kept.
library(tidyverse)

df %>%
  group_by(Work_2011, Work_2012) %>%
  mutate(dest_per_cntry = n()) %>%
  group_by(Work_2011) %>%
  filter(dest_per_cntry == max(dest_per_cntry))
Output
Name Work_2011 Work_2012 Wage_2011 Wage_2012 dest_per_cntry
<chr> <chr> <chr> <int> <int> <int>
1 Emma US FRANCE 34534 53455 4
2 Brand US FRANCE 64545 1343 4
3 Luigui US FRANCE 15343 3144 4
4 Ella US FRANCE 64545 1343 4
5 Lucie France SPAIN 84545 1343 2
6 Maria France SPAIN 984545 1343 2
7 Grec Italy US 4545 1343 1
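The note on ties can be checked with a small example. This is only a sketch on hypothetical data (the toy data frame and its values are made up for illustration); it also swaps the group_by/mutate pair for dplyr's add_count(), which adds the count column in one step, and ends with ungroup() so the result is no longer grouped.

```r
library(dplyr)

# Hypothetical data: departures from "A" split evenly between "X" and "Y"
toy <- data.frame(
  Name = c("p1", "p2", "p3", "p4", "p5"),
  Work_2011 = c("A", "A", "A", "A", "B"),
  Work_2012 = c("X", "X", "Y", "Y", "Z")
)

kept <- toy %>%
  # add the number of people per origin-destination pair as a new column
  add_count(Work_2011, Work_2012, name = "dest_per_cntry") %>%
  # within each origin, keep rows whose pair count equals the maximum
  group_by(Work_2011) %>%
  filter(dest_per_cntry == max(dest_per_cntry)) %>%
  ungroup()

print(kept)
```

Because "X" and "Y" both have two people leaving from "A", rows for both destinations are kept. If only one destination per origin is wanted, a tie-breaking rule would need to be added; which rule is appropriate depends on the data.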
Data
df <- structure(list(Name = c("Jack", "Bill", "Emma", "Brand", "Luigui",
"Ella", "Lucie", "Maria", "Grec"), Work_2011 = c("US", "US",
"US", "US", "US", "US", "France", "France", "Italy"), Work_2012 = c("UK",
"UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE", "SPAIN", "SPAIN",
"US"), Wage_2011 = c(5387L, 43534L, 34534L, 64545L, 15343L, 64545L,
84545L, 984545L, 4545L), Wage_2012 = c(35353L, 5343L, 53455L,
1343L, 3144L, 1343L, 1343L, 1343L, 1343L)), class = "data.frame", row.names = c(NA,
-9L))
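If adding a package dependency is not desirable, the same filter can be approximated in base R with ave(). This is a sketch rather than part of the original answer; the names pair_key, pair_n, max_n and result are illustrative, and the data frame is rebuilt here only so the snippet runs on its own.

```r
# Sample data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  Name = c("Jack", "Bill", "Emma", "Brand", "Luigui",
           "Ella", "Lucie", "Maria", "Grec"),
  Work_2011 = c("US", "US", "US", "US", "US",
                "US", "France", "France", "Italy"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE", "FRANCE",
                "FRANCE", "SPAIN", "SPAIN", "US"),
  Wage_2011 = c(5387L, 43534L, 34534L, 64545L, 15343L,
                64545L, 84545L, 984545L, 4545L),
  Wage_2012 = c(35353L, 5343L, 53455L, 1343L, 3144L,
                1343L, 1343L, 1343L, 1343L)
)

# One factor level per origin-destination pair that actually occurs
pair_key <- interaction(df$Work_2011, df$Work_2012, drop = TRUE)

# Number of people in each row's origin-destination pair
pair_n <- ave(seq_len(nrow(df)), pair_key, FUN = length)

# Largest pair count within each origin country
max_n <- ave(pair_n, df$Work_2011, FUN = max)

# Keep rows belonging to the most common destination(s) of their origin
result <- df[pair_n == max_n, ]
print(result)
```

As with the dplyr version, ties are kept: every row whose pair count equals the maximum for its origin remains in result.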