How can we detect observations with different outcomes in R?

Question

I have a large dataset in this form, with many more columns. It lists people who worked in one country in 2011 and moved to another one in 2012.

Name  Work_{2011}     Work_{2012}     Wage_{2011}    Wage_{2012} 
  
Jack     US              UK            5387           35353
Bill     US              UK            43534          5343
Emma     US              FRANCE        34534          53455
Brand    US              FRANCE        64545          1343
Luigui   US              FRANCE        15343          3144
Ella     US              FRANCE        64545          1343       
Lucie    France          SPAIN         84545          1343
Maria    France          SPAIN         984545         1343
Grec     Italy           US            4545           1343

For each country of departure, I want to keep only the observations going to the destination with the largest share of movers. The result I want is:

Name  Work_{2011}     Work_{2012}     Wage_{2011}    Wage_{2012} 
  
Emma     US              FRANCE        34534          53455
Brand    US              FRANCE        64545          1343
Luigui   US              FRANCE        15343          3144
Ella     US              FRANCE        64545          1343       
Lucie    France          SPAIN         84545          1343
Maria    France          SPAIN         984545         1343
Grec     Italy           US            4545           1343

Answer 1

I'm not 100% sure this will meet your needs, but perhaps it will be helpful. It might help to know more details about your data, including how large your dataset is, how your columns are organized by year, etc.

In this example, you can use dplyr from the tidyverse. First, group_by Work_2011 (I removed the braces from the column names) and filter to keep only the groups where the number of distinct values of Work_2012 is greater than 1. This implies multiple destinations.

Second, group_by both Work_2011 and Work_2012 to count the number of rows (people) for each origin-destination pair. This count, stored in numctry, is used in a second filter that keeps, for each Work_2011 country, the destination with the highest count.

Again, please let me know if this is the direction you were interested in.

library(dplyr)

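df %>%
  # keep only origins with more than one distinct destination
  group_by(Work_2011) %>%
  filter(n_distinct(Work_2012) > 1) %>%
  # count rows for each origin-destination pair
  group_by(Work_2011, Work_2012) %>%
  mutate(numctry = n()) %>%
  # within each origin, keep the destination(s) with the highest count
  group_by(Work_2011) %>%
  filter(numctry == max(numctry))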

Output

  Name   Work_2011 Work_2012 numctry
  <chr>  <chr>     <chr>       <int>
1 Emma   US        FRANCE          4
2 Brand  US        FRANCE          4
3 Luigui US        FRANCE          4
4 Ella   US        FRANCE          4
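
France and Italy are missing from this first result because each has only one distinct destination, so the n_distinct filter drops them. A minimal sketch for checking this, assuming the same origins and destinations as the Data section (the data frame is rebuilt here without the wage columns so the block runs by itself, and the names dest_summary and n_dest are only illustrative):

```r
library(dplyr)

# Same origins and destinations as the Data section (wage columns omitted)
df <- data.frame(
  Name = c("Jack", "Bill", "Emma", "Brand", "Luigui",
           "Ella", "Lucie", "Maria", "Grec"),
  Work_2011 = c("US", "US", "US", "US", "US", "US",
                "France", "France", "Italy"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE",
                "SPAIN", "SPAIN", "US")
)

# Number of distinct 2012 destinations for each 2011 origin
dest_summary <- df %>%
  group_by(Work_2011) %>%
  summarise(n_dest = n_distinct(Work_2012), .groups = "drop")

dest_summary
```

Only origins with n_dest greater than 1 survive the first filter, which is why the edited version below drops that step.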

Edit (1/13/21): Based on the edited question, we can simplify the code further.

Start by counting the number of rows for each origin-destination pair; we'll call this dest_per_cntry. This will be a new column. For Jack and Bill (US to UK), it will be 2. For Emma, Brand, Luigui, and Ella (US to FRANCE), it will be 4.

Then, group_by the 2011 country only. For each country in the Work_2011 column, keep (filter) only the rows where dest_per_cntry equals the maximum value for that country. Note that if there are ties, all rows with the maximum count will be kept.

library(tidyverse)

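df %>%
  # count rows for each origin-destination pair
  group_by(Work_2011, Work_2012) %>%
  mutate(dest_per_cntry = n()) %>%
  # within each origin, keep the destination(s) with the highest count
  group_by(Work_2011) %>%
  filter(dest_per_cntry == max(dest_per_cntry))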

Output

  Name   Work_2011 Work_2012 Wage_2011 Wage_2012 dest_per_cntry
  <chr>  <chr>     <chr>         <int>     <int>          <int>
1 Emma   US        FRANCE        34534     53455              4
2 Brand  US        FRANCE        64545      1343              4
3 Luigui US        FRANCE        15343      3144              4
4 Ella   US        FRANCE        64545      1343              4
5 Lucie  France    SPAIN         84545      1343              2
6 Maria  France    SPAIN        984545      1343              2
7 Grec   Italy     US             4545      1343              1
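
As a possible variation, the group_by plus mutate(n()) step can be written with dplyr's add_count, which adds the per-pair count in a single call. This is only a sketch under the same assumptions as above: the data frame is rebuilt without the wage columns so the block is self-contained, and result is an illustrative name.

```r
library(dplyr)

# Same origins and destinations as the Data section (wage columns omitted)
df <- data.frame(
  Name = c("Jack", "Bill", "Emma", "Brand", "Luigui",
           "Ella", "Lucie", "Maria", "Grec"),
  Work_2011 = c("US", "US", "US", "US", "US", "US",
                "France", "France", "Italy"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE",
                "SPAIN", "SPAIN", "US")
)

result <- df %>%
  # add the number of rows for each origin-destination pair
  add_count(Work_2011, Work_2012, name = "dest_per_cntry") %>%
  # within each origin, keep the destination(s) with the highest count
  group_by(Work_2011) %>%
  filter(dest_per_cntry == max(dest_per_cntry)) %>%
  ungroup()

result
```

The trailing ungroup() is optional; it just returns a plain tibble so later steps are not silently grouped by Work_2011.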

Data

df <- structure(list(Name = c("Jack", "Bill", "Emma", "Brand", "Luigui", 
"Ella", "Lucie", "Maria", "Grec"), Work_2011 = c("US", "US", 
"US", "US", "US", "US", "France", "France", "Italy"), Work_2012 = c("UK", 
"UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE", "SPAIN", "SPAIN", 
"US"), Wage_2011 = c(5387L, 43534L, 34534L, 64545L, 15343L, 64545L, 
84545L, 984545L, 4545L), Wage_2012 = c(35353L, 5343L, 53455L, 
1343L, 3144L, 1343L, 1343L, 1343L, 1343L)), class = "data.frame", row.names = c(NA, 
-9L))
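
If loading dplyr is not an option, the same idea can probably be expressed in base R with ave(). The following is a rough sketch, not part of the original answer; pair_key, pair_n, max_n, and result are illustrative names, and the data frame is rebuilt without the wage columns so the block runs by itself.

```r
# Same origins and destinations as the Data section (wage columns omitted)
df <- data.frame(
  Name = c("Jack", "Bill", "Emma", "Brand", "Luigui",
           "Ella", "Lucie", "Maria", "Grec"),
  Work_2011 = c("US", "US", "US", "US", "US", "US",
                "France", "France", "Italy"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE",
                "SPAIN", "SPAIN", "US")
)

# One key per origin-destination pair
pair_key <- paste(df$Work_2011, df$Work_2012, sep = "|")

# Number of rows sharing each origin-destination pair
pair_n <- ave(seq_along(pair_key), pair_key, FUN = length)

# Largest pair count within each origin
max_n <- ave(pair_n, df$Work_2011, FUN = max)

# Keep rows whose pair count equals the maximum for their origin (ties kept)
result <- df[pair_n == max_n, ]

result
```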

Solution 1
0 accepted, 2020-12-09 18:08:31
