How can we detect observations having different outcomes in R?

I have a large dataset in this form, with many more columns. It lists people who worked in one country in 2011 and moved to another country in 2012.

Name  Work_{2011}     Work_{2012}     Wage_{2011}    Wage_{2012} 
  
Jack     US              UK            5387           35353
Bill     US              UK            43534          5343
Emma     US              FRANCE        34534          53455
Brand    US              FRANCE        64545          1343
Luigui   US              FRANCE        15343          3144
Ella     US              FRANCE        64545          1343       
Lucie    France          SPAIN         84545          1343
Maria    France          SPAIN         984545         1343
Grec     Italy           US            4545           1343

For each departure country, I want to keep only the observations that moved to the most common destination. The result I want is:

Name  Work_{2011}     Work_{2012}     Wage_{2011}    Wage_{2012} 
  
Emma     US              FRANCE        34534          53455
Brand    US              FRANCE        64545          1343
Luigui   US              FRANCE        15343          3144
Ella     US              FRANCE        64545          1343       
Lucie    France          SPAIN         84545          1343
Maria    France          SPAIN         984545         1343
Grec     Italy           US            4545           1343

I'm not 100% sure this will meet your needs, but perhaps it will be helpful. It might help to know more details about your data, including how large your dataset is, how your columns are organized by year, etc.

In this example, you can use dplyr from the tidyverse. First, group_by Work_2011 (I removed the braces from the column names) and filter to keep the groups where the number of distinct values of Work_2012 is greater than 1. More than one distinct value means that people left that country for multiple destinations.

Second, group_by both Work_2011 and Work_2012 to count the number of people for each departure-destination pair. This count is used in a second filter, which keeps only the most common destination for each departure country.

Again, please let me know if this is the direction you were interested in.

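library(dplyr)

df %>%
  # keep departure countries with more than one distinct destination
  group_by(Work_2011) %>%
  filter(n_distinct(Work_2012) > 1) %>%
  # count the rows (people) for each departure-destination pair
  group_by(Work_2011, Work_2012) %>%
  mutate(numctry = n()) %>%
  # within each departure country, keep the most common destination
  group_by(Work_2011) %>%
  filter(numctry == max(numctry))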

Output

  Name   Work_2011 Work_2012 numctry
  <chr>  <chr>     <chr>       <int>
1 Emma   US        FRANCE          4
2 Brand  US        FRANCE          4
3 Luigui US        FRANCE          4
4 Ella   US        FRANCE          4
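
To see why only the US rows survive the first filter, it can help to count the distinct destinations per departure country. This is a minimal sketch, assuming the same two columns as in the example; the names moves and dest_counts are made up for illustration.

```r
library(dplyr)

# Hypothetical copy of the two relevant columns from the example data
moves <- data.frame(
  Work_2011 = c("US", "US", "US", "US", "US", "US", "France", "France", "Italy"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE",
                "SPAIN", "SPAIN", "US"),
  stringsAsFactors = FALSE
)

# Number of distinct 2012 destinations for each 2011 country
dest_counts <- moves %>%
  group_by(Work_2011) %>%
  summarise(n_dest = n_distinct(Work_2012))

dest_counts
```

Only the US has more than one distinct destination (UK and FRANCE), so filter(n_distinct(Work_2012) > 1) drops the France and Italy rows.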

Edit (1/13/21): Based on the edited question, we can simplify the code further.

Start by counting the number of people for each combination of departure and destination country; we'll call this dest_per_cntry. This will be a new column. For Jack and Bill (US to UK), it will be 2. For Emma, Brand, Luigui, and Ella (US to FRANCE), it will be 4.

Then, group_by the 2011 country only. For each country in the Work_2011 column, keep (filter) only the rows where dest_per_cntry equals the maximum count for that country. Note that if there are ties, all rows sharing the maximum count are kept.

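library(dplyr)  # only dplyr is needed here; library(tidyverse) also works

df %>%
  # count the rows (people) for each departure-destination pair
  group_by(Work_2011, Work_2012) %>%
  mutate(dest_per_cntry = n()) %>%
  # within each departure country, keep the pair(s) with the largest count
  group_by(Work_2011) %>%
  filter(dest_per_cntry == max(dest_per_cntry))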

Output

  Name   Work_2011 Work_2012 Wage_2011 Wage_2012 dest_per_cntry
  <chr>  <chr>     <chr>         <int>     <int>          <int>
1 Emma   US        FRANCE        34534     53455              4
2 Brand  US        FRANCE        64545      1343              4
3 Luigui US        FRANCE        15343      3144              4
4 Ella   US        FRANCE        64545      1343              4
5 Lucie  France    SPAIN         84545      1343              2
6 Maria  France    SPAIN        984545      1343              2
7 Grec   Italy     US             4545      1343              1
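
The tie behavior mentioned above can be checked on a small made-up dataset in which two destinations are equally common. This is only a sketch: the data frame tied is hypothetical, and the tie-break shown (keeping the alphabetically first destination) is an assumption, since the question does not say how ties should be resolved.

```r
library(dplyr)

# Hypothetical data: US -> UK and US -> FRANCE are equally common
tied <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Work_2011 = c("US", "US", "US", "US"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE"),
  stringsAsFactors = FALSE
)

# Same steps as in the answer; both destinations have a count of 2
kept <- tied %>%
  group_by(Work_2011, Work_2012) %>%
  mutate(dest_per_cntry = n()) %>%
  group_by(Work_2011) %>%
  filter(dest_per_cntry == max(dest_per_cntry)) %>%
  ungroup()

nrow(kept)  # all 4 rows are kept because of the tie

# One possible tie-break (an assumption, not part of the original answer):
# keep only the alphabetically first destination among the tied ones
one_dest <- kept %>%
  group_by(Work_2011) %>%
  filter(Work_2012 == min(Work_2012)) %>%
  ungroup()

one_dest$Name  # "C" "D"
```

Whether a tie-break like this is appropriate depends on what the analysis needs; keeping all tied rows may be the safer default.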

Data

df <- structure(list(Name = c("Jack", "Bill", "Emma", "Brand", "Luigui", 
"Ella", "Lucie", "Maria", "Grec"), Work_2011 = c("US", "US", 
"US", "US", "US", "US", "France", "France", "Italy"), Work_2012 = c("UK", 
"UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE", "SPAIN", "SPAIN", 
"US"), Wage_2011 = c(5387L, 43534L, 34534L, 64545L, 15343L, 64545L, 
84545L, 984545L, 4545L), Wage_2012 = c(35353L, 5343L, 53455L, 
1343L, 3144L, 1343L, 1343L, 1343L, 1343L)), class = "data.frame", row.names = c(NA, 
-9L))
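
If adding a package is not an option, the same filter can be sketched in base R with ave(). This is an alternative to the dplyr code, not part of the original answer; the block rebuilds the example data under the hypothetical name df_base so it runs on its own, and pair_n and max_n are illustrative names.

```r
# Example data rebuilt under a hypothetical name so the block is self-contained
df_base <- data.frame(
  Name = c("Jack", "Bill", "Emma", "Brand", "Luigui", "Ella",
           "Lucie", "Maria", "Grec"),
  Work_2011 = c("US", "US", "US", "US", "US", "US", "France", "France", "Italy"),
  Work_2012 = c("UK", "UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE",
                "SPAIN", "SPAIN", "US"),
  stringsAsFactors = FALSE
)

# Rows (people) per departure-destination pair, repeated on every row
pair_n <- ave(seq_len(nrow(df_base)), df_base$Work_2011, df_base$Work_2012,
              FUN = length)

# Largest pair count within each departure country
max_n <- ave(pair_n, df_base$Work_2011, FUN = max)

# Keep the rows that belong to the most common destination(s)
result <- df_base[pair_n == max_n, ]
result$Name
```

On this data the rows kept are Emma, Brand, Luigui, Ella, Lucie, Maria, and Grec, matching the dplyr output above. As with the dplyr version, ties are all kept.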
