Data transformation: from dyadic to observational data in R

I have a (directed) dyadic dataset that looks something like the one below. What I want to do now is keep just one observation per country-year: in this case, one observation for 1992 (AFG 1992) and one for 1993 (AFG 1993), deleting the other observations. It doesn't matter which observation from a given year I keep, since I'm not interested in country2.

 country1   country2    year    X   X1
Afghanistan Colombia    1992    1   0.44
Afghanistan Venezuela   1992    1   0.45
Afghanistan Peru        1992    1   0.46
Afghanistan Brazil      1992    1   0.47
Afghanistan Bolivia     1992    1   0.48
Afghanistan Chile       1992    1   0.49
Afghanistan Argentina   1992    1   0.50
Afghanistan Uruguay     1993    0   0.51
Afghanistan USA         1993    0   0.52
Afghanistan Canada      1993    0   0.53
Afghanistan UK          1993    0   0.54
Afghanistan Netherlands 1993    0   0.55
Afghanistan Belgium     1993    0   0.56
Afghanistan Luxembourg  1993    0   0.57
Afghanistan France      1993    0   0.58

My try:

newdata<- data %>% 
  group_by(country1,year) %>%
  summarise() %>%
  select(unique.x=country1, unique.y=year)

This works, but how do I keep all the other variables from "data" in "newdata"? I can't think of a practical way of doing this. Any help?

Desired outcome

    country1     year   X
    Afghanistan 1992   1
    Afghanistan 1993   0

dput(data)
structure(list(country1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Afghanistan", class = "factor"),
    country2 = structure(c(8L, 33L, 24L, 5L, 4L, 7L, 1L, 32L, 31L, 6L, 30L, 21L, 3L, 19L, 14L, 29L, 27L, 26L, 15L, 25L, 2L, 17L, 10L, 18L, 13L, 28L, 23L, 11L, 9L, 16L, 12L, 20L, 22L), .Label = c("Argentina", "Austria", "Belgium", "Bolivia, Plurinational State of", "Brazil", "Canada", "Chile", "Colombia", "Cuba", "Czech Republic", "Denmark", "Dominican Republic", "Finland", "France", "Germany", "Guinea-Bissau", "Hungary", "Italy", "Luxembourg", "Mauritania", "Netherlands", "Niger", "Norway", "Peru", "Poland", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom", "United States", "Uruguay", "Venezuela, Bolivarian Republic of"), class = "factor"),
    year = c(1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L),
    X = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    X1 = c(0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76)),
    .Names = c("country1", "country2", "year", "X", "X1"), class = "data.frame", row.names = c(NA, -33L))

newdata <- olddata[!duplicated(olddata$year),]

This answers the question as literally asked (one row per year).

newdata <- olddata[!duplicated(paste(olddata$country1, olddata$year)),]

This gives you what you want (one row per combination of country1 and year).
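To see the difference between the two one-liners, here is a minimal sketch on made-up rows. The second country, "Albania", and all values are assumptions for illustration only and are not in the question's data. It also passes the key columns to duplicated() as a data frame, which avoids building a paste() key:

```r
# Toy data shaped like the question's; "Albania" and the values are invented.
olddata <- data.frame(
  country1 = c("Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Albania"),
  country2 = c("Colombia", "Peru", "Uruguay", "Colombia", "Peru"),
  year     = c(1992L, 1992L, 1993L, 1992L, 1992L),
  X        = c(1L, 1L, 0L, 0L, 0L),
  X1       = c(0.44, 0.46, 0.51, 0.30, 0.31)
)

# Deduplicating on year alone keeps one row per year across ALL countries,
# so every Albania row is dropped here.
by_year <- olddata[!duplicated(olddata$year), ]

# duplicated() on a data frame compares the selected columns row by row,
# so this keeps the first row of each country1-year pair with all columns.
keys    <- olddata[, c("country1", "year")]
newdata <- olddata[!duplicated(keys), ]
newdata
```

With more than one country1 in the data, only the second form keeps a row for every country-year.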

I don't fully understand your question, but to get your desired output you can use:

data %>% 
  group_by(country1, year) %>%
  summarise(X = mean(X))

When you apply this to your entire data.frame, bear in mind that this code returns the mean of all values of X within each unique combination of country1 and year.
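That caveat matters once X varies within a country-year. A small sketch with invented values (the object names toy and res are illustrative, and the .groups argument assumes dplyr 1.0.0 or later) compares mean(X) with first(X), which just keeps one existing value:

```r
library(dplyr)

# Invented rows: X is NOT constant within 1992, unlike the question's sample.
toy <- data.frame(
  country1 = rep("Afghanistan", 4),
  year     = c(1992L, 1992L, 1993L, 1993L),
  X        = c(1L, 0L, 0L, 0L)
)

res <- toy %>%
  group_by(country1, year) %>%
  summarise(
    X_mean  = mean(X),   # averages the group, may yield a value no row had
    X_first = first(X),  # keeps the value from the first row of the group
    .groups = "drop"
  )
res
```

In the question's sample X is constant within each year, so both columns would agree there.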

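You can try:

data %>%
    group_by(year) %>%
    top_n(1) %>%           # no wt given, so rows are ranked by the last column (X1)
    select(country1, X)    # the grouping column year is kept automatically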
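If the goal is simply one row per country1 and year with every other column retained, dplyr's distinct() with .keep_all = TRUE is another option. A minimal sketch on a few rows taken from the question's table (the object names toy and newdata are illustrative):

```r
library(dplyr)

# Four rows copied from the question's table.
toy <- data.frame(
  country1 = c("Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"),
  country2 = c("Colombia", "Venezuela", "Uruguay", "USA"),
  year     = c(1992L, 1992L, 1993L, 1993L),
  X        = c(1L, 1L, 0L, 0L),
  X1       = c(0.44, 0.45, 0.51, 0.52)
)

# Keep the first row of each country1-year pair; .keep_all = TRUE retains
# the remaining columns (country2, X, X1) from that row.
newdata <- toy %>%
  distinct(country1, year, .keep_all = TRUE)
newdata
```

Unlike ranking on X1, this cannot return more than one row per group when values tie.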
