R: compare different columns of a data frame with different values

I am currently working with microdata from a survey called SHARE. I want to use an education variable, but the way it is coded makes that difficult.

In the survey, households are asked which degrees they hold. There is one column per degree, which takes the value 1 if the respondent has the degree and 0 otherwise. The issue is that I have two countries with different degrees sharing the same columns, so I had to go to the user manual to find, for each country, which degree each column corresponds to. I was able to do so and then translate the degrees into an international measure of education.

My idea was to sum the columns so that I end up with a single column per household. However, I could not proceed because some people hold several degrees. I would like to get the highest degree of each household, and I would appreciate your help with this.

Here are tables of what I have and what I would like:

Let's imagine that in Germany the first diploma is equivalent to the first diploma in the international standard, the second and third German diplomas both correspond to the second international diploma, and the last German diploma corresponds to the third international one. In France, first = first int., second = second int., third = third int., and there is no fourth diploma. Then I have this table:

country= c( "Germany", "Germany", "Germany", "France" , "France", "France")
degree_one= c( 1, 1, 1, 1 , 1, 1)
degree_two = c( 0, 1, 0, 1 , 1, 0)
degree_three= c( 1, 0, 1, 1 , 1, 0)
degree_four = c( 1, 0, 0, NA ,NA,  NA)

f = data.frame(country,degree_one,degree_two,degree_three,degree_four)

Then I can translate the values and try to create my degree variable by summing everything:

f$degree_one = ifelse(f$country == "Germany" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "Germany" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "Germany" & f$degree_three == 1,2,f$degree_three)
f$degree_four = ifelse(f$country == "Germany" & f$degree_four == 1,3,f$degree_four)

f$degree_one = ifelse(f$country == "France" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "France" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "France" & f$degree_three == 1,3,f$degree_three)
f$degree_four = ifelse(f$country == "France" & f$degree_four == "NA",0,f$degree_four)

f = replace(f, is.na(f), 0)

f2 = f %>% mutate(degree = degree_one + degree_two + degree_three + degree_four )

Unfortunately, it does not work. What I would like should look like this:

degree = c(3,2,2,3,3,1)
f3 = data.frame(f,degree)

I tried to do something with a while loop, but it did not work. Does anyone have an idea how I can solve my problem? I tried to make it as clear as possible; I hope it is understandable and that someone has an idea of how to fix this.

Thanks :)

Here is an approach using data.table:

library(data.table)
##
#  create degree map by country
#
degreeMap <- data.table(country=c('France', 'Germany'))
degreeMap <- degreeMap[, .(degree=paste('degree', c('one', 'two', 'three', 'four'), sep='_')), by=.(country)]
degreeMap[country=='France',  intlDegree:=c(1,2,3,NA)]
degreeMap[country=='Germany', intlDegree:=c(1,2,2,3)]
##
#   process your data
#
setDT(f)
f[, indx:=1:.N]                     # need an index column to recover original order
f[, HH:=1:.N, by=.(country)]        # need a  HH column to distinguish different HH w/in country
maxDegree <- melt(f, id=c('country', 'HH', 'indx'), variable.name='degree', value.name = 'flag')
maxDegree <- maxDegree[flag > 0]    # remove rows with flag=0 or NA
setorder(maxDegree, HH, degree)
maxDegree <- maxDegree[, .SD[.N], keyby=.(country, HH)]
maxDegree[degreeMap, intlDegree:=i.intlDegree, on=.(country, degree)]
setorder(maxDegree, indx)
maxDegree
##    country HH indx       degree flag intlDegree
## 1: Germany  1    1  degree_four    1          3
## 2: Germany  2    2   degree_two    1          2
## 3: Germany  3    3 degree_three    1          2
## 4:  France  1    4 degree_three    1          3
## 5:  France  2    5 degree_three    1          3
## 6:  France  3    6   degree_one    1          1

So this converts your f to a data.table and adds an index column and an HH column to distinguish between households (HH) within a country.

We then convert to long format using melt(...). In long format the four degree_ columns are reduced to two columns: a flag column indicating whether or not the degree applies, and a degree column indicating which degree.
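To make the reshaping step concrete, here is a minimal sketch of melt() on a toy table. The table `wide` and its `id` column are made up for illustration and are not part of the original data:

```r
library(data.table)

# Toy wide table (hypothetical data): one row per household,
# one 0/1 column per degree
wide <- data.table(id = 1:2, degree_one = c(1, 1), degree_two = c(0, 1))

# melt() stacks the degree_ columns: 'degree' holds the former column
# name (as a factor, in column order) and 'flag' holds the 0/1 value
long <- melt(wide, id.vars = "id",
             variable.name = "degree", value.name = "flag")
long
```

Each household now has one row per degree column, which is what makes the later filtering and "take the last row" steps possible.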

Then we remove all rows with 0 or NA flags and extract the last remaining row (highest degree) for each country and HH.
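As a small illustration of that filter-then-take-last step, here is a sketch on a hypothetical long table (the values are invented; only the pattern mirrors the code above). It relies on degree being a factor whose levels follow the original column order, which is what melt() produces:

```r
library(data.table)

# Hypothetical long-format data for two households
lv <- c("degree_one", "degree_two", "degree_three")
long <- data.table(
  HH     = c(1, 1, 1, 2, 2),
  degree = factor(c("degree_one", "degree_two", "degree_three",
                    "degree_one", "degree_two"), levels = lv),
  flag   = c(1, 0, 1, 1, 1)
)

kept <- long[flag > 0]        # keeps only rows whose flag is positive
setorder(kept, HH, degree)    # sorts by household, then by factor level
last <- kept[, .SD[.N], keyby = .(HH)]  # last row per household
last
```

Here household 1 ends up with degree_three and household 2 with degree_two.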

Finally, we join to degreeMap to get the equivalent intlDegree.
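The join used here is data.table's update join: columns of the table in i are referenced with the i. prefix and assigned by reference into the matching rows. A minimal sketch with made-up tables `picked` and `map` (hypothetical names standing in for maxDegree and degreeMap):

```r
library(data.table)

# Hypothetical result of the "highest national degree" step
picked <- data.table(country = c("Germany", "France"),
                     degree  = c("degree_three", "degree_three"))

# Hypothetical lookup table: national degree -> international level
map <- data.table(country    = c("Germany", "France"),
                  degree     = c("degree_three", "degree_three"),
                  intlDegree = c(2, 3))

# For each row of 'picked' matching 'map' on (country, degree),
# copy map's intlDegree into a new column of 'picked'
picked[map, intlDegree := i.intlDegree, on = .(country, degree)]
picked
```

The same national column (degree_three) maps to level 2 in Germany and level 3 in France, which is why the join is on both country and degree.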

Change NAs to 0 and then sum the degree columns:

library(dplyr)

f <- f %>%
    mutate(
        degree_one = ifelse(is.na(degree_one), 0, degree_one),
        degree_two = ifelse(is.na(degree_two), 0, degree_two),
        degree_three = ifelse(is.na(degree_three), 0, degree_three),
        degree_four = ifelse(is.na(degree_four), 0, degree_four),
        degree_sum = degree_one + degree_two + degree_three + degree_four
    )

Or, if you want to get fancy with dplyr:

f <- f %>% 
    mutate(across(contains("degree"), \(x) {ifelse(is.na(x), 0, x)})) %>% 
    mutate(degree_sum = select(., contains("degree")) %>% rowSums())
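Note that a sum adds the degrees together rather than picking the highest one. If the highest international level is what is needed, one possible variation (a sketch in base R, not part of the answer above) is to weight each 0/1 column by its international level, using the mapping described in the question, and take the row-wise maximum with pmax(). The helper names `german` and `lvl_*` are hypothetical:

```r
# Original data from the question
f <- data.frame(
  country      = c("Germany", "Germany", "Germany", "France", "France", "France"),
  degree_one   = c(1, 1, 1, 1, 1, 1),
  degree_two   = c(0, 1, 0, 1, 1, 0),
  degree_three = c(1, 0, 1, 1, 1, 0),
  degree_four  = c(1, 0, 0, NA, NA, NA)
)

german <- f$country == "Germany"

# International level of each national degree (0 if not held),
# following the mapping stated in the question
lvl_one   <- 1 * f$degree_one
lvl_two   <- 2 * f$degree_two
lvl_three <- ifelse(german, 2, 3) * f$degree_three
lvl_four  <- ifelse(german, 3, NA) * f$degree_four  # no fourth degree in France

# Row-wise maximum, ignoring missing values
f$degree <- pmax(lvl_one, lvl_two, lvl_three, lvl_four, na.rm = TRUE)
f$degree
```

With the question's data this gives 3, 2, 2, 3, 3, 1, which matches the desired degree column.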
