How to optimise these loops in R

I'm in the process of cleaning data and have ended up with a lot of for loops. Since my data set has more than 6 million rows, this is a bit of a problem for me, but I'm not sure how to avoid it.

An example of my data set (called sentencing.df) would be something like:

    Ethnicity     PersonNumber

    Caucasian     1
    Caucasian     1
    Unknown       1
    Indian        2
    Indian        2

I want to compare within the same person number - for example, I want to know whether the ethnicities for each person number are the same (and then to change the incorrect entries if they exist). My code uses for loops and looks something like this:

# vector of person numbers for those with ethnicity UNKNOWN
PersonListRace <- unique(sentencing.df[sentencing.df$ethnicity == "UNKNOWN",]$PersonNumber)
PersonListRace <- as.numeric(as.character(PersonListRace))

for (i in 1:100) {
  # all ethnicity entries recorded for that person
  race <- sentencing.df[sentencing.df$PersonNumber == PersonListRace[i],]$ethnicity
  # skip those who only have UNKNOWN or who have UNKNOWN plus multiple ethnicities
  if (length(unique(race)) != 2) {next}
  else {
    label <- as.character(unique(race[which(race != "UNKNOWN")]))
    sentencing.df[sentencing.df$PersonNumber == PersonListRace[i],]$ethnicity <- label
  }
}

I then have similar things for all my other variables, and the for loops take far too long to run. I've looked at some of the other questions and answers on the site, but my main problem is that I can't find a way to compare only within the same person number across a different variable, without using a for loop.

Anything that would help me achieve my aim in a practical timeframe would be very much appreciated :)

Neither of my concerns was addressed in the comments, so I will take the example as fully representative of the complexity of the problem (although in my experience things are rarely so simple):

dat <- read.table(text="Ethnicity     PersonNumber
     Caucasian     1
     Caucasian     1
     Unknown       1
     Indian        2
     Indian        2", header=TRUE)
dat$TrueEth <- with(dat, ave(Ethnicity, PersonNumber,
                             FUN = function(perE) {
                               unique(perE[perE != "Unknown"])
                             }))

> dat
  Ethnicity PersonNumber   TrueEth
1 Caucasian            1 Caucasian
2 Caucasian            1 Caucasian
3   Unknown            1 Caucasian
4    Indian            2    Indian
5    Indian            2    Indian
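
To mirror the rule in the question's loop (fill in a value only when a person has exactly one ethnicity other than unknown, and leave everyone else untouched), the function passed to ave can hand the group back unchanged in the ambiguous cases. What follows is only a minimal sketch: it assumes the example's column names and the label "Unknown", it reads the column as character rather than factor, fill_unknown is a hypothetical helper name, and the rows for persons 3 and 4 are made up here purely to exercise the two edge cases.

```r
# Example data; persons 3 and 4 are illustrative additions, not from the question
dat <- read.table(text="Ethnicity     PersonNumber
     Caucasian     1
     Caucasian     1
     Unknown       1
     Indian        2
     Indian        2
     Unknown       3
     Caucasian     4
     Indian        4
     Unknown       4", header=TRUE, stringsAsFactors=FALSE)

# Hypothetical helper: receives the ethnicities of one person at a time
fill_unknown <- function(perE) {
  known <- unique(perE[perE != "Unknown"])
  if (length(known) == 1) {
    perE[perE == "Unknown"] <- known   # exactly one known value: fill it in
  }
  perE                                 # otherwise return the group unchanged
}

# One grouped call replaces the explicit loop over person numbers
dat$TrueEth <- with(dat, ave(Ethnicity, PersonNumber, FUN = fill_unknown))
```

Person 1 gets Caucasian throughout, while person 3 (only Unknown) and person 4 (two different known values) keep their original entries, which matches the cases the loop skips with next.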

Two issues remain open: what to do when a person has more than one value for Ethnicity other than Unknown, and, if the answer is "majority rules", what to do when the known values are tied.
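
One possible policy for those open cases, offered only as an assumption and not something the question settles: take the most frequent known value per person, and leave the person's entries unchanged when there is a tie or no known value at all. In this sketch majority_eth is a hypothetical helper name, the data is made up to cover the three cases, and the column is assumed to be character rather than factor.

```r
# Illustrative data (not from the question)
dat <- data.frame(
  Ethnicity = c("Caucasian", "Caucasian", "Indian", "Unknown",  # person 1: clear majority
                "Caucasian", "Indian", "Unknown",               # person 2: tie
                "Unknown"),                                     # person 3: nothing known
  PersonNumber = c(1, 1, 1, 1, 2, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: majority vote among known values for one person
majority_eth <- function(perE) {
  counts <- table(perE[perE != "Unknown"])
  if (length(counts) == 0) return(perE)     # only "Unknown": nothing to infer
  top <- names(counts)[as.vector(counts) == max(counts)]
  if (length(top) > 1) return(perE)         # tie: leave unchanged for manual review
  rep(top, length(perE))                    # clear majority: use it for every row
}

dat$TrueEth <- with(dat, ave(Ethnicity, PersonNumber, FUN = majority_eth))
```

Note that under this policy the minority entry for person 1 is overwritten as well as the Unknown one; whether that is acceptable depends on how much the recorded values can be trusted.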
