
How to optimise these loops in R

I'm in the process of cleaning data and have ended up with a lot of for loops. Since my data set has more than 6 million rows, this is a bit of a problem for me, but I'm not sure how to avoid it.

An example of my data set (called sentencing.df) would be something like:

    Ethnicity     PersonNumber

    Caucasian     1
    Caucasian     1
    Unknown       1
    Indian        2
    Indian        2

I want to compare within the same person number. For example, I want to know whether the ethnicities for each person number are the same (and then to change the incorrect entries if they exist). My code uses for loops and looks something like this:

# Vector of person numbers for those with at least one ethnicity of UNKNOWN
PersonListRace <- unique(sentencing.df[sentencing.df$ethnicity == "UNKNOWN",]$PersonNumber)
PersonListRace <- as.numeric(as.character(PersonListRace))

for (i in 1:100) {
  # All ethnicity entries recorded for this person
  race <- sentencing.df[sentencing.df$PersonNumber == PersonListRace[i],]$ethnicity
  # Skip those who only have UNKNOWN or who have UNKNOWN plus multiple ethnicities
  if (length(unique(race)) != 2) {next}
  else {
    label <- as.character(unique(race[which(race != "UNKNOWN")]))
    sentencing.df[sentencing.df$PersonNumber == PersonListRace[i],]$ethnicity <- label
  }
}

I then have similar things for all my other variables, and the for loops take far too long to run. I've looked at some of the other questions and answers on the site, but my main problem is that I can't find a way to compare values of another variable only within the same person number without using a for loop.

Anything that would help me achieve my aim in a practical timeframe would be very much appreciated :)

Neither of my concerns was addressed in the comments, so I will just take the example as being fully representative of the complexity of the problem (although my experience is that things are rarely so simple):

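dat <- read.table(text="Ethnicity     PersonNumber
     Caucasian     1
     Caucasian     1
     Unknown       1
     Indian        2
     Indian        2", header=TRUE, stringsAsFactors=FALSE)

# For each PersonNumber, replace every entry with that person's single known
# (non-"Unknown") ethnicity. Assumes exactly one known value per person.
dat$TrueEth <- with(dat, ave(Ethnicity, PersonNumber,
                             FUN=function(perE) {
                               unique(perE[perE != "Unknown"])
                             }))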

> dat
  Ethnicity PersonNumber   TrueEth
1 Caucasian            1 Caucasian
2 Caucasian            1 Caucasian
3   Unknown            1 Caucasian
4    Indian            2    Indian
5    Indian            2    Indian

The outstanding issues are what to do when a person has more than one known value for Ethnicity and, if the answer is "majority rules", what to do when there is a tie among the non-Unknown values.
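One possible way to sidestep the first of those issues, as a minimal sketch rather than a definitive fix: only fill in a value when a person has exactly one distinct known ethnicity, and leave everything else untouched. This seems to be what the loop in the question does with its `length(unique(race)) != 2` check. The helper name `fill_unknown` and the extra toy rows (persons 3 and 4) are made up for illustration and are not from the original data.

```r
# Toy data: persons 1 and 2 are from the example above; persons 3 and 4 are
# hypothetical rows added to show a conflict and an all-Unknown case.
dat <- data.frame(
  Ethnicity = c("Caucasian", "Caucasian", "Unknown",
                "Indian", "Indian",
                "Caucasian", "Unknown", "Indian",
                "Unknown", "Unknown"),
  PersonNumber = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: if the person has exactly one distinct known
# ethnicity, use it for all of their rows; otherwise return the entries
# unchanged (conflicting values or nothing but "Unknown").
fill_unknown <- function(perE) {
  known <- unique(perE[perE != "Unknown"])
  if (length(known) == 1) rep(known, length(perE)) else perE
}

# ave() applies the helper within each PersonNumber group and puts the
# results back in the original row order.
dat$TrueEth <- with(dat, ave(Ethnicity, PersonNumber, FUN = fill_unknown))
```

With this toy data, person 1's Unknown row becomes Caucasian, while person 3 (conflicting values) and person 4 (only Unknown) are left as they were. Because the helper always returns a vector of the same length as its input, it also avoids the recycling problems that the bare `unique()` version can run into when a group has zero or several known values.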


 