简体   繁体   中英

How to make the following “apply” function in R more efficient?

I am trying to bucket certain features into groups. The data.frame below (grouped) is my "key" (think Excel vlookup):

          Original  Grouped
1         Features Constant
2     PhoneService Constant
3    PhoneServices Constant
4       Surcharges Constant
5     CallingPlans Constant
6            Taxes Constant
7          LDUsage    Noise
8    RegionalUsage    Noise
9       LocalUsage    Noise
10       Late fees    Noise
11 SpecialServices    Noise
12         TFUsage    Noise
13       VoipUsage    Noise
14         CCUsage    Noise
15         Credits  Credits
16         OneTime  OneTime

I then reference my database which has a column (BillSection) that takes on a specific value from grouped$Original, and I want to group it according to grouped$Grouped. I am using the sapply function to perform this operation. Then I cbind the resulting output to my original data.frame.

grouper<-as.character(sapply(as.character(bill.data$BillSection[1:100]), # for the first 100 records of the data.frame bill.data
       function(x)grouped[grouped$Original==x,2])) # take the second column, i.e. Grouped, for the corresponding "TRUE" value in Original
cbind(bill.data[1:100,],as.data.frame(grouper))

The above code works as expected, but it's slow when I apply it to my whole database, which exceeds 10,000,000 unique records. Is there an alternative to this method? I know I can use plyr, but it's even slower (I think) than sapply. I was trying to figure it out with data.table but no luck. Any suggestions would be helpful. I am open to coding this in Python, which I am new to, but heard is much faster than R, since I am dealing with large datasets very often. I wanted to know if R can do this fast enough to be useful.

Thanks!

I'm not sure I understand your question, but can you use merge() ? ie something like...

merge(big.df, group.names.df, by.x='orginal.column.in.big.df',
      by.y='original', all.x=T)

NB. Plyr has a parallel option...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM