
Assign new values to levels in R

All,

I have a large data set (over 2 million rows), and one of its columns has the following levels:

"0"     "0.001" "1"     "4"     "4.001" "8.001"

I want to make a new column where each of those has a new, corresponding letter:

0 = x, 0.001 = D, 1 = C, 4 and 4.001 = B, and 8.001 = A

Is there a way to do this without using a for loop with 6 if statements? I tried that, and it was taking forever to run.

Here's a test sample:

      a b
1 0.000 x
2 4.000 B
3 1.000 C
4 0.001 D
5 1.000 C
6 4.000 B
7 4.001 B
8 1.000 C
9 8.001 A

Thank you.

The easiest way would be to create a key/value dataset and join it with the original data:

keyval <- data.frame(a = c(0, 0.001, 1, 4, 4.001, 8.001), 
     b = c('x', 'D', 'C', 'B', 'B', 'A'), stringsAsFactors= FALSE)
library(data.table)
setDT(df1)[keyval, b := b, on = .(a)]
df1
#       a b
#1: 0.000 x
#2: 4.000 B
#3: 1.000 C
#4: 0.001 D
#5: 1.000 C
#6: 4.000 B
#7: 4.001 B
#8: 1.000 C
#9: 8.001 A

Data:

df1 <- structure(list(a = c(0, 4, 1, 0.001, 1, 4, 4.001, 1, 8.001)), 
    .Names = "a", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"), class = "data.frame")
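
For readers without data.table, the same key/value lookup can be sketched in base R with match(). This is a swapped-in alternative, not the join shown above; it reuses the `keyval` and `df1` objects from this answer and assumes the values in `a` are stored exactly as typed (no floating-point drift from earlier arithmetic).

```r
# Key/value table: each value of `a` and the letter it maps to
keyval <- data.frame(a = c(0, 0.001, 1, 4, 4.001, 8.001),
                     b = c("x", "D", "C", "B", "B", "A"),
                     stringsAsFactors = FALSE)

df1 <- data.frame(a = c(0, 4, 1, 0.001, 1, 4, 4.001, 1, 8.001))

# match() returns, for each row of df1, the position of its value in keyval$a;
# indexing keyval$b by that position does the whole lookup in one vectorized step
df1$b <- keyval$b[match(df1$a, keyval$a)]
df1
```

Values of `a` that are missing from `keyval` come back as NA, which makes unmapped levels easy to spot.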

I do not believe there is a single-line command that can do it for you. By the way, for loops over rows are slow in R compared with vectorized operations and are not recommended for large data sets.

Option 1:
What you may want to try is logical indexing, which selects rows with a vector of TRUE/FALSE values (conceptually a bit array).

df$NewColumn <- NA_character_   # create the new column first

idx <- df$a == "0"              # TRUE for rows whose value is 0
df$NewColumn[idx] <- "x"

idx <- df$a == "4"
df$NewColumn[idx] <- "B"

and so on for the remaining values.
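
A minimal sketch of how the logical-indexing approach could be completed for all six values. The vectors `old` and `new` are names introduced here for illustration, and the sample data frame is the question's test sample. The loop runs once per level (six times), not once per row, so each step is still vectorized.

```r
df <- data.frame(a = c(0, 4, 1, 0.001, 1, 4, 4.001, 1, 8.001))

# Old values (written as they appear in the levels) and their replacement letters
old <- c("0", "0.001", "1", "4", "4.001", "8.001")
new <- c("x", "D", "C", "B", "B", "A")

df$NewColumn <- NA_character_            # start with an empty character column
for (i in seq_along(old)) {              # six iterations, one per level
  idx <- as.character(df$a) == old[i]    # logical vector marking the matching rows
  df$NewColumn[idx] <- new[i]
}
df
```

Converting with as.character() first means the comparison behaves the same whether `a` is numeric, character, or a factor.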

Option 2:
Use revalue from plyr, which is simpler to write but could be more compute-intensive than option 1. It should still easily work for your data size.

library(plyr)
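# revalue() needs a factor or character vector, and the old values must be quoted names
df$NewColumn <- revalue(as.character(df$a), c("0" = "x", "0.001" = "D", "1" = "C", "4" = "B", "4.001" = "B", "8.001" = "A"))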

For either option, make sure the data type (class) of the column is handled correctly. From your example, it's hard for me to tell whether the data is factor or numeric, but either way it's a simple change to make in my sample code.
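
To illustrate why the class matters, here is a small sketch with made-up vectors (`a_num` and `a_fac` are hypothetical names, not from the question) showing how the same values behave when stored as numeric versus factor.

```r
a_num <- c(0, 4, 1, 0.001)                    # numeric column
a_fac <- factor(c("0", "4", "1", "0.001"))    # the same values stored as a factor

class(a_num)   # "numeric"
class(a_fac)   # "factor"

# as.character() gives the same strings for both, so recoding by label works either way
as.character(a_num)
as.character(a_fac)

# Caution: as.numeric() on a factor returns the internal level codes, not the values
as.numeric(a_fac)

# Go through character to recover the original numbers
as.numeric(as.character(a_fac))
```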

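Try factor(x, levels = c(...), labels = c(...)), listing the existing levels and their new values separated by commas. (Note that as.factor() itself does not accept a levels argument.)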
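
A minimal sketch of the factor-based relabelling in base R, using factor() with its `levels` and `labels` arguments on the question's sample values. It assumes a reasonably recent R: repeating a label to merge two levels is, to my knowledge, supported from R 3.4.0 onward.

```r
a <- c(0, 4, 1, 0.001, 1, 4, 4.001, 1, 8.001)

# levels lists the existing values; labels gives the replacement for each, in the same order.
# Repeating the label "B" maps both 4 and 4.001 to the same level.
b <- factor(a,
            levels = c(0, 0.001, 1, 4, 4.001, 8.001),
            labels = c("x", "D", "C", "B", "B", "A"))
as.character(b)
```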

I would try this, though I'm not sure about the runtime:

library(forcats)
df = data.frame(a = c("0", "0.001", "1", "4", "4.001", "8.001"))
df$b <- fct_recode(df$a,
               x = "0",
               D = "0.001",
               C = "1",
               B = "4",
               B = "4.001",
               A = "8.001")

