Convert Number to Factor using Labels in R

I have a column in my dataset that contains many different numeric values. However, three of the values have a specific label, while all others share a general label. Going through the dataset one by one is not an option: it is very large, with 167K observations.

Below are all the unique values in the column:

> unique(NYC_2019_Arrests$JURISDICTION_CODE)
Levels: 0 1 2 3 4 6 7 9 11 12 13 14 15 16 69 71 72 73 74 76 79 85 87 88 97

The levels of JURISDICTION_CODE are defined as follows:

JURISDICTION_CODE - Jurisdiction responsible for arrest. Jurisdiction codes 0 (Patrol), 1 (Transit) and 2 (Housing) represent NYPD whilst codes 3 and more represent non-NYPD jurisdictions.

This is the code I tried, but it just returns an error:

> NYC_2019_Arrests$JURISDICTION_CODE <- factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0,1,2, 3:100), labels = c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction"))
Error in factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0, 1, 2,  : 
  invalid 'labels'; length 4 should be 1 or 101

I also tried the above code with 3:100 removed from the levels while leaving the fourth label in, but that did not work either.

It would be greatly appreciated if anybody here knows how to give all values 3 and above the generic label without having to type out all of the numbers individually.

Thanks!

The error message provides some direction. The problem is that the labels vector has length 4 but your levels vector has length 101. I think you are almost there with the original code. Just make labels the correct length by repeating the generic label (since R 3.4.0, factor() accepts duplicated labels and merges them into a single level):

reps<-rep("Non-NYPD Jurisdiction",98)
NYC_2019_Arrests$JURISDICTION_CODE <- factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0,1,2, 3:100), labels = c("Patrol", "Transit", "Housing", reps))

Edit with explanation:

Run this code for additional explanation.

# The key is that labels needs the same vector length as levels

# Length of levels
levels <- c(0, 1, 2, 3:100)
print(length(levels))  # 101
# Length of the original labels
labels <- c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction")
print(length(labels))  # 4
# This is problematic: the 5th level onward has no label, because labels[5] is NA.
# Therefore "Non-NYPD Jurisdiction" must be repeated once for each remaining level.
# length(3:100) is 98, which is how we know we need 98 copies.
reps <- rep("Non-NYPD Jurisdiction", 98)
labels <- c("Patrol", "Transit", "Housing", reps)
print(length(labels))  # 101
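
The same idea can be checked end to end on a small made-up vector. This is only a sketch: `codes` is a hypothetical sample, not the real NYC_2019_Arrests data, and it assumes R 3.4.0 or later, where factor() merges duplicated labels into one level.

```r
# Hypothetical sample of jurisdiction codes (not the real dataset)
codes <- c(0, 1, 2, 3, 69, 97, 1, 0)

# One label per level: three specific labels, then the generic label
# repeated once for every remaining level
other_levels <- 3:100
lab <- c("Patrol", "Transit", "Housing",
         rep("Non-NYPD Jurisdiction", length(other_levels)))

jurisdiction <- factor(codes, levels = c(0, 1, 2, other_levels), labels = lab)

# Duplicated labels collapse into a single level (R >= 3.4.0),
# so the result has four levels rather than 101
print(levels(jurisdiction))
print(table(jurisdiction))
```

Using length(other_levels) instead of a hard-coded 98 keeps the labels in step with the levels if the range ever changes.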

There are several ways to solve this. The simplest and best way I can think of is to use case_when from dplyr. Here is an example:

library(dplyr)

case_when(mtcars$carb == 1 ~ "One",
          mtcars$carb == 2 ~ "Two",
          mtcars$carb >= 3 ~ "Three or More")
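
Applied to the question's column, the same pattern might look like the sketch below. The data frame `arrests` and the new column `JURISDICTION` are hypothetical stand-ins for illustration. The printed "Levels:" line in the question suggests the column is already a factor, so the sketch converts it through character first.

```r
library(dplyr)

# Hypothetical stand-in for the real data (not NYC_2019_Arrests itself)
arrests <- data.frame(JURISDICTION_CODE = factor(c(0, 1, 2, 3, 69, 97)))

# Convert factor -> character -> numeric; calling as.numeric() directly
# on a factor would return the internal level indices instead of the codes
code <- as.numeric(as.character(arrests$JURISDICTION_CODE))

# Map the three specific codes, then everything from 3 upward,
# and fix the level order explicitly
arrests$JURISDICTION <- factor(
  case_when(code == 0 ~ "Patrol",
            code == 1 ~ "Transit",
            code == 2 ~ "Housing",
            code >= 3 ~ "Non-NYPD Jurisdiction"),
  levels = c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction")
)

print(table(arrests$JURISDICTION))
```

This version does not need an upper bound such as 100, so any code of 3 or more falls into the generic label.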
