
R factor and level

Levels make sense: they are the unique values of the vector. But I can't get my head around what a factor is. It just seems to repeat the vector values.

factor(c(1,2,3,3,4,5,1))
[1] 1 2 3 3 4 5 1
Levels: 1 2 3 4 5

Can anyone explain what a factor is supposed to do, or why I would use it?

I'm starting to wonder if factors are like a code table in a database, where the factor name is the code table name and the levels are the unique options of the code table?

A factor is stored as a vector of integer codes plus a lookup table of level labels, rather than as a raw character vector. What does this imply? There are two major benefits.

  1. A smaller memory footprint. Consider a text file containing the phrase "New Jersey" 100,000 times, encoded in ASCII. Now imagine you only had to store the number 16 (in binary) 100,000 times, plus a separate table indicating that 16 means "New Jersey". It's leaner and faster.

  2. Especially for visualization and statistical analysis, we frequently work with values "across all categories" (think ANOVA, or what you would color a stacked barplot by). We could either repeatedly write every function to collect the observed choices from a string vector, or simply create a new type of vector that tells you what the valid choices are. That is called a factor, and the valid choices are called levels.
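
To make both points concrete, here is a minimal sketch in R. The state names and the extra "Utah" level are made up for illustration, and the size comparison at the end is only indicative, since the exact numbers depend on the R version and platform.

```r
# Hypothetical data: these state names are made up for illustration.
states <- c("New Jersey", "Ohio", "New Jersey", "Texas", "Ohio")
f <- factor(states)

# A factor is an integer vector of codes plus a "levels" attribute.
typeof(f)                # "integer"
codes <- as.integer(f)   # 1 2 1 3 2
levels(f)                # "New Jersey" "Ohio" "Texas"

# Looking the codes up in the levels recovers the original strings.
restored <- levels(f)[codes]
identical(restored, states)   # TRUE

# Levels can declare valid choices that were never observed;
# table() still reports them, with a count of zero.
g <- factor(states, levels = c("New Jersey", "Ohio", "Texas", "Utah"))
counts <- table(g)
counts[["Utah"]]   # 0

# Rough size comparison (assumption: results vary by R version and platform).
big_chr <- rep("New Jersey", 100000)
big_fct <- factor(big_chr)
object.size(big_chr)
object.size(big_fct)
```

So the database analogy in the question is a fair one: the integer codes play the role of foreign keys, and the levels play the role of the code table.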


 