
Categorize dataset in R

I am having an issue with categorizing a dataset.

The dataset is a matrix in which the rows are observations and the columns are features. Each feature value is between 0 and 1. The dataset is used for training, and since the method I am going to use is very sensitive to small variations, the dataset has to be formatted so that it is not affected by them.

My idea was that, instead of providing the raw data, I bin the feature values according to their numeric value and provide the middle value of each bin as the training data.

For example, with the bins being (1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, 9-10):

> dataset  # original dataset
     [,1] [,2] [,3] [,4] [,5]
[1,]  8.1  5.3 10.0  4.4  4.6
[2,]  5.2 10.0  3.2  9.3  3.5
[3,]  7.3  1.6  9.0  8.9  8.4
[4,]  6.4  2.8  8.0  6.5  9.3
[5,] 10.0  4.3  2.2  1.1  5.3

> transformed_dataset  # binned dataset
     [,1] [,2] [,3] [,4] [,5]
[1,]  8.5  5.5  9.5  4.5  4.5
[2,]  5.5  9.5  3.5  9.5  3.5
[3,]  7.5  1.5  8.5  8.5  8.5
[4,]  6.5  2.5  8.5  6.5  9.5
[5,]  9.5  4.5  2.5  1.5  5.5

I am not sure how I should bin the data like this and give it as input to naiveBayes from library("e1071"). I know that signif can round a value to a given number of significant digits, and thus "bin" it, but that way I can't actually control the number of bins.

Binning seems to be a way to improve the classification, but I am not certain how to provide the binned data as input.

Update about the data.frame

I think I forgot to mention it, but the data is stored in a data.frame, and I access the data with $data. The data.frame also provides a label for each observation, which can be accessed with $labels.

Hm. You may have some trouble with the data types here, because the matrix class does not work well with factors, and binning intervals are best described by factors.
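To illustrate the data type issue, here is a minimal sketch (the variable names f, m and df are made up for this example): putting a factor into a matrix drops the factor class and leaves plain character strings, while a data.frame column keeps it.

```r
# A small factor of binning intervals
f <- cut(c(0.2, 0.7), breaks = c(0, 0.5, 1))

# matrix() strips the factor attributes; the cells become character strings
m <- matrix(f, nrow = 1)
typeof(m)      # "character"

# A data.frame column keeps the factor class and its levels
df <- data.frame(bin = f)
class(df$bin)  # "factor"
```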

To do the binning, you can use the cut function from base R, such as:

> data <- c(1,2,4,1,5,3,3,5,2,2,5,5,5,7,8,9,5,3,2,6,8,9,3,1)
> breaks <- c(0, 3, 6, 9)
> cut(data, breaks=breaks)
 [1] (0,3] (0,3] (3,6] (0,3] (3,6] (0,3] (0,3] (3,6] (0,3] (0,3] (3,6] (3,6]
[13] (3,6] (6,9] (6,9] (6,9] (3,6] (0,3] (0,3] (3,6] (6,9] (6,9] (0,3] (0,3]
Levels: (0,3] (3,6] (6,9]

Or, using intervals that are closed on the left:

> cut(data, breaks=breaks, right=FALSE)
 [1] [0,3) [0,3) [3,6) [0,3) [3,6) [3,6) [3,6) [3,6) [0,3) [0,3) [3,6) [3,6)
[13] [3,6) [6,9) [6,9) <NA>  [3,6) [3,6) [0,3) [6,9) [6,9) <NA>  [3,6) [0,3)
Levels: [0,3) [3,6) [6,9)

Notice that the breaks you provide should cover the entire dataset, or else you will get some NAs.
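As a sketch of two ways to avoid those NAs (the names closed, open_breaks and open_ended are only for illustration): include.lowest = TRUE closes the outermost interval so that a value sitting exactly on the edge is kept, and infinite outer breaks catch anything outside the expected range.

```r
data <- c(1,2,4,1,5,3,3,5,2,2,5,5,5,7,8,9,5,3,2,6,8,9,3,1)
breaks <- c(0, 3, 6, 9)

# With right = FALSE, include.lowest = TRUE closes the last interval,
# so the value 9 falls into [6,9] instead of becoming NA
closed <- cut(data, breaks = breaks, right = FALSE, include.lowest = TRUE)
levels(closed)  # "[0,3)" "[3,6)" "[6,9]"
anyNA(closed)   # FALSE

# Alternatively, open-ended outer breaks cover every finite value
open_breaks <- c(-Inf, 3, 6, Inf)
open_ended <- cut(data, breaks = open_breaks, right = FALSE)
anyNA(open_ended)  # FALSE
```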

A simple solution could be like this:

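# Example data from the question, filled row by row
d <- matrix(c(8.1, 5.3, 10, 4.4, 4.6,
              5.2, 10, 3.2, 9.3, 3.5,
              7.3, 1.6, 9, 8.9, 8.4,
              6.4, 2.8, 8, 6.5, 9.3,
              10, 4.3, 2.2, 1.1, 5.3), nrow=5, ncol=5, byrow=TRUE)

# Convert to a data.frame first, then bin each column to a half-integer
# midpoint and store it as a factor. lapply() keeps the factor class;
# apply() would return a character matrix, which as.data.frame() no longer
# turns into factors by default since R 4.0.
d <- as.data.frame(d)
d[] <- lapply(d, function(column) {
  as.factor(round(column+0.5)-0.5)
})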

This leads to the following result:

> d
   V1  V2  V3  V4  V5
1 8.5 5.5 9.5 4.5 4.5
2 5.5 9.5 3.5 9.5 3.5
3 7.5 1.5 9.5 8.5 8.5
4 6.5 2.5 7.5 6.5 9.5
5 9.5 4.5 2.5 1.5 5.5

After the transformation, the columns of your dataset are factors, meaning that naiveBayes will treat them as categorical variables rather than numeric ones.

> class(d[,1])
[1] "factor"
> levels(d[,1])
[1] "5.5" "6.5" "7.5" "8.5" "9.5"
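A minimal sketch of how such a data frame of factors could then be passed to naiveBayes, assuming the e1071 package is installed. The class labels below are hypothetical placeholders (in the question they would come from $labels), and the names raw, features, model and predicted are made up for this example.

```r
# Binned features as factors (same rounding trick as above)
raw <- matrix(c(8.1, 5.3, 10, 4.4, 4.6,
                5.2, 10, 3.2, 9.3, 3.5,
                7.3, 1.6, 9, 8.9, 8.4,
                6.4, 2.8, 8, 6.5, 9.3,
                10, 4.3, 2.2, 1.1, 5.3), nrow = 5, byrow = TRUE)
features <- as.data.frame(raw)
features[] <- lapply(features, function(column) {
  factor(round(column + 0.5) - 0.5)
})

# Hypothetical class labels, one per observation (not from the question)
labels <- factor(c("a", "b", "a", "b", "a"))

# Only fit the model if e1071 is available
if (requireNamespace("e1071", quietly = TRUE)) {
  # laplace = 1 smooths the counts of bin values unseen for a class
  model <- e1071::naiveBayes(features, labels, laplace = 1)
  predicted <- predict(model, features)
  print(table(predicted, labels))
}
```

With only five observations this says nothing about accuracy; it only shows the shape of the call.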

Note that the trick of adding and subtracting 0.5 will fail if you have any value equal to 0: since round(0.5) is 0 in R, the value ends up in level "-0.5" instead of "0.5". You could solve this by adding this line to the function:

column[which(column == 0)] <- 0.5
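An alternative sketch that sidesteps the boundary cases is to let cut pick the bin and then look up its midpoint. The helper bin_to_midpoint is a hypothetical name, and the breaks 0:10 are an assumption chosen to cover the example data. Note that with right-closed intervals a value on a boundary goes to the lower bin (9 becomes 8.5, 8 becomes 7.5), whereas the rounding trick sends 8 to 7.5 but 9 to 9.5, because R rounds halves to the even number.

```r
# Hypothetical helper: replace each value by the midpoint of its bin
bin_to_midpoint <- function(x, breaks) {
  mids <- (breaks[-length(breaks)] + breaks[-1]) / 2
  # labels = FALSE makes cut() return integer bin codes
  codes <- cut(as.vector(x), breaks = breaks,
               include.lowest = TRUE, labels = FALSE)
  mids[codes]
}

breaks <- 0:10  # assumed range of the data

raw <- matrix(c(8.1, 5.3, 10, 4.4, 4.6,
                5.2, 10, 3.2, 9.3, 3.5,
                7.3, 1.6, 9, 8.9, 8.4,
                6.4, 2.8, 8, 6.5, 9.3,
                10, 4.3, 2.2, 1.1, 5.3), nrow = 5, byrow = TRUE)

# Assigning into binned[] keeps the matrix dimensions
binned <- raw
binned[] <- bin_to_midpoint(raw, breaks)

binned[1, ]                  # 8.5 5.5 9.5 4.5 4.5
bin_to_midpoint(0, breaks)   # 0.5

# Convert to factors column by column if categorical input is wanted
binned_factors <- as.data.frame(lapply(as.data.frame(binned), factor))
```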

Hope it helps.
