Categorize dataset in R
I am having an issue with categorizing a dataset.
The dataset is a matrix in which the rows are observations and the columns are features. Each feature value is between 0 and 1. The dataset is used for training, and since the method I am going to use is very sensitive to small variations, the dataset has to be formatted so that it is less sensitive to them.
My idea is that, instead of providing the raw data, I bin the feature values according to their numeric value and provide the middle value of each bin as the training data.
For example, with the bins being (1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, 9-10):
> dataset # original dataset
     [,1] [,2] [,3] [,4] [,5]
[1,]  8.1  5.3   10  4.4  4.6
[2,]  5.2   10  3.2  9.3  3.5
[3,]  7.3  1.6    9  8.9  8.4
[4,]  6.4  2.8    8  6.5  9.3
[5,]   10  4.3  2.2  1.1  5.3
> transformed_dataset # binned dataset
     [,1] [,2] [,3] [,4] [,5]
[1,]  8.5  5.5  9.5  4.5  4.5
[2,]  5.5  9.5  3.5  9.5  3.5
[3,]  7.5  1.5  8.5  8.5  8.5
[4,]  6.5  2.5  8.5  6.5  9.5
[5,]  9.5  4.5  2.5  1.5  5.5
I am not sure how I should bin the data like this and give it as input to naiveBayes from library("lattice"). I know that signif is capable of rounding a value to a given number of digits, and thus "binning" it, but with it I can't actually control the number of bins.
Binning seems to be a way to improve the classification, but I am not certain how to provide the binned data as input.
Update about the data.frame
I think I forgot to mention it, but the data is stored in a data.frame, and I access the data through $data. The data.frame also provides a label for each observation, which can be accessed through $labels.
Hm. You may have some trouble with the data types here, because the matrix class does not work well with factors, and the binning intervals are best described by factors.
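To illustrate the data type issue, here is a minimal sketch in base R. The names f, m and df are only illustrative. It shows that a factor placed into a matrix loses its factor class, while a data.frame column keeps it:

```r
# A small factor of bin intervals, built with cut() from base R
f <- cut(c(1, 4, 7), breaks = c(0, 3, 6, 9))

# matrix() drops the factor attributes: the result holds plain strings
m <- matrix(f, nrow = 1)
is.factor(m)   # FALSE
typeof(m)      # "character"

# A data.frame column keeps the factor intact
df <- data.frame(bin = f)
is.factor(df$bin)   # TRUE
```

This is why the rest of the answer ends up with a data.frame rather than a matrix.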
In order to do the binning, you can use the cut function from the base R installation, such as:
> data <- c(1,2,4,1,5,3,3,5,2,2,5,5,5,7,8,9,5,3,2,6,8,9,3,1)
> breaks <- c(0, 3, 6, 9)
> cut(data, breaks=breaks)
[1] (0,3] (0,3] (3,6] (0,3] (3,6] (0,3] (0,3] (3,6] (0,3] (0,3] (3,6] (3,6]
[13] (3,6] (6,9] (6,9] (6,9] (3,6] (0,3] (0,3] (3,6] (6,9] (6,9] (0,3] (0,3]
Levels: (0,3] (3,6] (6,9]
Or, using left-closed intervals:
> cut(data, breaks=breaks, right=FALSE)
[1] [0,3) [0,3) [3,6) [0,3) [3,6) [3,6) [3,6) [3,6) [0,3) [0,3) [3,6) [3,6)
[13] [3,6) [6,9) [6,9) <NA> [3,6) [3,6) [0,3) [6,9) [6,9) <NA> [3,6) [0,3)
Levels: [0,3) [3,6) [6,9)
Notice that the breaks you provide should cover the entire dataset, or else you will get some NAs, as happens above with the value 9, which falls outside [6,9).
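One way to avoid the NA at the outer edge is the include.lowest argument of cut, which closes the outermost interval. The sketch below reuses the data and breaks from above. The names open_top and closed_top are only illustrative:

```r
data <- c(1,2,4,1,5,3,3,5,2,2,5,5,5,7,8,9,5,3,2,6,8,9,3,1)
breaks <- c(0, 3, 6, 9)

# With right = FALSE the last interval is [6,9), so the two 9s become NA
open_top <- cut(data, breaks = breaks, right = FALSE)

# include.lowest = TRUE closes the outermost interval, giving [6,9]
closed_top <- cut(data, breaks = breaks, right = FALSE, include.lowest = TRUE)

sum(is.na(open_top))     # 2
sum(is.na(closed_top))   # 0
levels(closed_top)       # "[0,3)" "[3,6)" "[6,9]"
```

Checking range(data) against the breaks before binning is another simple way to spot values that would fall outside.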
A simple solution could be like this:
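d <- matrix(c(8.1, 5.3, 10, 4.4, 4.6,
              5.2, 10, 3.2, 9.3, 3.5,
              7.3, 1.6, 9, 8.9, 8.4,
              6.4, 2.8, 8, 6.5, 9.3,
              10, 4.3, 2.2, 1.1, 5.3), nrow=5, ncol=5, byrow=TRUE)
# Map each value to the midpoint of its unit-width bin, e.g. 8.1 -> 8.5.
# apply() returns a character matrix here, so stringsAsFactors=TRUE is
# needed to get factor columns in R >= 4.0.0 (it was the default before).
d <- as.data.frame(apply(d, 2, function(column) {
  as.factor(round(column+0.5)-0.5)
}), stringsAsFactors=TRUE)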
Leading to the result:
> d
V1 V2 V3 V4 V5
1 8.5 5.5 9.5 4.5 4.5
2 5.5 9.5 3.5 9.5 3.5
3 7.5 1.5 9.5 8.5 8.5
4 6.5 2.5 7.5 6.5 9.5
5 9.5 4.5 2.5 1.5 5.5
After the transformation, the columns of your dataset are factors, meaning that naiveBayes will not treat them as numeric but as categorical variables.
> class(d[,1])
[1] "factor"
> levels(d[,1])
[1] "5.5" "6.5" "7.5" "8.5" "9.5"
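If the bin midpoints are needed again as numbers at some point, for example to inspect them, note that as.numeric on a factor returns the internal level codes, not the labels. A small sketch, where f is just an illustrative stand-in for the first column:

```r
# Stand-in for the first binned column shown above
f <- factor(c(8.5, 5.5, 7.5, 6.5, 9.5))

as.numeric(f)                 # level codes: 4 1 3 2 5
as.numeric(as.character(f))   # midpoints:   8.5 5.5 7.5 6.5 9.5
```

Going through as.character first recovers the midpoint values.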
Note that the trick of adding and removing 0.5 will fail if you have any value equal to 0: round(0.5) is 0 in R, so the value is assigned to level "-0.5" instead of "0.5". You could solve it by adding this line to the function:
column[which(column == 0)] <- 0.5
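A related detail is that R's round sends exact halves to the even neighbour. With the rounding trick this puts whole numbers in the bin below or the bin above, depending on parity. That is why 8 became 7.5 but 9 became 9.5 in the result above. If that matters, an alternative sketch is to bin with cut and label each bin by its midpoint. The bin edges 0 to 10 are an assumption chosen to match the example data, and the names breaks, mids, bin_column and binned are only illustrative:

```r
round(c(0.5, 8.5, 9.5))   # 0 8 10: halves go to the even neighbour

d <- matrix(c(8.1, 5.3, 10, 4.4, 4.6,
              5.2, 10, 3.2, 9.3, 3.5,
              7.3, 1.6, 9, 8.9, 8.4,
              6.4, 2.8, 8, 6.5, 9.3,
              10, 4.3, 2.2, 1.1, 5.3), nrow = 5, ncol = 5, byrow = TRUE)

breaks <- 0:10               # assumed bin edges
mids <- breaks[-1] - 0.5     # bin midpoints 0.5, 1.5, ..., 9.5

# Bins are [0,1], (1,2], ..., (9,10]; include.lowest = TRUE keeps the value 0
bin_column <- function(column) {
  cut(column, breaks = breaks, labels = mids, include.lowest = TRUE)
}

# lapply over a data.frame keeps every column as its own factor
binned <- as.data.frame(lapply(as.data.frame(d), bin_column))
as.character(binned$V3)   # "9.5" "3.5" "8.5" "7.5" "2.5"
```

Here whole numbers always land in the bin below them, and every column shares the same ten levels even when some bins are empty.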
Hope it helps.