Selecting features in Python

I am trying to implement the algorithm described in http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf

import pandas as pd
import pathlib
import gaitrec
from tsfresh import extract_features
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances(A_q[i, :].reshape(1,-1), cluster_centers[c, :].reshape(1,-1))[0][0]
            dists[c].append((i, dist))
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]


p = pathlib.Path(gaitrec.__file__).parent
dataset_file = p / 'DatasetC' / 'subj_001' / 'walk0' / 'subj_0010.csv'
read_csv = pd.read_csv(dataset_file, sep=';', decimal='.', names=['time','x','y', 'z', 'id'])
read_csv['id'] = 0

if __name__ == '__main__':
    print(read_csv)
    extracted_features = extract_features(read_csv, column_id="id", column_sort="time")
    features_withno_nanvalues = extracted_features.dropna(how='all', axis=1)
    print(features_withno_nanvalues)
    X = features_withno_nanvalues.to_numpy()
    pfa = PFA(n_features=2274, q=1)
    pfa.fit(X)
    Y = pfa.features_
    print(Y) #feature extracted
    column_indices = pfa.indices_ #index of the features
    print(column_indices)

C:\Users\Thund\AppData\Local\Programs\Python\Python37\python.exe C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py
      time         x         y         z  id
0        0 -0.833333  0.416667 -0.041667   0
1        1 -0.833333  0.416667 -0.041667   0
2        2 -0.833333  0.416667 -0.041667   0
3        3 -0.833333  0.416667 -0.041667   0
4        4 -0.833333  0.416667 -0.041667   0
...    ...       ...       ...       ...  ..
1337  1337 -0.833333  0.416667  0.083333   0
1338  1338 -0.833333  0.416667  0.083333   0
1339  1339 -0.916667  0.416667  0.083333   0
1340  1340 -0.958333  0.416667  0.083333   0
1341  1341 -0.958333  0.416667  0.083333   0

[1342 rows x 5 columns]
Feature Extraction: 100%|██████████| 3/3 [00:04<00:00,  1.46s/it]
C:\Users\Thund\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\decomposition\_pca.py:461: RuntimeWarning: invalid value encountered in true_divide
  explained_variance_ = (S ** 2) / (n_samples - 1)
variable  x__abs_energy  ...  z__variation_coefficient
id                       ...                          
0           1430.496338  ...                  5.521904

[1 rows x 2274 columns]
C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py:21: ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (2274). Possibly due to duplicate points in X.
  kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
[[1430.49633789   66.95824   ]]
[0, 1]

Process finished with exit code 0

I don't understand the warnings, or why it selects only the first 2 of the 2k+ features. This is what I did:

  1. Produce the covariance matrix from the original data.
  2. Compute the eigenvectors and eigenvalues of the covariance matrix using the SVD method.
  3. Those two steps combined are what you call PCA. The principal components are the eigenvectors of the covariance matrix of the original data. Then apply the K-means algorithm.
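
The first two steps can be sketched with NumPy alone. This is only an illustration on synthetic random data (the array shape and the seed are assumptions, not taken from the question); it checks that the SVD of the centered data gives the same eigenvalues as the covariance matrix, and builds a matrix shaped like A_q in the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # synthetic data (assumption): 50 samples, 4 features

# Step 1: covariance matrix of the features (rows are samples).
cov = np.cov(X, rowvar=False)

# Step 2: the SVD of the centered data yields the same eigen-decomposition.
Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_svd = S ** 2 / (X.shape[0] - 1)

# Reference: eigenvalues computed directly from the covariance matrix.
eigvals_ref = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending, so reverse

# One row per original feature, one column per principal component.
A_q = Vt.T

print(np.allclose(eigvals_svd, eigvals_ref))
print(A_q.shape)
```

Note that the division by the number of samples minus one only makes sense when there is more than one sample.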

My questions are:

  1. How can I fix the warnings it gives me?
  2. It selects only 2 features out of 2k+ features, so is something wrong?

As mentioned in the comments, the features after the fit come from the indices of the A_q matrix, which has a reduced number of features from PCA. You're getting two features instead of q features (1 in this case) because of the reshape. self.features_ should probably come from A_q instead of X.
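
One thing that may also be worth checking: the printed output shows "[1 rows x 2274 columns]", and the RuntimeWarning points at the line explained_variance_ = (S ** 2) / (n_samples - 1). The sketch below uses made-up values (not the asker's data) and plain NumPy rather than scikit-learn to show what happens to that expression when there is a single sample. It is a possible explanation of the warning, not a confirmed diagnosis.

```python
import numpy as np

# One sample, three features; the values are made up for illustration.
X = np.array([[1430.5, 66.9, 5.5]])
n_samples = X.shape[0]

# Centering a single row subtracts the row from itself, leaving only zeros.
Xc = X - X.mean(axis=0)

# The singular values of an all-zero matrix are all zero.
S = np.linalg.svd(Xc, compute_uv=False)

print(Xc)
print(S)
print(n_samples - 1)  # the denominator used in the variance formula
```

With a zero numerator and a zero denominator, the variance is undefined, which would be consistent with an "invalid value encountered" warning. If that is the cause here, fitting on several rows (for example, several recordings with distinct id values) would avoid it.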

I think the problem in your code is in the following statement:

pfa = PFA(n_features=2274, q=1)

I haven't read the paper, but you have to look at how PCA behaves on your data. If the authors set the q variable to 1, you should be able to see why q is 1.

For instance:

from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
from sklearn.decomposition import PCA

# X is the feature matrix from the question
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variance')

Note: If you are not running this in jupyter-notebook and no graph appears, call show() at the end:

from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
from matplotlib.pyplot import show
from sklearn.decomposition import PCA

# X is the feature matrix from the question
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variance')
show()

For my dataset, the result is:

[image: explained variance plotted against component index]

Now, I can say: "My q variable is 100, since PCA performs better starting with 100 components."

Can you say the same? How do you know q is 1?

Now find the q that works best for your data and see whether it solves your problem.
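
Instead of reading q off the plot by eye, one option is to pick the smallest q whose cumulative explained variance ratio reaches a chosen cutoff. The sketch below is only an illustration: the synthetic data, the 0.90 threshold, and the variable names are assumptions, and it uses plain NumPy. With scikit-learn, the same ratios are available as pca_obj.explained_variance_ratio_.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): 200 samples, 10 features with decreasing spread.
X = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

# Explained variance of each principal component, from the centered data.
Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
explained_variance = S ** 2 / (X.shape[0] - 1)
ratio = explained_variance / explained_variance.sum()

# Smallest number of components whose cumulative ratio reaches the cutoff.
threshold = 0.90  # hypothetical cutoff, tune for your data
cumulative = np.cumsum(ratio)
q = int(np.searchsorted(cumulative, threshold) + 1)

print(q)
```

The resulting q can then be passed to the PFA class in place of the hard-coded q=1, together with an n_features value smaller than the number of columns.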
