Selecting features in Python
I am trying to implement this algorithm: http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf
import pandas as pd
import pathlib
import gaitrec
from tsfresh import extract_features
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances


class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]

        # Project onto the first q principal components; A_q holds one
        # q-dimensional loading vector per original feature.
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T

        # Cluster the loading vectors and keep, per cluster, the feature
        # closest to the cluster centre.
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances(A_q[i, :].reshape(1, -1),
                                       cluster_centers[c, :].reshape(1, -1))[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]


p = pathlib.Path(gaitrec.__file__).parent
dataset_file = p / 'DatasetC' / 'subj_001' / 'walk0' / 'subj_0010.csv'
read_csv = pd.read_csv(dataset_file, sep=';', decimal='.', names=['time', 'x', 'y', 'z', 'id'])
read_csv['id'] = 0

if __name__ == '__main__':
    print(read_csv)
    extracted_features = extract_features(read_csv, column_id="id", column_sort="time")
    features_withno_nanvalues = extracted_features.dropna(how='all', axis=1)
    print(features_withno_nanvalues)
    X = features_withno_nanvalues.to_numpy()
    pfa = PFA(n_features=2274, q=1)
    pfa.fit(X)
    Y = pfa.features_
    print(Y)  # extracted features
    column_indices = pfa.indices_  # indices of the selected features
    print(column_indices)
C:\Users\Thund\AppData\Local\Programs\Python\Python37\python.exe C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py
time x y z id
0 0 -0.833333 0.416667 -0.041667 0
1 1 -0.833333 0.416667 -0.041667 0
2 2 -0.833333 0.416667 -0.041667 0
3 3 -0.833333 0.416667 -0.041667 0
4 4 -0.833333 0.416667 -0.041667 0
... ... ... ... ... ..
1337 1337 -0.833333 0.416667 0.083333 0
1338 1338 -0.833333 0.416667 0.083333 0
1339 1339 -0.916667 0.416667 0.083333 0
1340 1340 -0.958333 0.416667 0.083333 0
1341 1341 -0.958333 0.416667 0.083333 0
[1342 rows x 5 columns]
Feature Extraction: 100%|██████████| 3/3 [00:04<00:00, 1.46s/it]
C:\Users\Thund\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\decomposition\_pca.py:461: RuntimeWarning: invalid value encountered in true_divide
explained_variance_ = (S ** 2) / (n_samples - 1)
variable x__abs_energy ... z__variation_coefficient
id ...
0 1430.496338 ... 5.521904
[1 rows x 2274 columns]
C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py:21: ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (2274). Possibly due to duplicate points in X.
kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
[[1430.49633789 66.95824 ]]
[0, 1]
Process finished with exit code 0
That's what I did (code and output above). I don't understand the warnings, or why out of the 2k+ extracted features only the first 2 are selected. My questions are: what is causing these warnings, and why does the PFA keep only 2 features?
As mentioned in the comments, the features after the fit come from the indices of the A_q matrix, which has a reduced number of features from PCA. You're getting two features instead of q features (1 in this case) because of the reshape. self.features_ should probably come from A_q instead of X.
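A minimal sketch of where those two indices come from, using a random placeholder with the same shape as the single-row tsfresh matrix above (the data is made up; only the shapes and the warnings mirror your run):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.rand(1, 2274)         # stand-in for features_withno_nanvalues.to_numpy()

pca = PCA(n_components=1).fit(X)    # q = 1; with a single row, (n_samples - 1) == 0,
                                    # which is what raises the true_divide RuntimeWarning
A_q = pca.components_.T
print(A_q.shape)                    # (2274, 1): one PCA loading per original feature

kmeans = KMeans(n_clusters=2274).fit(A_q)   # near-duplicate loadings -> ConvergenceWarning
print(len(set(kmeans.labels_)))     # only 2 distinct clusters, hence only 2 entries in indices_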
I think the problem in your code is in the following statement:
pfa = PFA(n_features=2274, q=1)
I haven't read the paper, but you have to observe how pca behaves. If the authors set the q variable to 1, you should see why q is 1.
For instance:
from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
Note: If you are using an application other than jupyter-notebook, please add show() at the end, in case you can't see any graph:
from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
from matplotlib.pyplot import show
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
show()
For my dataset, the result is an explained-variance plot from which I can say: "My q variable is 100, since PCA performs better starting with 100 components."
Can you say the same? How do you know q is 1?
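If you would rather not eyeball the curve, one common alternative (my own addition, reusing pca_obj from the snippet above; the 0.95 threshold is arbitrary) is to take the smallest number of components that reaches a chosen share of the cumulative variance:
import numpy as np

# Smallest number of components whose cumulative explained-variance ratio reaches 95%.
cumulative = np.cumsum(pca_obj.explained_variance_ratio_)
q = int(np.searchsorted(cumulative, 0.95)) + 1
print(q, "components explain ~95% of the variance")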
Now find your best-performing q value and see if it solves your problem.
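Once you have a q that suits your data, it goes back into the PFA class from the question, with n_features smaller than the number of columns. A minimal sketch on synthetic data (the 500 samples, q_best=100 and n_features=10 are illustrative assumptions only; note that on the single-row tsfresh matrix above, PCA cannot return more than one component):
import numpy as np

# Illustrative only: synthetic data with enough samples for a larger q.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2274))

q_best = 100                        # read this off your own explained-variance plot
pfa = PFA(n_features=10, q=q_best)  # ask for 10 representative features, not one per column
pfa.fit(X)
print(pfa.indices_)                 # indices of the selected original features
print(pfa.features_.shape)          # (500, 10) here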