Problem getting correlation matrix from large csv file with Python / Pandas

First of all: I'm a beginner with Python and data analytics, but I'm confident I understand the concepts well enough that you don't have to over-simplify your answers.

My challenge is that I have to analyze huge chunks of machine data (time series over two years; 24 structurally identical CSV files, each with 170 columns, ~2.5 million rows, ~2.6 GB in size).

This data has to be analyzed with regard to correlations. The initially desired output is a 170x170 correlation matrix. Further analysis (lag, an asymmetrical correlation matrix Input x Output) is postponed to the next step and does not need to be the main focus of your answer.

I've been able to read one of the files into a DataFrame (using the IPython console of Spyder, at the cost of a lot of my 16 GB of memory).

import pandas as pd

df = pd.read_csv(r"C:\MyFilePath\...\TestData.csv", sep=';', encoding='iso-8859-1')

In[]: len(df.columns)
Out[]: 170

In[]: len(df)
Out[]: 2678401

But from there on I'm stuck...

The pandas.DataFrame.corr method does not work properly and returns (if it works at all) only a 10 x 10 matrix with a lot of NaN values (which, in my understanding, just indicate a non-existent Pearson correlation, i.e. close to or equal to zero).

I have found several descriptions of how to load data that exceeds my RAM into a DataFrame. Yet I was not able to fully understand the concept of loading chunks, especially in combination with my time series.

I would really appreciate it if you could provide me with a proper hint or snippet so that I can solve this problem.

Ideally, the result is that I can run over all the CSV files and get the desired correlation matrix for all parameters.

Note: I am not bound to pandas. If you suggest another library that serves this problem better, I'm happy to hear your solution. But due to my company's security policy I am obliged not to download any additional software (or, to be more precise: it is complicated...). The only other option I have at hand is MATLAB R2018a.
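
The chunked loading asked about above can be sketched with only pandas and NumPy. This is a minimal sketch under stated assumptions, not a definitive implementation: `chunked_corr` and its parameters are hypothetical names, the numeric columns are assumed to be identifiable from the first chunk, and rows with any missing value are simply dropped. For a lag-free Pearson correlation the row order does not matter, so chunk boundaries do not interfere with the time series; only a few running sums have to be kept in memory.

```python
import io

import numpy as np
import pandas as pd


def chunked_corr(sources, chunksize=100_000, **read_csv_kwargs):
    """Pearson correlation over several CSV sources, read chunk by chunk.

    Hypothetical helper: accumulates the row count, the column sums and the
    sums of cross products, so the full data never has to fit in memory.
    """
    columns = None
    n = 0
    col_sum = None
    cross_sum = None
    for src in sources:
        for chunk in pd.read_csv(src, chunksize=chunksize, **read_csv_kwargs):
            if columns is None:
                # Assumption: the first chunk reveals which columns are numeric.
                columns = chunk.select_dtypes(include="number").columns
                col_sum = np.zeros(len(columns))
                cross_sum = np.zeros((len(columns), len(columns)))
            # Coerce stray strings to NaN, then drop incomplete rows.
            num = chunk[columns].apply(pd.to_numeric, errors="coerce").dropna()
            x = num.to_numpy(dtype=float)
            n += x.shape[0]
            col_sum += x.sum(axis=0)
            cross_sum += x.T @ x
    mean = col_sum / n
    cov = cross_sum / n - np.outer(mean, mean)
    std = np.sqrt(np.diag(cov))
    # Constant columns have zero variance and end up as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = cov / np.outer(std, std)
    return pd.DataFrame(corr, index=columns, columns=columns)


# Small in-memory demonstration with two made-up "files".
rng = np.random.default_rng(0)
full = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
full["b"] += 0.5 * full["a"]
files = [
    io.StringIO(part.to_csv(sep=";", index=False))
    for part in (full.iloc[:30], full.iloc[30:])
]
result = chunked_corr(files, chunksize=10, sep=";")
print(result)
```

With the real data, `sources` would be the list of CSV paths, and `sep=';'` and `encoding='iso-8859-1'` can be passed through as in the `read_csv` call above. Accumulating raw sums of products can lose precision when the values are very large relative to their spread, so subtracting a rough per-column offset before accumulating may be worthwhile.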

Pandas df.corr gives an NxN correlation matrix, where N is the number of columns. I tried it with 200 columns and it works.

The most likely reason is that your data are not clean. If pandas finds a data point that is not acceptable for the correlation operation, it excludes that column. Try creating a DataFrame with only numbers and just one string in one of the fields and you'll see what I mean.
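
The experiment suggested above might look like the following sketch. The column names and values are made up for illustration. `select_dtypes` is used to make the exclusion explicit, because the implicit behaviour of `corr()` depends on the pandas version: older releases silently drop non-numeric columns, while pandas 2.0 and later raise an error unless `numeric_only=True` is passed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# Put a single string into an otherwise numeric column.
dirty = df.copy()
dirty["c"] = dirty["c"].astype(object)
dirty.loc[5, "c"] = "n/a"

# One stray string is enough to make the whole column non-numeric (object).
print(dirty.dtypes)

# Only the remaining numeric columns take part in the correlation.
corr = dirty.select_dtypes(include="number").corr()
print(corr.shape)  # (2, 2): column "c" is missing from the matrix
```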

If the data are not in a good state, that would also explain why there are so many NaNs. I think you have to do some cleaning and pre-processing on the data.
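
A minimal sketch of such a cleaning step, assuming that non-numeric entries should simply be treated as missing. `clean_numeric` and the sample column names are hypothetical. Columns that end up empty or constant are dropped as well, since a column with zero variance has no defined Pearson correlation and shows up as NaN in the matrix, which is not the same as a correlation near zero.

```python
import pandas as pd


def clean_numeric(df):
    """Hypothetical helper: keep only columns usable for a correlation."""
    # Anything that cannot be parsed as a number becomes NaN.
    cleaned = df.apply(pd.to_numeric, errors="coerce")
    # Drop columns without at least two distinct values (empty or constant).
    return cleaned.loc[:, cleaned.nunique() > 1]


raw = pd.DataFrame({
    "temp": [20.1, 20.5, "error", 21.0, 21.4],     # one bad reading
    "pressure": ["1.0", "1.1", "1.2", "1.3", "1.4"],  # numbers stored as text
    "status": ["ok", "ok", "ok", "ok", "ok"],      # not numeric at all
    "const": [5, 5, 5, 5, 5],                      # zero variance
})

cleaned = clean_numeric(raw)
corr = cleaned.corr()
print(corr)
```

If the semicolon-separated files also use a decimal comma (an assumption, the question does not say so), the numbers would be read as text unless `decimal=','` is passed to `read_csv`.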


 