Problem getting a correlation matrix from large csv files using Python / Pandas

Question

First of all: I'm a beginner with Python and data analytics, but I'm confident I understand the concepts well enough that you don't have to over-simplify your answers.

My challenge is that I have to analyze huge chunks of machine data (time series over two years; 24 structurally identical csv files, each with 170 columns, ~2.5 million rows, ~2.6 GB in size).

This data has to be analyzed with regard to correlations. The initially desired output is a 170x170 correlation matrix. Further analysis (lag, an asymmetrical correlation matrix Input x Output) is postponed to the next step and need not be the focus of your answer.

I've been able to read one of the files into a dataframe (using the IPython console of Spyder, at the cost of a large share of my 16 GB of memory).

import pandas as pd

df = pd.read_csv(r"C:\MyFilePath\...\TestData.csv", sep=';', encoding='iso-8859-1')

In[]: len(df.columns)
Out[]: 170

In[]: len(df)
Out[]: 2678401

But from there on I'm stuck...

The pandas.DataFrame.corr method does not work properly and returns (if it works at all) only a 10 x 10 matrix with a lot of NaN values (which, in my understanding, just indicate a non-existent Pearson correlation, i.e. one close to or equal to zero).

I have found several descriptions of how to load data that exceeds my RAM into a dataframe. Yet I was not able to fully understand the concept of loading chunks, especially in combination with my time series.

I would really appreciate it if you could provide me with a proper hint or snippet so that I can solve this problem.

Ideally, the result is that I can run over all the csv files and get the desired correlation matrix for all parameters.

Note: I am not bound to pandas. If you suggest another library that serves this problem better, I'm happy to hear your solution. But due to my company's security policy I am obliged not to download any additional software (or, to be more precise: it is complicated...). The only other option I have at hand is MATLAB R2018a.

Answer 1

Pandas df.corr gives an NxN correlation matrix, where N is the number of columns. I tried it with 200 columns and it works.

The most likely reason is that your data are not clean. If pandas finds a data point that is not acceptable for the correlation operation, it excludes that column. Try creating a dataframe with only numbers and just one string in one of the fields and you'll see what I mean.
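The experiment suggested above could look roughly like the sketch below. The tiny inline csv and its column names are made up for illustration. Note one version difference: older pandas releases drop non-numeric columns from df.corr silently, while pandas 2.0 and later expect you to pass numeric_only=True or select the numeric columns yourself, which is what the sketch does so that it behaves the same either way.

```python
import io

import pandas as pd

# Hypothetical miniature file: column "c" contains one stray string ("x").
csv_text = "a;b;c\n1;2.0;3\n2;1.5;x\n3;0.5;1\n4;0.1;2\n"
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Because of the single string, "c" is not parsed as a numeric column.
print(df.dtypes)

# Only the numeric columns take part in the correlation.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

Running `df.dtypes` on the real data should show which of the 170 columns were not parsed as numbers. If the files use a decimal comma (an assumption, but common together with `sep=';'`), the `decimal=','` argument of `read_csv` may already fix most of them.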

If the data are not in a good state, that would also explain why there are so many NaNs. I think you have to do some cleaning and pre-processing on the data.
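As for the cleaning and the memory limit, one possible approach (a sketch, not a tested solution for your files) is to read each file in chunks, coerce everything to numbers, and accumulate the sums that the Pearson formula needs. For a plain, zero-lag correlation the order of the rows does not matter, so chunking does not conflict with the time series. The function name `chunked_corr` and the default chunk size are my own choices. NaNs are handled pairwise, which is also what df.corr does. A NaN in the result usually means a column is constant or has no valid overlapping values, not that the correlation is close to zero.

```python
import numpy as np
import pandas as pd


def chunked_corr(paths, chunksize=100_000, **read_csv_kwargs):
    """Pairwise-complete Pearson correlation over several csv files, read in chunks."""
    cols = None
    for path in paths:
        for chunk in pd.read_csv(path, chunksize=chunksize, **read_csv_kwargs):
            # Cleaning step: anything that is not a number (stray strings,
            # timestamps, ...) becomes NaN.
            num = chunk.apply(pd.to_numeric, errors="coerce")
            if cols is None:
                cols = num.columns
                k = len(cols)
                # k x k accumulators: pair counts, sums, sums of squares, cross products.
                n, sx, sxx, sxy = (np.zeros((k, k)) for _ in range(4))
            x = num[cols].to_numpy(dtype=float)
            valid = ~np.isnan(x)
            m = valid.astype(float)
            x0 = np.where(valid, x, 0.0)
            n += m.T @ m            # rows where both columns i and j are valid
            sx += x0.T @ m          # sum of x_i over those rows
            sxx += (x0 ** 2).T @ m  # sum of x_i**2 over those rows
            sxy += x0.T @ x0        # sum of x_i * x_j over those rows
    if cols is None:
        raise ValueError("no data was read")
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = n * sxy - sx * sx.T
        var_i = n * sxx - sx ** 2
        corr = cov / np.sqrt(var_i * var_i.T)
    return pd.DataFrame(corr, index=cols, columns=cols)


# Usage sketch (the file list is hypothetical, fill in your own 24 paths):
# corr = chunked_corr(list_of_csv_paths, sep=";", encoding="iso-8859-1")
```

Only four 170x170 arrays are kept between chunks, so memory use is governed by the chunk size rather than the file size. Accumulating raw sums of squares can lose precision when values are large compared with their spread. If that turns out to matter, subtracting a rough per-column offset before accumulating is a common remedy.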

Answer 1: 0, accepted, 2019-02-15 16:49:09
