Summary statistics on a large csv file using Python pandas
Let's say I have a 10 GB csv file and I want to get the summary statistics of the file using the DataFrame describe method.
In that case, I first need to create a DataFrame for all 10 GB of csv data.
text_csv = pandas.read_csv("target.csv")
df = pandas.DataFrame(text_csv)
df.describe()
Does this mean all 10 GB will get loaded into memory just to calculate the statistics?
Yes, I think you are right. And you can omit df = pandas.DataFrame(text_csv), because the output of read_csv is already a DataFrame:
import pandas as pd
df = pd.read_csv("target.csv")
print(df.describe())
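If the file only barely exceeds available memory, read_csv can also be told to load less in the first place via its usecols and dtype parameters. A minimal sketch, assuming (hypothetically) that only two numeric columns named a and b are needed and that 32-bit floats are precise enough; the inline toy data stands in for target.csv:

```python
import io
import pandas as pd

# Toy data standing in for target.csv; after testing, replace
# io.StringIO(temp) with the filename.
temp = u"""a,b,c
1,525,x
2,527,y
3,519,z"""
# Assumption: column c is not needed for the statistics, and
# float32 halves the memory of the numeric columns vs float64.
df = pd.read_csv(io.StringIO(temp),
                 usecols=["a", "b"],
                 dtype={"a": "float32", "b": "float32"})
print(df.describe())
```

This does not change the one-pass-per-file cost, but skipping unused columns and narrowing dtypes can cut the in-memory footprint by a large factor.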
Or you can use dask:
import dask.dataframe as dd
df = dd.read_csv('target.csv')
print(df.describe().compute())  # dask is lazy; compute() materializes the result
You can use the chunksize parameter of read_csv, but then the output is a TextFileReader, not a DataFrame, so you need concat:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
#after testing replace io.StringIO(temp) with the filename
#chunksize = 2 for testing
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
<pandas.io.parsers.TextFileReader object at 0x000000001995ADA0>
df = pd.concat(tp, ignore_index=True)
print(df.describe())
a b
count 15.000000 15.000000
mean 3.333333 527.600000
std 1.877181 5.082182
min 1.000000 519.000000
25% 2.000000 524.500000
50% 3.000000 528.000000
75% 5.000000 531.500000
max 6.000000 535.000000
You can convert the TextFileReader to DataFrames chunk by chunk, but aggregating this output can be difficult:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
#after testing replace io.StringIO(temp) with the filename
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
dfs = []
for t in tp:
    df = pd.DataFrame(t)
    df1 = df.describe()
    dfs.append(df1.T)
df2 = pd.concat(dfs)
print(df2)
count mean std min 25% 50% 75% max
a 2 1.0 0.000000 1 1.00 1.0 1.00 1
b 2 525.5 0.707107 525 525.25 525.5 525.75 526
a 2 1.5 0.707107 1 1.25 1.5 1.75 2
b 2 530.0 4.242641 527 528.50 530.0 531.50 533
a 2 2.0 0.000000 2 2.00 2.0 2.00 2
b 2 530.0 2.828427 528 529.00 530.0 531.00 532
a 2 3.0 0.000000 3 3.00 3.0 3.00 3
b 2 526.5 10.606602 519 522.75 526.5 530.25 534
a 2 3.5 0.707107 3 3.25 3.5 3.75 4
b 2 532.5 3.535534 530 531.25 532.5 533.75 535
a 2 5.0 0.000000 5 5.00 5.0 5.00 5
b 2 530.0 1.414214 529 529.50 530.0 530.50 531
a 2 6.0 0.000000 6 6.00 6.0 6.00 6
b 2 520.5 0.707107 520 520.25 520.5 520.75 521
a 1 6.0 NaN 6 6.00 6.0 6.00 6
b 1 524.0 NaN 524 524.00 524.0 524.00 524
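Some of the per-chunk describe output above can still be combined exactly: counts sum, means combine as a count-weighted average, and min/max reduce directly. The standard deviation is the awkward part, since per-chunk std values alone are not enough to recover the global std (that needs per-chunk sums of squares). A sketch of the combinable part, rebuilding df2 as in the example above:

```python
import io
import pandas as pd

temp = u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
# One describe() row per chunk and column, stacked as in the answer above.
df2 = pd.concat([t.describe().T for t in tp])

# Rows for the same column share an index label, so group on the index:
g = df2.groupby(df2.index)
agg = pd.DataFrame({
    "count": g["count"].sum(),
    # the global mean is the count-weighted average of the chunk means
    "mean": g.apply(lambda x: (x["count"] * x["mean"]).sum() / x["count"].sum()),
    "min": g["min"].min(),
    "max": g["max"].max(),
})
print(agg)
```

The resulting count, mean, min and max match the single-pass df.describe() output exactly; std is deliberately omitted because it cannot be pooled from the chunk stds without additional per-chunk information.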
There seems to be no file-size limitation for the pandas.read_csv method itself.
According to @fickludd's and @Sebastian Raschka's answers in Large, persistent DataFrame in pandas, you can use iterator=True and chunksize=xxx to load the giant csv file chunk by chunk and calculate the statistics you want:
import pandas as pd
reader = pd.read_csv('some_data.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable in chunks of 1000 rows
partial_descs = [chunk.describe() for chunk in reader]
Then aggregate all the partial describe information yourself.
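One way to do that aggregation exactly is to keep running totals (count, sum, sum of squares, min, max) per chunk instead of per-chunk describe output. The sum-of-squares route can lose precision when the mean is large relative to the spread, but for well-scaled data it reproduces describe's count, mean, std, min and max while holding only one chunk in memory at a time. A sketch, reusing the toy data from the examples above:

```python
import io
import pandas as pd

temp = u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# Running totals per column, updated one chunk at a time.
count = sumx = sumsq = cmin = cmax = None
for chunk in pd.read_csv(io.StringIO(temp), sep=";", chunksize=2):
    if count is None:
        count, sumx, sumsq = chunk.count(), chunk.sum(), (chunk ** 2).sum()
        cmin, cmax = chunk.min(), chunk.max()
    else:
        count += chunk.count()
        sumx += chunk.sum()
        sumsq += (chunk ** 2).sum()
        cmin = pd.concat([cmin, chunk.min()], axis=1).min(axis=1)
        cmax = pd.concat([cmax, chunk.max()], axis=1).max(axis=1)

mean = sumx / count
# sample standard deviation, matching DataFrame.std (ddof=1)
std = ((sumsq - count * mean ** 2) / (count - 1)) ** 0.5
print(pd.DataFrame({"count": count, "mean": mean, "std": std,
                    "min": cmin, "max": cmax}))
```

On the 15-row toy data this reproduces the describe() table shown earlier (mean b = 527.6, std b ≈ 5.082182), and the same loop works unchanged on a 10 GB file since only chunksize rows are resident at once.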