Summary statistics on a large csv file using Python pandas
Let's say I have a 10 GB csv file and I want to get the summary statistics of the file using the DataFrame describe method.
In that case, I first need to create a DataFrame for all 10 GB of csv data.
text_csv = pandas.read_csv("target.csv")
df = pandas.DataFrame(text_csv)
df.describe()
Does this mean all 10 GB will get loaded into memory just to calculate the statistics?
Yes, I think you are right. And you can omit df = pandas.DataFrame(text_csv), because the output of read_csv is already a DataFrame:
import pandas as pd
df = pd.read_csv("target.csv")
print(df.describe())
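If the file only barely exceeds available memory, read_csv can also be told to load less in the first place via its usecols and dtype parameters. A minimal sketch, assuming (hypothetically) that only two numeric columns named a and b are needed and that 32-bit floats are precise enough; the inline toy data stands in for target.csv:

```python
import io
import pandas as pd

# Toy data standing in for target.csv; after testing, replace
# io.StringIO(temp) with the filename.
temp = u"""a,b,c
1,525,x
2,527,y
3,519,z"""
# Assumption: column c is not needed for the statistics, and
# float32 halves the memory of the numeric columns vs float64.
df = pd.read_csv(io.StringIO(temp),
                 usecols=["a", "b"],
                 dtype={"a": "float32", "b": "float32"})
print(df.describe())
```

This does not change the one-pass-per-file cost, but skipping unused columns and narrowing dtypes can cut the in-memory footprint by a large factor.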
Or you can use dask:
import dask.dataframe as dd
df = dd.read_csv('target.csv')
print(df.describe().compute())  # dask is lazy; compute() materializes the result
You can use the chunksize parameter of read_csv, but then the output is a TextFileReader, not a DataFrame, so you need concat:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
#after testing replace io.StringIO(temp) with the filename
#chunksize = 2 for testing
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
<pandas.io.parsers.TextFileReader object at 0x000000001995ADA0>
df = pd.concat(tp, ignore_index=True)
print(df.describe())
a b
count 15.000000 15.000000
mean 3.333333 527.600000
std 1.877181 5.082182
min 1.000000 519.000000
25% 2.000000 524.500000
50% 3.000000 528.000000
75% 5.000000 531.500000
max 6.000000 535.000000
You can convert the TextFileReader to DataFrames chunk by chunk, but aggregating this output can be difficult:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
#after testing replace io.StringIO(temp) with the filename
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
dfs = []
for t in tp:
    df = pd.DataFrame(t)
    df1 = df.describe()
    dfs.append(df1.T)
df2 = pd.concat(dfs)
print(df2)
count mean std min 25% 50% 75% max
a 2 1.0 0.000000 1 1.00 1.0 1.00 1
b 2 525.5 0.707107 525 525.25 525.5 525.75 526
a 2 1.5 0.707107 1 1.25 1.5 1.75 2
b 2 530.0 4.242641 527 528.50 530.0 531.50 533
a 2 2.0 0.000000 2 2.00 2.0 2.00 2
b 2 530.0 2.828427 528 529.00 530.0 531.00 532
a 2 3.0 0.000000 3 3.00 3.0 3.00 3
b 2 526.5 10.606602 519 522.75 526.5 530.25 534
a 2 3.5 0.707107 3 3.25 3.5 3.75 4
b 2 532.5 3.535534 530 531.25 532.5 533.75 535
a 2 5.0 0.000000 5 5.00 5.0 5.00 5
b 2 530.0 1.414214 529 529.50 530.0 530.50 531
a 2 6.0 0.000000 6 6.00 6.0 6.00 6
b 2 520.5 0.707107 520 520.25 520.5 520.75 521
a 1 6.0 NaN 6 6.00 6.0 6.00 6
b 1 524.0 NaN 524 524.00 524.0 524.00 524
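Some of the per-chunk describe output above can still be combined exactly: counts sum, means combine as a count-weighted average, and min/max reduce directly. The standard deviation is the awkward part, since per-chunk std values alone are not enough to recover the global std (that needs per-chunk sums of squares). A sketch of the combinable part, rebuilding df2 as in the example above:

```python
import io
import pandas as pd

temp = u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
# One describe() row per chunk and column, stacked as in the answer above.
df2 = pd.concat([t.describe().T for t in tp])

# Rows for the same column share an index label, so group on the index:
g = df2.groupby(df2.index)
agg = pd.DataFrame({
    "count": g["count"].sum(),
    # the global mean is the count-weighted average of the chunk means
    "mean": g.apply(lambda x: (x["count"] * x["mean"]).sum() / x["count"].sum()),
    "min": g["min"].min(),
    "max": g["max"].max(),
})
print(agg)
```

The resulting count, mean, min and max match the single-pass df.describe() output exactly; std is deliberately omitted because it cannot be pooled from the chunk stds without additional per-chunk information.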
There seems to be no file-size limitation for the pandas.read_csv method itself.
According to @fickludd's and @Sebastian Raschka's answers in Large, persistent DataFrame in pandas, you can use iterator=True and chunksize=xxx to load the giant csv file chunk by chunk and calculate the statistics you want:
import pandas as pd
reader = pd.read_csv('some_data.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable in chunks of 1000 rows
partial_descs = [chunk.describe() for chunk in reader]
Then aggregate all the partial describe information yourself.
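One way to do that aggregation exactly is to keep running totals (count, sum, sum of squares, min, max) per chunk instead of per-chunk describe output. The sum-of-squares route can lose precision when the mean is large relative to the spread, but for well-scaled data it reproduces describe's count, mean, std, min and max while holding only one chunk in memory at a time. A sketch, reusing the toy data from the examples above:

```python
import io
import pandas as pd

temp = u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# Running totals per column, updated one chunk at a time.
count = sumx = sumsq = cmin = cmax = None
for chunk in pd.read_csv(io.StringIO(temp), sep=";", chunksize=2):
    if count is None:
        count, sumx, sumsq = chunk.count(), chunk.sum(), (chunk ** 2).sum()
        cmin, cmax = chunk.min(), chunk.max()
    else:
        count += chunk.count()
        sumx += chunk.sum()
        sumsq += (chunk ** 2).sum()
        cmin = pd.concat([cmin, chunk.min()], axis=1).min(axis=1)
        cmax = pd.concat([cmax, chunk.max()], axis=1).max(axis=1)

mean = sumx / count
# sample standard deviation, matching DataFrame.std (ddof=1)
std = ((sumsq - count * mean ** 2) / (count - 1)) ** 0.5
print(pd.DataFrame({"count": count, "mean": mean, "std": std,
                    "min": cmin, "max": cmax}))
```

On the 15-row toy data this reproduces the describe() table shown earlier (mean b = 527.6, std b ≈ 5.082182), and the same loop works unchanged on a 10 GB file since only chunksize rows are resident at once.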