I have to read large .csv files of around 20 MB each. Those files are tables of 8 columns and 5198 rows, and I have to do some statistics on one specific column, I. I have n different files, and this is what I am doing:
import numpy as np
import pandas as pd

stat = np.arange(n)  # indices of the n files
I = 0
for k in stat:
    df = pd.read_csv(pathS + 'run_TestRandom_%d.csv' % k, sep=' ')
    I += df['I']     # accumulate the column of interest
I = I / n            ## average over the n files (not the last index k)
This process takes 0.65 s, and I am wondering if there is a faster way.
EDIT: Apparently this is a really bad way to do it! Don't do what I did, I guess :/
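One commonly suggested speedup (not from the original post) is to have pandas parse only the column you need, via the usecols option of read_csv; a minimal sketch, assuming the same pathS, n, and file naming as above:

import pandas as pd

# Sketch only: parse just the 'I' column instead of all 8.
total = 0
for k in range(n):
    total += pd.read_csv(pathS + 'run_TestRandom_%d.csv' % k,
                         sep=' ', usecols=['I'])['I']
avg = total / n  # average over the n files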
I'm working on a similar problem right now, with a dataset of about the same size. The method I'm using is NumPy's genfromtxt:
import numpy as np

# With names=..., the result is a 1-D structured array whose
# fields (columns) are accessed by name.
ary2d = np.genfromtxt('yourfile.csv', delimiter=',', skip_header=1,
                      skip_footer=0,
                      names=['col1', 'col2', 'col3', 'col4',
                             'col5', 'col6', 'col7', 'col8'])
On my system this takes about 0.1 s in total. The one problem with it is that any non-numeric value is simply replaced by nan, which may not be what you want.
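Since the fields are named, a column can be pulled out by name, and np.nanmean (a standard NumPy function) averages it while skipping those nan's. A minimal sketch, with col5 as a placeholder for whichever column you need:

import numpy as np

ary2d = np.genfromtxt('yourfile.csv', delimiter=',', skip_header=1,
                      names=['col1', 'col2', 'col3', 'col4',
                             'col5', 'col6', 'col7', 'col8'])
col = ary2d['col5']      # 'col5' is a placeholder field name
print(np.nanmean(col))   # mean that ignores nan entries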