

fastest way in python to read csv, process each line, and write a new csv

What is the fastest way to process each line of a CSV and write it to a new CSV? Is there a way to use the least memory while also being the fastest? Please see the following code. It requests a CSV from an API, but it takes a very long time to go through the for loop I commented. I also think it is using all the memory on my server.

from pandas import *
import csv
import requests

reportResult = requests.get(api,headers=header)
csvReader = csv.reader(utf_8_encoder(reportResult.text))
reportData = []
#for loop takes a long time
for row in csvReader:
  combinedDict  = dict(zip(fields, row))
  combinedDict = cleanDict(combinedDict)
  reportData.append(combinedDict)
reportDF = DataFrame(reportData, columns = fields)
reportDF.to_csv('report.csv',sep=',',header=False,index=False)



def utf_8_encoder(unicode_csv_data):
  for line in unicode_csv_data:
    yield line.encode('utf-8')



def cleanDict(combinedDict):
  if combinedDict.get('a_id', None) is not None:
    combinedDict['a_id'] = int(
        float(combinedDict['a_id']))
    combinedDict['unique_a_id'] = ('1_a_'+
           str(combinedDict['a_id']))
  if combinedDict.get('i_id', None) is not None:
    combinedDict['i_id'] = int(
        float(combinedDict['i_id']))
    combinedDict['unique_i_id'] = ('1_i_'+
         str(combinedDict['i_id']))
  if combinedDict.get('pm', None) is not None:
    combinedDict['pm'] = "{0:.10f}".format(float(combinedDict['pm']))
  if combinedDict.get('s', None) is not None:
    combinedDict['s'] = "{0:.10f}".format(float(combinedDict['s']))
  return combinedDict 

When I run the Python memory profiler, why does the line with the for loop show a memory increment? Is the for loop itself keeping something in memory, or is my UTF-8 converter messing something up?

Line #    Mem usage    Increment   Line Contents
================================================
   162 1869.254 MiB 1205.824 MiB     for row in csvReader:
   163                                 #print row
   164 1869.254 MiB    0.000 MiB       combinedDict  = dict(zip(fields, row))

When I put the "@profile" decorator on the utf_8_encoder function as well, I see that the memory increment on the above for loop disappears:

   163                               for row in csvReader:

But now there is memory attributed to the converter's for loop (I didn't let it run as long as last time, so it only got to 56 MB before I hit Ctrl+C):

Line #    Mem usage    Increment   Line Contents
================================================
   154  663.430 MiB    0.000 MiB   @profile
   155                             def utf_8_encoder(unicode_csv_data):
   156  722.496 MiB   59.066 MiB     for line in unicode_csv_data:
   157  722.496 MiB    0.000 MiB       yield line.encode('utf-8')
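
For reference, this is roughly how I am collecting these numbers with memory_profiler (the script and function below are just a stand-in for my real code):

# profile_demo.py -- run with: python -m memory_profiler profile_demo.py
# (assumes the memory_profiler package is installed; build_rows is only a
#  stand-in for the real CSV loop above)
from memory_profiler import profile

@profile
def build_rows():
    rows = []
    for i in range(100000):
        rows.append({'a_id': i, 'pm': "{0:.10f}".format(i / 3.0)})
    return rows

if __name__ == '__main__':
    build_rows()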

I found that using DataFrames to read the CSV is much faster and doesn't use so much memory that my server crashes:

from cStringIO import StringIO
from pandas import *

reportText = StringIO(reportResult.text)
reportDF = read_csv(reportText, sep=',',parse_dates=False)
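
(Note: cStringIO is Python 2 only. On Python 3 the same approach would presumably use io.StringIO instead; a sketch, not the exact code I ran:)

# Python 3 sketch: wrap the API response text in io.StringIO
# (reportResult is the same requests response object as above)
import io
import pandas as pd

reportText = io.StringIO(reportResult.text)
reportDF = pd.read_csv(reportText, sep=',', parse_dates=False)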

Then I am able to process it using apply, for example:

def trimFloat(fl):
    if fl is not None:
      res = "{0:.10f}".format(float(fl))
      return res
    else:
      return None

floatCols  = ['a', 'b ']
for col in floatCols:
    reportDF[col] = reportDF[col].apply(trimFloat)


def removePct(reportDF):
  reportDF['c'] = reportDF['c'].apply(lambda x: x.translate(None, '%'))
  return reportDF
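
If the goal is just fixed-precision floats and stripped percent signs, a vectorized sketch (assuming the same DataFrame and column names as above) avoids calling a Python function once per row and lets to_csv handle the formatting:

# Vectorized sketch, assuming the same columns as above: keep the float
# columns numeric and format them once at write time.
reportDF['c'] = reportDF['c'].str.replace('%', '')
reportDF[floatCols] = reportDF[floatCols].astype(float)
reportDF.to_csv('report.csv', sep=',', index=False, float_format='%.10f')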

I suspect the major issue with the previous attempt had something to do with the UTF-8 encoder.

For starters, you should use izip from itertools. See below.

from itertools import izip

reportData = []
for row in csvReader:
    combinedDict  = dict(izip(fields, row))
    combinedDict = cleanDict(combinedDict)  # cleanDict() is probably where the bottleneck is
    reportData.append(combinedDict)

izip is a generator version of zip, so it has a lower memory impact. You probably won't see much of a gain, though, since it looks like you're zipping one item at a time. I would take a look at your cleanDict() function. It has tons of if statements to evaluate, and that takes time. Lastly, if you are really pressed for more speed and can't figure out where to get it from, check out

from concurrent.futures import ProcessPoolExecutor

or, in other words, take a look at parallel processing: https://docs.python.org/3/library/concurrent.futures.html
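
A minimal sketch of what that could look like: split the rows into chunks and clean each chunk in a worker process (CHUNK and clean_rows are illustrative names, not from the original code):

# Illustrative parallel-processing sketch. The worker function must live at
# module level so the executor can pickle it; fields and csvReader are the
# same objects as in the question.
from concurrent.futures import ProcessPoolExecutor

CHUNK = 10000

def clean_rows(rows):
    # stand-in for the per-row work done by cleanDict()
    return [dict(zip(fields, row)) for row in rows]

rows = list(csvReader)
chunks = [rows[i:i + CHUNK] for i in range(0, len(rows), CHUNK)]

reportData = []
with ProcessPoolExecutor() as executor:
    for cleaned in executor.map(clean_rows, chunks):
        reportData.extend(cleaned)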

Also, please take a look at the PEP 8 guidelines for Python: https://www.python.org/dev/peps/pep-0008/ Your indentation is wrong. All indentation should be 4 spaces. If nothing else, it helps with readability.
