
How could I change my Python code so that it could convert txt file to CSV more efficiently?

I recently learned Python and started writing code to read and clean data. In the code below, I am trying to read about 200 txt files of roughly 200 MB each, all using the '|' delimiter, and merge them into a single CSV file with one specific change: the source files contain negative numbers where the minus sign sits at the end of the number, e.g. 221.36- or 111-, and I need to convert these to -221.36 and -111.

At the moment it takes about 100 minutes to process 80 million records. Since this is only the second or third piece of code I have written in Python, I am looking for your input on how to optimize it. Any best practices you could suggest before it goes into production would be a great help.

from tempfile import NamedTemporaryFile
import shutil
import csv
import glob

# List out all files that need to be used as input
list_of_input_files = (glob.glob("C:/Users/datafolder/pattern*"))
with open('C:/Users/datafolder/tempfile.txt','wb') as wfd:
    for f in list_of_input_files:
        with open(f,'rb') as fd:
            shutil.copyfileobj(fd, wfd, 1024*1024*10)
print('File Merge Complete')

# Create temporary files for processing
txt_file = "C:/Users/datafolder/tempfile.txt"
csv_file = "C:/Users/datafolder/mergedcsv.csv"

# Write CSV file after reading data from a txt file. Converts delimiter from '|' to ','
with open(txt_file,'r', encoding='utf-8') as file_pipe:
    with open(csv_file, 'w', encoding='utf-8', newline='') as file_comma: # newline parameter to avoid extra blank lines in the final file
        csv.writer(file_comma, delimiter=',').writerows(csv.reader(file_pipe, delimiter='|'))

print('CSV File Created.')

tempfile = NamedTemporaryFile(mode='w', encoding='utf-8', delete=False)

# Data Definition
fields = ['Field 1','Field 2','Field 3','Field 4','Field 5','Field 6','Field 7','Field 8','Field 9','Field 10','Field 11','Field 12','Field 13','Field 14','Field 15','Field 16','Field 17','Field 18','Field 19','Field 20']

count=0
# Open files in read and write modes for data processing
with open(csv_file, 'r', encoding='utf-8') as csvfile, tempfile:
    reader = csv.DictReader(csvfile, fieldnames=fields) #Using a Python dictionary to read and write data into a CSV file.
    writer = csv.DictWriter(tempfile, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        if count < 1000000:
            if row['Field 10'].endswith('-'):
                row['Field 10'] = float(row['Field 10'].replace('-',''))*(-1) # Trims - sign from the end, converts the target field to float and makes it negative
            count = count + 1
        else:
            print('1 Million records Processed')
            count = 0

        # Creating a row for final write
        row = {'Field 1' : row['Field 1'],'Field 2' : row['Field 2'],'Field 3' : row['Field 3'],'Field 4' : row['Field 4'],'Field 5' : row['Field 5'],'Field 6' : row['Field 6'],'Field 7' : row['Field 7'],'Field 8' : row['Field 8'],'Field 9' : row['Field 9'],'Field 10' : row['Field 10'],'Field 11' : row['Field 11'],'Field 12' : row['Field 12'],'Field 13' : row['Field 13'],'Field 14' : row['Field 14'],'Field 15' : row['Field 15'],'Field 16' : row['Field 16'],'Field 17' : row['Field 17'],'Field 18' : row['Field 18'],'Field 19' : row['Field 19'],'Field 20' : row['Field 20']}
        writer.writerow(row) # Write the row to the CSV file
    print('Data write to CSV file complete')

# Renaming the newly created temp file as the final file. New file now has fully processed data.            
shutil.move(tempfile.name, csv_file)
print('Renaming Complete')

What is the reason behind using csv.DictReader rather than csv.reader? Using csv.reader would let you access the row data by index instead of by keys such as 'Field 10':

if row[9].endswith('-'):
    row[9]=float(row[9].replace('-',''))*(-1)

This would also eliminate the need for the code on line 51 (the dictionary that rebuilds the row), because you could simply call writer.writerow(row) with the row you already have, since it is already a plain list.
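
To make that concrete, here is a minimal sketch of the processing loop using csv.reader and csv.writer instead of the Dict variants. The input path and the position of 'Field 10' (index 9) are taken from your script; the output path is just a placeholder:

import csv

csv_file = "C:/Users/datafolder/mergedcsv.csv"   # pipe-converted file from the original script
out_file = "C:/Users/datafolder/processed.csv"   # hypothetical output path

with open(csv_file, 'r', encoding='utf-8', newline='') as src, \
     open(out_file, 'w', encoding='utf-8', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # 'Field 10' is the 10th column, i.e. index 9
        if row[9].endswith('-'):
            row[9] = float(row[9].replace('-', '')) * -1
        writer.writerow(row)  # row is already a list, no need to rebuild it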

Using csv.reader would also allow another small optimization. Right now you check count < 1000000 on every pass through the loop, and you also increment the count variable. Instead, you could do something like this:

import itertools

row_count = sum(1 for row in reader)  # counting consumes the reader...
csvfile.seek(0)                       # ...so rewind the underlying file before iterating again
if row_count >= 1000000:
    row_count = 1000000

for row in itertools.islice(reader, row_count):
    ...  # trim logic goes here

if row_count == 1000000:
    print('1 Million records Processed')

This removes the conditional check and the increment of the count variable, which over 80 million iterations can add up to a real saving in time.

Without looking at the code too closely, I suspect you could reduce the memory footprint by reading the file line by line with a generator instead of reading it all in at once.

It would look something like this:

for line in open('really_big_file.csv'):
    do_something(line)

Also, csv.DictReader is probably slower than other ways of reading the file, simply because it has to build a hash table (dictionary) for every row. Reading each row into a tuple would likely be faster.
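
If you want to verify that on your own data, a rough timing comparison along these lines could help; sample.csv and the repeat count are placeholders, not anything from the original post:

import csv
import timeit

SAMPLE = 'sample.csv'  # hypothetical test file with the same layout as the real data

def with_reader():
    with open(SAMPLE, newline='') as f:
        return [tuple(row) for row in csv.reader(f)]

def with_dictreader():
    with open(SAMPLE, newline='') as f:
        return list(csv.DictReader(f))

print('csv.reader:    ', timeit.timeit(with_reader, number=10))
print('csv.DictReader:', timeit.timeit(with_dictreader, number=10))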

I would suggest giving pandas and its read_csv() function a try. You can tell it to split on '|', then modify the column values to fix the negative signs, and finally write the new CSV file out with to_csv().
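
A minimal sketch of that idea, assuming the merged '|'-delimited file and the field names from the question; the chunk size and the helper function are my own additions:

import pandas as pd

txt_file = "C:/Users/datafolder/tempfile.txt"   # merged pipe-delimited input from the original script
csv_file = "C:/Users/datafolder/mergedcsv.csv"  # final output path from the original script
fields = ['Field %d' % i for i in range(1, 21)]  # same 20 field names as the original script

def fix_trailing_minus(value):
    # Convert '221.36-' style strings to -221.36; leave everything else untouched
    s = str(value).strip()
    return -float(s[:-1]) if s.endswith('-') else value

first_chunk = True
# Reading in chunks keeps memory bounded even for ~40 GB of input
for chunk in pd.read_csv(txt_file, sep='|', header=None, names=fields, dtype=str, chunksize=1000000):
    chunk['Field 10'] = chunk['Field 10'].map(fix_trailing_minus)
    chunk.to_csv(csv_file, mode='w' if first_chunk else 'a', header=first_chunk, index=False)
    first_chunk = False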

