How can I change my Python code so that it converts txt files to CSV more efficiently?
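I recently learned Python and have started writing code that reads and cleans data. In the code below, I am trying to read about 200 txt files of roughly 200 MB each, delimited by |, and merge them into a single CSV file with one specific change. The source files contain negative numbers with the minus sign at the end of the number, e.g. 221.36-, 111-, and so on. I need to convert them to -221.36 and -111.
It currently takes about 100 minutes to process 80 million records. Since this is only the second or third piece of code I have written in Python, I am looking for your advice on how to optimize it. Any best practices you can suggest before I put this into production would be a great help.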
from tempfile import NamedTemporaryFile
import shutil
import csv
import glob
# List all files that need to be used as input
list_of_input_files = glob.glob("C:/Users/datafolder/pattern*")
with open('C:/Users/datafolder/tempfile.txt','wb') as wfd:
    for f in list_of_input_files:
        with open(f,'rb') as fd:
            shutil.copyfileobj(fd, wfd, 1024*1024*10)
print('File Merge Complete')
# Create temporary files for processing
txt_file = "C:/Users/datafolder/tempfile.txt"
csv_file = "C:/Users/datafolder/mergedcsv.csv"
# Write CSV file after reading data from a txt file. Converts delimiter from '|' to ','
with open(txt_file,'r', encoding='utf-8') as file_pipe:
    with open(csv_file, 'w', encoding='utf-8', newline='') as file_comma: # newline='' avoids blank lines in the final file
        csv.writer(file_comma, delimiter=',').writerows(csv.reader(file_pipe, delimiter='|'))
print('CSV File Created.')
tempfile = NamedTemporaryFile(mode='w', encoding='utf-8', delete=False)
# Data Definition
fields = ['Field 1','Field 2','Field 3','Field 4','Field 5','Field 6','Field 7','Field 8','Field 9','Field 10','Field 11','Field 12','Field 13','Field 14','Field 15','Field 16','Field 17','Field 18','Field 19','Field 20']
count=0
# Open files in read and write modes for data processing
with open(csv_file, 'r', encoding='utf-8') as csvfile, tempfile:
    reader = csv.DictReader(csvfile, fieldnames=fields) # Using a Python dictionary to read and write data into a CSV file.
    writer = csv.DictWriter(tempfile, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        if count < 1000000:
            if row['Field 10'].endswith('-'):
                row['Field 10']=float(row['Field 10'].replace('-',''))*(-1) # Strips the trailing '-' sign from the field, converts it to float and makes it negative
            count=count+1
        else:
            print('1 Million records Processed')
            count=0
        # Creating a row for final write
        row={'Field 1' : row['Field 1'],'Field 2' : row['Field 2'],'Field 3' : row['Field 3'],'Field 4' : row['Field 4'],'Field 5' : row['Field 5'],'Field 6' : row['Field 6'],'Field 7' : row['Field 7'],'Field 8' : row['Field 8'],'Field 9' : row['Field 9'],'Field 10' : row['Field 10'],'Field 11' : row['Field 11'],'Field 12' : row['Field 12'],'Field 13' : row['Field 13'],'Field 14' : row['Field 14'],'Field 15' : row['Field 15'],'Field 16' : row['Field 16'],'Field 17' : row['Field 17'],'Field 18' : row['Field 18'],'Field 19' : row['Field 19'],'Field 20' : row['Field 20']}
        writer.writerow(row) # Write the row to the CSV file
print('Data write to CSV file complete')
# Renaming the newly created temp file as the final file. New file now has fully processed data.
shutil.move(tempfile.name, csv_file)
print('Renaming Complete')
What is the reason for using csv.DictReader instead of csv.reader? Using csv.reader would let you access the row data by index instead of by keys such as 'Field 10':
if row[9].endswith('-'):
    row[9]=float(row[9].replace('-',''))*(-1)
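This would remove the need for the line that rebuilds the row dictionary before each write. With a plain csv.writer in place of csv.DictWriter, you can simply call writer.writerow(row) with the row you already have, because csv.reader yields each row as a list that the writer accepts directly.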
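To make the index-based approach concrete, here is a minimal sketch of a single pass that reads pipe-delimited rows with csv.reader, fixes the trailing minus, and writes comma-delimited output. The names `fix_trailing_minus` and `convert_file` are hypothetical, the column position 9 is assumed from 'Field 10' in the question, and moving the sign as a string (rather than going through float) is a swapped-in choice for illustration, not what the question's code does.

```python
import csv

AMOUNT_INDEX = 9  # zero-based position of 'Field 10' (assumed from the question)


def fix_trailing_minus(value):
    """Move a trailing minus sign to the front: '221.36-' -> '-221.36'."""
    if value.endswith('-'):
        return '-' + value[:-1]
    return value


def convert_file(src, dst, amount_index=AMOUNT_INDEX):
    """Stream pipe-delimited rows from src to comma-delimited dst.

    src and dst are open text file objects (hypothetical helper).
    """
    reader = csv.reader(src, delimiter='|')
    writer = csv.writer(dst, delimiter=',', lineterminator='\n')
    for row in reader:
        # Each row is a list, so it can be edited in place and written as-is
        row[amount_index] = fix_trailing_minus(row[amount_index])
        writer.writerow(row)
```

When calling this on real files, open both with newline='' as the csv module documentation recommends. Keeping the value as a string means 111- becomes -111 rather than -111.0, which matches the output asked for in the question; whether that is faster than the float conversion would need to be measured.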
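Using csv.reader also opens up another small optimization. At the moment you check whether count < 1000000 on every loop iteration, and you also increment the count variable. Instead, you could do something like the following: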
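import itertools

while True:
    processed = 0
    # islice yields at most 1,000,000 rows per batch; enumerate counts them
    for processed, row in enumerate(itertools.islice(reader, 1000000), start=1):
        pass  # trim logic goes here
    if processed < 1000000:
        break  # a short batch means the reader is exhausted
    print('1 Million records Processed')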
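This removes the explicit conditional check and the manual increment of the count variable from the loop body, which could add up to a noticeable saving over 80 million iterations.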
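Without having looked at the code too closely, I suspect you could reduce the memory footprint by reading the file line by line with a generator instead of reading it all at once.
It would look something like this: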
with open('really_big_file.csv') as f:
    for line in f:
        do_something(line)
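Also, csv.DictReader is probably slower than other ways of reading the file, simply because it has to build a hash table for every row. Reading the results into a tuple would probably be faster.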
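One way to check this claim on your own machine is a small timing sketch like the one below. The synthetic data, the constant names, and the two helper functions are all made up for illustration; only the 20-column shape and the 'Field 10' key come from the question, and actual timings will vary by machine and Python version.

```python
import csv
import io
import timeit

# Synthetic pipe-delimited data: 10,000 rows of 20 columns (values are made up)
SAMPLE = '\n'.join(
    '|'.join(['abc'] * 9 + ['221.36-'] + ['xyz'] * 10) for _ in range(10000)
)
FIELDS = ['Field %d' % i for i in range(1, 21)]


def read_with_reader():
    """Collect the tenth column using index access on csv.reader rows."""
    return [row[9] for row in csv.reader(io.StringIO(SAMPLE), delimiter='|')]


def read_with_dictreader():
    """Collect the same column using key access on csv.DictReader rows."""
    reader = csv.DictReader(io.StringIO(SAMPLE), fieldnames=FIELDS, delimiter='|')
    return [row['Field 10'] for row in reader]


if __name__ == '__main__':
    for func in (read_with_reader, read_with_dictreader):
        seconds = timeit.timeit(func, number=5)
        print(func.__name__, round(seconds, 3))
```

Both functions return the same values, so any difference in the printed times reflects the per-row overhead of building a dictionary.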
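I would suggest trying pandas and its read_csv() function. You can tell it to split on "|", then modify the column values to fix the negative sign, and then tell it to write out a new CSV file using to_csv().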
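A rough sketch of that pandas approach might look like the following. The function names, the column position 9, and the chunk size are assumptions for illustration; reading in chunks via the `chunksize` parameter is added here to keep memory bounded on 200 MB inputs, and every column is read as a string so that other fields pass through unchanged.

```python
import pandas as pd


def fix_trailing_minus(series):
    """Move a trailing '-' to the front of each string in the Series."""
    return series.str.replace(r'^(.*)-$', r'-\1', regex=True)


def convert(src, dst, column=9, chunksize=100_000):
    """Read pipe-delimited text in chunks and write each chunk to dst as CSV.

    src is a path or open file; dst is an open text file object, so
    successive chunks are appended one after another (hypothetical helper).
    """
    chunks = pd.read_csv(
        src,
        sep='|',
        header=None,            # the source files have no header row (assumed)
        dtype=str,              # keep every field as text
        keep_default_na=False,  # do not turn empty fields into NaN
        chunksize=chunksize,
    )
    for chunk in chunks:
        chunk[column] = fix_trailing_minus(chunk[column])
        chunk.to_csv(dst, header=False, index=False)
```

Whether this beats a plain csv.reader loop for this workload is not guaranteed, so it is worth timing both on a sample of the real files.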