
Python: efficient way to create new csv from large dataset

I have a script that removes "bad elements" from a master list of elements, then writes a CSV with the updated elements and their associated values.

My question is whether there is a more efficient way to perform the same operation in the for loop.

import pandas as pd

Master = pd.read_csv('some.csv', sep=',', header=0, error_bad_lines=False)

MasterList = Master['Elem'].tolist()
MasterListStrain1 = Master['Max_Principal_Strain'].tolist()

#this file should contain elements that are slated for deletion
BadElem=pd.read_csv('delete_me_elements_column.csv', sep=',',header=None, error_bad_lines=False)
BadElemList = BadElem[0].tolist() 

NewMasterList = (list(set(MasterList) - set(BadElemList)))

filename = 'NewOutput.csv'
outfile = open(filename,'w')

#pdb.set_trace()


for i,j in enumerate(NewMasterList):
    #pdb.set_trace()
    Elem_Loc = MasterList.index(j)
    line ='\n%s,%.25f'%(j,MasterListStrain1[Elem_Loc])
    outfile.write(line)  


print ("\n The new output file will be named: " + filename)


outfile.close()

Stage 1

If you really do want to iterate in a for loop then, besides using pd.to_csv, which is likely to improve performance (a sketch of that route follows the loop below), you can do the following:

...
SetBadElem = set(BadElemList)
...
for i, elem in enumerate(MasterList):
    if elem not in SetBadElem:
        line = '\n%s,%.25f' % (elem, MasterListStrain1[i])
        outfile.write(line)
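As an aside, here is a rough, indicative sketch of the pd.to_csv route mentioned above; it assumes the Master DataFrame and BadElemList from the question, and the column selection and options are my own choice rather than anything from the original code:

# Sketch only: filter with a boolean mask, then let pandas write the file.
keep = ~Master['Elem'].isin(BadElemList)
Master.loc[keep, ['Elem', 'Max_Principal_Strain']].to_csv(
    'NewOutput.csv', index=False, header=False, float_format='%.25f')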

Jumping around the list with .index() is never efficient, whereas a single pass that skips the bad elements will give you much better performance (checking membership in a set is an average O(1) hash lookup, so it is quick).
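To make that point concrete, here is an indicative-only micro-benchmark on throwaway data (my own illustration, not part of the original code):

import timeit

master = list(range(20000))
bad_list = list(range(0, 20000, 10))   # 2000 "bad" ids kept in a list
bad_set = set(bad_list)                # the same ids kept in a set

# membership test against a list rescans the list every time
slow = timeit.timeit(lambda: [x for x in master if x not in bad_list], number=1)
# membership test against a set is an average O(1) hash lookup
fast = timeit.timeit(lambda: [x for x in master if x not in bad_set], number=1)
print(slow, fast)   # the set version is typically orders of magnitude faster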

Stage 2 Using Pandas properly

...
SetBadElem = set(BadElemList)
...
for row in Master.itertuples(index=False):
    if row.Elem not in SetBadElem:
        line = '\n%s,%.25f' % (row.Elem, row.Max_Principal_Strain)
        outfile.write(line)

There is no need to create lists out of pandas dataframe columns. Using the whole dataframe (and indexing into it) is a much better approach.
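As a small illustration of "indexing into it" (my own sketch; it assumes the 'Elem' values are unique, which the question does not state):

indexed = Master.set_index('Elem')
some_elem = Master['Elem'].iloc[0]                        # any element id, purely for demonstration
strain = indexed.loc[some_elem, 'Max_Principal_Strain']   # direct lookup, no list.index() scan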

Stage 3 Removing messy iterated formatting operations

We can add a column ('Formatted') that will contain the formatted data. For that we will create a lambda function:

formatter = lambda row: '\n%s,%.25f' % (row['Elem'], row['Max_Principal_Strain'])

Master['Formatted'] = Master.apply(formatter, axis=1)
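Once the 'Formatted' column exists, the strings could be written out directly, for example (a hypothetical continuation, not spelled out in the original; the filtering of bad elements only happens in Stage 4 below):

with open(filename, 'w') as outfile:
    outfile.writelines(Master['Formatted'])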

Stage 4 Pandas-way filtering and output

We can format the dataframe in two ways. My preference is to reuse the formatting function:

import numpy as np
formatter = lambda row: '\n%s,%.25f' % (row['Elem'], row['Max_Principal_Strain']) if row['Elem'] not in SetBadElem else np.nan

Now we can apply the formatter and use the built-in dropna, which drops all rows that have any NaN values:

Master['Formatted'] = Master.apply(formatter, axis=1)
Master = Master.dropna()
Master.to_csv(filename)
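Putting the stages together, an end-to-end sketch (assuming the same file and column names as in the question) would look like this:

import numpy as np
import pandas as pd

Master = pd.read_csv('some.csv', sep=',', header=0)
BadElem = pd.read_csv('delete_me_elements_column.csv', sep=',', header=None)
SetBadElem = set(BadElem[0])

# NaN marks rows slated for deletion so that dropna can remove them
formatter = lambda row: ('\n%s,%.25f' % (row['Elem'], row['Max_Principal_Strain'])
                         if row['Elem'] not in SetBadElem else np.nan)

Master['Formatted'] = Master.apply(formatter, axis=1)
Master = Master.dropna()
Master.to_csv('NewOutput.csv')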
