
Python list from huge text file

I need to import a text file with 1M+ lines into a Pandas DataFrame and then split each line by position. Here is an example of what the data looks like:

Corno        616600000071835014010820220108202201201082022100000054700000020000000000            0000000000192           010 150   0100090076002000220000004038+00000000000000002934+EUR00000000000293+EUR00000000003227+EUR000000000000            KG     

What I'm doing right now is decoding each line and then splitting it with Python list slicing:

data_t = []  # accumulate one row (a list of fields) per input line
with open('temp.txt', 'rb') as f:   # binary mode, so each line is bytes and can be decoded
    for line in f:
        dec = line.decode("cp1252")
        temp = [dec[0:6], dec[6:12], dec[12:14], dec[14:22], dec[22:30], dec[30:38], dec[38:39],
                dec[39:41], dec[41:49], dec[49:52], dec[52:60], dec[60:62], dec[62:70], dec[70:82],
                dec[82:84], dec[84:92], dec[92:104], dec[104:108], dec[108:126], dec[126:144],
                dec[144:152], dec[152:153], dec[153:213], negozio]  # negozio is defined elsewhere in my script
        data_t.append(temp)

and then creating a DataFrame from the list:

df_total_testate = pd.DataFrame(data_t,
                                columns=[f"Column{i}" for i in range(1, 25)])

But this way it takes over 20 minutes.

Is there any way to make it faster?

EDIT: I updated the code with more details

You should use pandas built-in functions, as they are highly optimised. Also, you should avoid pd.DataFrame.append. As mentioned in the docs:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
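In other words, collect the new rows in a plain Python list and build or concatenate a DataFrame once at the end. A minimal sketch of that pattern (the names records and df are assumptions, not from your code):

new_rows = []                                   # collect rows here instead of calling df.append repeatedly
for record in records:                          # hypothetical source of per-row data
    new_rows.append(record)
df = pd.concat([df, pd.DataFrame(new_rows)],    # a single concatenate at the end
               ignore_index=True)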

My advice is thus: either load the data with pd.read_csv(filename, sep="\s+", names=[f"Column{i}" for i in range(1, 25)]) (you might also try other delimiters instead of \s+), or, since the fields are fixed-width, use pd.read_fwf as mentioned by Алексей Р in the comments.
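For the read_fwf route, a minimal sketch could look like the following. The column boundaries are copied from the slice positions in your code; the file name, the encoding, dtype=str, and the handling of the extra negozio column are assumptions you would adapt:

import pandas as pd

# Column boundaries taken from the slice positions in the question (0:6, 6:12, ..., 153:213)
colspecs = [(0, 6), (6, 12), (12, 14), (14, 22), (22, 30), (30, 38), (38, 39),
            (39, 41), (41, 49), (49, 52), (52, 60), (60, 62), (62, 70), (70, 82),
            (82, 84), (84, 92), (92, 104), (104, 108), (108, 126), (126, 144),
            (144, 152), (152, 153), (153, 213)]

df = pd.read_fwf("temp.txt",
                 colspecs=colspecs,
                 names=[f"Column{i}" for i in range(1, len(colspecs) + 1)],
                 encoding="cp1252",
                 dtype=str,      # keep every field as text; convert columns afterwards if needed
                 header=None)

# The 24th value in your row (negozio) does not come from the file,
# so it would be added as a constant column afterwards:
# df["Column24"] = negozio

This reads and splits the whole file in a single optimised call instead of looping over lines in Python.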
