
Divide the txt file into lists by symbol, python3

I have a txt file, ~1-10MB, like this:

"Type,int_data,ID,..., some data"

I want to separate it by ID. For example, I do the following:

list_1=[]
list_2=[]
list_3=[]
.. 
list_7=[]

with open(txt,'r', encoding='utf-8') as txt:                        
    for string in txt:
        string=string.rstrip().split(',')
        ID=int(string[2])
        if ID==1:
            list_1.append(string)
        elif ID==2:
            list_2.append(string)
            ..

But it's quite slow. Can it be done better?

How about this? It might not be much faster, but give it a try and let me know!

from collections import defaultdict
res = defaultdict(list)  # dict of lists, keyed by ID
with open(txt, 'r', encoding='utf-8') as f:
    for line in f:
        res[line.split(',')[2]].append(line)  # append each line under its ID key
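If you want this to map back onto the original list_1 .. list_7 variables, here is a minimal usage sketch; the data.txt filename and the int() conversion of the ID are my assumptions, not part of the answer above:

from collections import defaultdict

res = defaultdict(list)
with open('data.txt', 'r', encoding='utf-8') as f:  # 'data.txt' is a hypothetical filename
    for line in f:
        fields = line.rstrip().split(',')
        res[int(fields[2])].append(fields)  # key by integer ID, keep the split fields

list_1 = res[1]  # same content the original list_1 would have held
list_2 = res[2]
print({id_: len(rows) for id_, rows in res.items()})  # row count per ID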

Here is a code snippet I have used before on 100 MB+ files (I am not the author). Not sure if it helps at your file size or if the overhead is too much. Basically, it works like this: first, break the file into byte chunks (chunkify), then for each chunk spawn a job that reads from the start to the end of that chunk. These jobs are distributed to a process pool, so you can use all of your cores while keeping the amount of data sent to and from the workers small.

For your case, just add a 'process' function for 'process_wrapper' to call on each line, like the one @Keerthi Bachu has.

This may work, or may give some inspiration.

import multiprocessing as mp, os

def process(line):
    # placeholder: put your per-line work here (e.g. split on ',' and group by ID)
    pass

def process_wrapper(chunkStart, chunkSize):
    # each worker re-opens the file and handles only its own byte range
    with open("input.txt", 'rb') as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).decode('utf-8').splitlines()
        for line in lines:
            process(line)

def chunkify(fname, size=1024*1024):
    # yield (start, length) byte ranges that end on line boundaries
    fileEnd = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)   # jump roughly `size` bytes ahead...
            f.readline()      # ...then advance to the end of the current line
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break

if __name__ == '__main__':
    #init objects
    cores = mp.cpu_count()
    pool = mp.Pool(cores)
    jobs = []

    #create jobs
    for chunkStart, chunkSize in chunkify("input.txt"):
        jobs.append(pool.apply_async(process_wrapper, (chunkStart, chunkSize)))

    #wait for all jobs to finish
    for job in jobs:
        job.get()

    #clean up
    pool.close()
    pool.join()
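One caveat: Pool workers are separate processes, so a process() that appends to global lists will not accumulate anything in the parent. A minimal sketch of one way to adapt the pattern to the ID grouping in the question, reusing the chunkify generator above and replacing process/process_wrapper and the driver block; group_chunk and the merge loop are my own illustration, not part of the original snippet:

import multiprocessing as mp
from collections import defaultdict

def group_chunk(chunkStart, chunkSize):
    # hypothetical replacement for process_wrapper: return this chunk's groups to the parent
    groups = defaultdict(list)
    with open("input.txt", 'rb') as f:
        f.seek(chunkStart)
        for line in f.read(chunkSize).decode('utf-8').splitlines():
            if line:
                fields = line.split(',')
                groups[fields[2]].append(fields)
    return groups

if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    jobs = [pool.apply_async(group_chunk, (start, size))
            for start, size in chunkify("input.txt")]

    merged = defaultdict(list)   # ID -> all rows from every chunk
    for job in jobs:
        for key, rows in job.get().items():
            merged[key].extend(rows)

    pool.close()
    pool.join()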
