
Writing to CSV with Python multiprocessing apply_async causes missing data

I have a CSV file from which I read URLs line by line and make a request to each endpoint. Each response is parsed and the data is written to output.csv. The process is parallelized.

The issue is with the written data: some rows are partially missing and others are missing entirely (blank lines). I suppose this happens because of collisions or conflicts between the async processes. Can you please advise how to fix that?

import csv
import re
import requests
from multiprocessing import Pool

def parse_data(url, line_num):
    print line_num, url
    # fetch the endpoint and extract every <pois> entry from the response
    r = requests.get(url)
    htmltext = r.text.encode("utf-8")
    pois = re.findall(re.compile('<pois>(.+?)</pois>'), htmltext)
    for poi in pois:
        write_data(poi)

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        writer.writerow([poi])
    resfile.close()

def main():
    pool = Pool(processes=4)

    with open("input.csv", "rb") as f:
        reader = csv.reader(f)
        for line_num, line in enumerate(reader):
            url = line[0]
            pool.apply_async(parse_data, args=(url, line_num))

    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

Try to add file locking:

import fcntl

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        fcntl.flock(resfile, fcntl.LOCK_EX)  # block until we hold an exclusive lock on the file
        writer.writerow([poi])
        resfile.flush()                      # make sure the row hits the file before the lock is released
        fcntl.flock(resfile, fcntl.LOCK_UN)
    # Note that you don't have to close the file; the 'with' block takes care of it.

Concurrent writes to the same file are indeed a known cause of data loss and file corruption. The safe solution here is the "map/reduce" pattern: each process writes to its own result file (map), then you concatenate those files together (reduce); see the sketch below.
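A minimal sketch of that pattern, assuming the same input format and Python 2 style as the question's code; the output_part_<line_num>.csv file names and the final concatenation step are illustrative choices, not part of the original answer:

import csv
import glob
import os
import re
import requests
from multiprocessing import Pool

def parse_data(url, line_num):
    # "Map" step: each task writes to its own part file, keyed by the input
    # line number, so no two processes ever write to the same file.
    r = requests.get(url)
    pois = re.findall(re.compile('<pois>(.+?)</pois>'), r.text.encode("utf-8"))
    with open('output_part_%d.csv' % line_num, 'wb') as partfile:
        writer = csv.writer(partfile)
        for poi in pois:
            writer.writerow([poi])

def main():
    pool = Pool(processes=4)
    with open("input.csv", "rb") as f:
        for line_num, line in enumerate(csv.reader(f)):
            pool.apply_async(parse_data, args=(line[0], line_num))
    pool.close()
    pool.join()

    # "Reduce" step: once every worker has finished, concatenate the part
    # files into a single output.csv and remove the parts.
    with open('output.csv', 'wb') as resfile:
        for partname in sorted(glob.glob('output_part_*.csv')):
            with open(partname, 'rb') as part:
                resfile.write(part.read())
            os.remove(partname)

if __name__ == '__main__':
    main()

Because each part file has exactly one writer, no locking is needed; the only synchronization point is pool.join() before the reduce step.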
