
Writing to CSV with Python multiprocessing apply_async causes missing data

I have a CSV file from which I read URLs line by line and make a request to each endpoint. Each response is parsed and the data is written to output.csv. The process is parallelized.

The issue is with the written data: some rows are partially missing and others are missing entirely (blank lines). I suppose this happens because of collisions or conflicts between the async processes. Can you please advise how to fix that?

import csv
import re
import requests
from multiprocessing import Pool

def parse_data(url, line_num):
    print line_num, url
    # fetch the endpoint and extract every <pois> entry from the response
    r = requests.get(url)
    htmltext = r.text.encode("utf-8")
    pois = re.findall(re.compile('<pois>(.+?)</pois>'), htmltext)
    for poi in pois:
        write_data(poi)

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        writer.writerow([poi])
    resfile.close()

def main():
    pool = Pool(processes=4)

    with open("input.csv", "rb") as f:
        reader = csv.reader(f)
        for line_num, line in enumerate(reader):
            url = line[0]
            pool.apply_async(parse_data, args=(url, line_num))

    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

Try to add file locking:

import fcntl

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        fcntl.flock(resfile, fcntl.LOCK_EX)  # block until we hold an exclusive lock on the file
        writer.writerow([poi])
        resfile.flush()                      # make sure the row hits the file before the lock is released
        fcntl.flock(resfile, fcntl.LOCK_UN)
    # Note that you don't have to close the file; the 'with' block takes care of it.

Concurrent writes to the same file are indeed a known cause of data loss and file corruption. The safe solution here is the "map/reduce" pattern: each process writes to its own result file (map), then you concatenate those files together (reduce); see the sketch below.
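A minimal sketch of that pattern, assuming the same input format and Python 2 style as the question's code; the output_part_<line_num>.csv file names and the final concatenation step are illustrative choices, not part of the original answer:

import csv
import glob
import os
import re
import requests
from multiprocessing import Pool

def parse_data(url, line_num):
    # "Map" step: each task writes to its own part file, keyed by the input
    # line number, so no two processes ever write to the same file.
    r = requests.get(url)
    pois = re.findall(re.compile('<pois>(.+?)</pois>'), r.text.encode("utf-8"))
    with open('output_part_%d.csv' % line_num, 'wb') as partfile:
        writer = csv.writer(partfile)
        for poi in pois:
            writer.writerow([poi])

def main():
    pool = Pool(processes=4)
    with open("input.csv", "rb") as f:
        for line_num, line in enumerate(csv.reader(f)):
            pool.apply_async(parse_data, args=(line[0], line_num))
    pool.close()
    pool.join()

    # "Reduce" step: once every worker has finished, concatenate the part
    # files into a single output.csv and remove the parts.
    with open('output.csv', 'wb') as resfile:
        for partname in sorted(glob.glob('output_part_*.csv')):
            with open(partname, 'rb') as part:
                resfile.write(part.read())
            os.remove(partname)

if __name__ == '__main__':
    main()

Because each part file has exactly one writer, no locking is needed; the only synchronization point is pool.join() before the reduce step.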
