Python Speed up csv manipulation

Is there any way to speed up the processing of this csv file manipulation? With a csv with 5000 entries it works fine, but when there are 1,000,000+ entries it takes a long time.

import csv

r1 = csv.reader(open('file1.csv'))
r2 = csv.reader(open('file2.csv'))
with open(file3, 'w', newline='') as wf:
    writer = csv.writer(wf)
    entries = []
    first = True

    # collect [parent, child] pairs from file1, skipping the header row
    for child, a, b, c, parent, d in r1:
        if not child and not parent:
            continue
        if first:
            first = False
            continue
        entries.append([parent, child])

    first = True

    # same for file2
    for child, _, _, _, parent, _ in r2:
        if not child and not parent:
            continue
        if first:
            first = False
            continue

        entries.append([parent, child])

    # append [p, p] for every parent value that never appears as a child
    for p, c in entries:
        for sp, sc in entries:
            if p == sc:
                break
        else:
            entries.append([p, p])


    writer.writerow(["parent_new", "child_new"])
    writer.writerows(entries)

Also, there is a line break between the header and the first row of data; is there any way to remove this blank line when writing to the new csv?

Your loop:

    for p, c in entries:
        for sp, sc in entries:
            if p == sc:
                break
        else:
            entries.append([p, p])

will take quadratic time, because for every entry it rescans the entire entries list.

All that loop seems to be doing is adding the values of p which do not equal any of the child values. Since these values come from the CSV file they must be strings, and are therefore hashable, so you could save them (or, more specifically, the unique values) in a set:

    children = set(child for parent, child in entries)

It costs some more memory, but then you can do:

    for p, c in list(entries):       # iterate over a snapshot so appended rows are not rescanned
        if p not in children:
            children.add(p)          # add each missing parent only once
            entries.append([p, p])

so this should then be linear time rather than quadratic (because set inclusion testing is essentially constant time).
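
To make the difference concrete, here is a rough, self-contained timing sketch on synthetic data; the sizes and random strings are made up purely for illustration:

    import random
    import string
    import time

    def rand_str():
        # random 6-letter string standing in for a parent or child id
        return ''.join(random.choices(string.ascii_lowercase, k=6))

    # Hypothetical synthetic data: N [parent, child] pairs.
    N = 5_000
    entries = [[rand_str(), rand_str()] for _ in range(N)]

    # Quadratic approach: rescan the whole list for every entry.
    start = time.perf_counter()
    missing_quadratic = [p for p, c in entries
                         if not any(p == sc for sp, sc in entries)]
    print("nested loops:", time.perf_counter() - start, "s")

    # Linear approach: one pass to build the set, then O(1) membership tests.
    start = time.perf_counter()
    children = {child for parent, child in entries}
    missing_linear = [p for p, c in entries if p not in children]
    print("set lookup:  ", time.perf_counter() - start, "s")

    assert missing_quadratic == missing_linear

Scaling N up by a factor of 10 makes the nested-loop version roughly 100 times slower, while the set version only gets about 10 times slower.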


On a more minor point, to skip the first row of each of the input files, instead of using your first variable (which you then have to test on every iteration), simply call next(r1) before entering the loop and discard the value -- and similarly for r2. That said, do not expect a huge gain from doing this, because it is in the linear-time part of the code. It is the O(n^2) bit mentioned above that really matters.
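
Putting both suggestions together, a minimal sketch of the whole script might look like this. It assumes the same file names and six-column layout as in the question; file3 is shown here as a hypothetical 'file3.csv' and should be replaced with your real output path.

    import csv

    file3 = 'file3.csv'    # hypothetical output path -- substitute your own

    entries = []
    for path in ('file1.csv', 'file2.csv'):
        with open(path, newline='') as rf:
            reader = csv.reader(rf)
            next(reader, None)                    # skip the header row
            for child, _, _, _, parent, _ in reader:
                if not child and not parent:      # skip rows with neither value
                    continue
                entries.append([parent, child])

    # Add a [p, p] row for every parent that never appears as a child.
    children = {child for parent, child in entries}
    for p, c in list(entries):                    # snapshot so appended rows are not rescanned
        if p not in children:
            children.add(p)                       # add each missing parent only once
            entries.append([p, p])

    with open(file3, 'w', newline='') as wf:
        writer = csv.writer(wf)
        writer.writerow(["parent_new", "child_new"])
        writer.writerows(entries)

Reading both inputs in one loop removes the duplicated header-skipping logic, and using with for the readers also makes sure the input files are closed.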
