
How can I optimize my code to process faster?

I have some performance issues with the code that I wrote. The objective of the code is to compare 2 csv files (with over 900k rows in one, and 50k ~ 80k rows in the other).

The goal is to compare csv1 and csv2, and write matching data to the 3rd csv.

The data I have looks like this:

CSV1:

address,name,order_no
add1,John,d009
add2,Smith,d019
add3,Mary,d890
.....(900k more rows)

CSV2:

address,hub_id
add3,x345
add4,x310
add1,a109
....(50k ~ 80k more rows)

The expected output:

CSV3:

order_no,hub_id
d890,x345
d009,a109
.....(etc)

The code I'm working on right now (albeit simple) actually works. But the whole process of comparing and writing takes a very long time to finish.

Any pointers would be much appreciated. I might have overlooked some Python function that could be used for comparing large data, since I've just started learning.

import csv
import time
start_time = time.time()

with open('csv1.csv', newline='', encoding='Latin-1') as masterfile:
    reader = csv.DictReader(masterfile)
    for row in reader:
        with open('csv2.csv', newline='', encoding='Latin-1') as list1:
            reader2 = csv.DictReader(list1)
            for row2 in reader2:
                if row2['address'] == row['address']:
                    with open('csv3.csv', 'a') as corder:
                        print(row['order_no'] + ',' + row2['hub_id'], file=corder)

print("--- %s seconds ---" % (time.time() - start_time))

What your algorithm is currently doing:

  1. Load a row of the big file.
  2. Open the smaller file.
  3. Do a linear search in the small file, from disk.
  4. Open the output file and write to it.
  5. Rinse and repeat.

All these steps are done 900k+ times.

Step #2, opening the smaller file, should only ever be done once. Opening a file and loading it from disk is an expensive operation. Just from loading it once at the beginning and doing the linear search (step #3) in memory, you would see a great improvement.

The same goes for step #4: opening the output file should only be done once. The system will flush the file to disk every time you close it. This is a very wasteful step. If you keep the file open, output data are buffered until there is enough to write a full block to the disk, which is a much faster way to accomplish that.
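
For illustration, here is a minimal sketch of just those two changes (keeping the linear search for now, and assuming the column names from your sample data: address, name, order_no and address, hub_id):

import csv

# Load the smaller file into memory once, instead of re-reading it from disk
# for every row of the big file.
with open('csv2.csv', newline='', encoding='Latin-1') as list1:
    small_rows = list(csv.DictReader(list1))

# Open the big file and the output file once and keep them open for the whole run.
with open('csv1.csv', newline='', encoding='Latin-1') as masterfile, open('csv3.csv', 'w') as corder:
    for row in csv.DictReader(masterfile):
        # Still a linear search, but now entirely in memory.
        for row2 in small_rows:
            if row2['address'] == row['address']:
                print(row['order_no'] + ',' + row2['hub_id'], file=corder)

This alone removes the repeated opens and flushes; the next step removes the quadratic scan itself.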

Step #3 can be optimized a lot by using the correct data structure. One of the most commonly used data structures in everyday programming is the hash table. They are ubiquitous because they make lookup a constant-time operation (unlike linear search, which scales linearly with the size of your input). Hash tables are implemented in the dict class in Python. By creating a dict with address as the key, you can reduce your processing time to a multiple of 900k + 80k rather than one of 900k * 80k. Look up algorithmic complexity to learn more. I particularly recommend Steve Skiena's "The Algorithm Design Manual".

One final step is to find the intersection of the addresses in each file. There are a few options available. You can convert both files into dicts and do a set-like intersection of the keys, or you can load one file into a dict and test the other one against it line by line. I highly recommend the latter, with the smaller file as the one you load into a dict. From an algorithmic perspective, having 10 times fewer elements means that you reduce the probability of hash collisions. This is also the cheapest approach, since it fails fast on irrelevant lines of the larger file, without recording them. From a practical standpoint, you may not even have the option of converting the larger file straightforwardly into a dictionary if, as I suspect, it has multiple rows with the same address.

Here is an implementation of what I've been talking about:

import csv

with open('csv2.csv', newline='', encoding='Latin-1') as lookupfile:
    lookup = dict(csv.reader(lookupfile))

with open('csv1.csv', newline='', encoding='Latin-1') as masterfile, open('csv3.csv', 'w') as corder:
    reader = csv.reader(masterfile)
    next(reader)  # skip the header row of csv1 so it is not treated as data
    corder.write('order_no,hub_id\n')
    for address, name, order_no in reader:
        hub_id = lookup.get(address)
        if hub_id is not None:
            corder.write(f'{order_no},{hub_id}\n')

The expression dict(csv.reader(lookupfile)) will fail if any of the rows are not exactly two elements long. For example, blank lines will crash it. This is because the constructor of dict expects an iterable of two-element sequences to initialize the key-value mappings.
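
If your csv2.csv might actually contain such lines, one tolerant alternative (just a sketch, which silently skips anything that is not exactly two fields) is to build the lookup with a comprehension:

with open('csv2.csv', newline='', encoding='Latin-1') as lookupfile:
    # Keep only rows that have exactly an address and a hub_id.
    lookup = {row[0]: row[1] for row in csv.reader(lookupfile) if len(row) == 2}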

As a minor optimization, I've not used csv.DictReader, as that requires extra processing for each line. Furthermore, I've removed the csv module from the output entirely, since you can do the job much faster without adding layers of wrappers. If your files are as neatly formatted as you show, you may get a tiny performance boost from splitting them around ',' yourself, rather than using csv.
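
For example, a rough sketch of that manual split (assuming no quoted fields or embedded commas, as in your sample data) could be:

with open('csv2.csv', encoding='Latin-1') as lookupfile:
    next(lookupfile)  # skip the header line
    lookup = dict(line.rstrip('\n').split(',', 1) for line in lookupfile if line.strip())

with open('csv1.csv', encoding='Latin-1') as masterfile, open('csv3.csv', 'w') as corder:
    next(masterfile)  # skip the header line
    corder.write('order_no,hub_id\n')
    for line in masterfile:
        address, name, order_no = line.rstrip('\n').split(',')
        hub_id = lookup.get(address)
        if hub_id is not None:
            corder.write(f'{order_no},{hub_id}\n')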

It takes so long because:

  • the complexity is O(n**2): never perform linear searches on big data like this
  • the constant file reads and writes add to the toll

You can do much better by creating 2 dictionaries with the address as key and the full row as value.

Then perform an intersection of the keys, and write the result, picking data from each dictionary as required.

The following code was tested on your sample data:

import csv

with open('csv1.csv', newline='', encoding='Latin-1') as f:
    reader = csv.DictReader(f)
    master_dict = {row["address"]:row for row in reader}
with open('csv2.csv', newline='', encoding='Latin-1') as f:
    reader = csv.DictReader(f)
    secondary_dict = {row["address"]:row for row in reader}

# key intersection

common_keys = set(master_dict) & set(secondary_dict)

with open("result.csv", "w", newline='', encoding='Latin-1') as f:
    writer = csv.writer(f)
    writer.writerow(['order_no',"hub_id"])
    writer.writerows([master_dict[x]['order_no'],secondary_dict[x]["hub_id"]] for x in common_keys)

The result is:

order_no,hub_id
d009,a109
d890,x345
