Comparing 2 lists with nested for loops

Question

I have 2 CSV files (each with more than 1000 lines) that look like this:

urls.csv

https://github.com/spacewalkproject/spacewalk
https://github.com/troglobit/uftpd
https://github.com/danschultzer/pow
https://github.com/opencast/opencast
https://github.com/ipmitool/ipmitool
https://github.com/NetHack/NetHack
https://github.com/NetHack/NetHack
https://github.com/tensorflow/tensorflow
https://github.com/twitter/secure_headers
https://github.com/twitter/secure_headers
...

2.csv

JavaScript,46.70%,https://github.com/jsomara/katello
Ruby,57.50%,https://github.com/Katello/katello
Java,82.30%,https://github.com/candlepin/candlepin
PHP,86.10%,https://github.com/roundcube/roundcubemail
C,96.60%,https://github.com/torvalds/linux
JavaScript,82.60%,https://github.com/jonrohan/ZeroClipboard
PHP,71.10%,https://github.com/nshahzad/phpVMS
Augeas,59.80%,https://github.com/hercules-team/augeas
null,null,https://github.com/horde/horde
JavaScript,88.00%,https://github.com/jquery/jquery-ui
...

When a URL appears in both files, I want to add the extra information from 2.csv to the matching row of urls.csv.

My code:

import csv

with open('urls.csv') as f_input, open('2.csv') as f2_input, open('result.csv', 'w', newline="") as f_output:

    csv_input = csv.reader(f_input)
    csv_input2 = csv.reader(f2_input)

    csv_output = csv.writer(f_output,delimiter=",")

    for url in csv_input:
        for row in csv_input2:
            if(url[0]==row[2]):
                Language=row[0]
                Percentage=row[1]
                csv_output.writerow([url[0],Language,Percentage])

My code only produces this one line:

https://github.com/spacewalkproject/spacewalk,Java,58.50%

Problem: this code only compares the first line of urls.csv against 2.csv and then stops. I am sure that more than 1000 of these URLs should match.

Answer 1

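The problem is that the first time you go through csv_input2, it reads the whole file and reaches the end. On the second pass of the outer loop there is nothing left to read, so nothing is found. The quick fix is to move open('2.csv') as f2_input inside the outer for loop.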
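The exhaustion described here can be seen in isolation. The following is a minimal sketch that assumes a two-row in-memory stand-in for 2.csv (the io.StringIO object is illustrative, not part of the original code):

```python
import csv
import io

# Hypothetical in-memory stand-in for 2.csv, so the sketch runs without files.
f2_input = io.StringIO(
    "Ruby,57.50%,https://github.com/Katello/katello\n"
    "C,96.60%,https://github.com/torvalds/linux\n"
)
csv_input2 = csv.reader(f2_input)

first_pass = list(csv_input2)   # consumes every row of the reader
second_pass = list(csv_input2)  # the reader is already at end of file

print(len(first_pass))   # 2
print(len(second_pass))  # 0
```

The second pass yields nothing, which is what happens to the inner loop on every outer iteration after the first.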

The drawback of this approach is that you read 2.csv once for every URL in csv_input, which is far slower than it needs to be.

A better way to solve this is to avoid the nested loop in the first place. Make a first pass that adds all the URLs to a set:

urls = set()
for url in csv_input:
    # each row is a list; the URL is its first (and only) column
    urls.add(url[0])

Now that you have all the URLs in memory, loop over the second CSV file and check each row against the set:

for row in csv_input2:
    url = row[2]
    if url in urls:
        Language=row[0]
        Percentage=row[1]
        csv_output.writerow([url,Language,Percentage])

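Note, however, that the results come out in the order of 2.csv, not in the order of the URLs in the original file. One possible approach is to use a list instead of a set (to keep the order), then add a sorting stage after the for loop.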
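One way the sorting stage could look is sketched below. It is not the answer's exact method: instead of a list it uses a dict that maps each URL to its first position in urls.csv, which keeps membership checks fast. The function name match_in_url_order and the sample rows are hypothetical.

```python
def match_in_url_order(url_rows, info_rows):
    """Return [url, language, percentage] rows ordered as in url_rows."""
    # Remember the first position at which each URL appears in urls.csv.
    position = {}
    for index, url_row in enumerate(url_rows):
        position.setdefault(url_row[0], index)

    # Single pass over 2.csv, keeping only rows whose URL is known.
    matches = [
        [row[2], row[0], row[1]]
        for row in info_rows
        if row[2] in position
    ]
    # Sorting stage: restore the order of the original URL file.
    matches.sort(key=lambda match: position[match[0]])
    return matches


url_rows = [
    ["https://github.com/torvalds/linux"],
    ["https://github.com/Katello/katello"],
]
info_rows = [
    ["Ruby", "57.50%", "https://github.com/Katello/katello"],
    ["C", "96.60%", "https://github.com/torvalds/linux"],
    ["PHP", "71.10%", "https://github.com/nshahzad/phpVMS"],
]
ordered = match_in_url_order(url_rows, info_rows)
```

Here ordered lists the linux row before the katello row, following url_rows rather than info_rows.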

Answer 2

I would build a dict from csv_input2, with the URL as the key and the rest of the row as the value:

csv_input = csv.reader(f_input)
csv_input2 = csv.reader(f2_input)

csv_output = csv.writer(f_output,delimiter=",")

data = {row[2]: (row[0], row[1]) for row in csv_input2}

for url in csv_input:
    try:
        d = data[url[0]]
        csv_output.writerow([url[0],*d])
    except KeyError:
        pass

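I'm using try/except because asking forgiveness rather than permission (EAFP) tends to be faster when most lookups succeed. The rest should be self-explanatory.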
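For comparison, here is a self-contained sketch of the same dict lookup written with dict.get instead of try/except. The function name join_urls and the in-memory sample files are assumptions made for illustration; with real files you would pass the objects returned by open().

```python
import csv
import io


def join_urls(f_input, f2_input, f_output):
    """Write url, language, percentage for every URL found in both inputs."""
    # Build the lookup once: URL -> (language, percentage).
    data = {row[2]: (row[0], row[1]) for row in csv.reader(f2_input)}
    csv_output = csv.writer(f_output)
    for url in csv.reader(f_input):
        # dict.get returns None for a missing key instead of raising KeyError.
        d = data.get(url[0])
        if d is not None:
            csv_output.writerow([url[0], *d])


# Hypothetical in-memory stand-ins for urls.csv, 2.csv and result.csv.
urls_file = io.StringIO(
    "https://github.com/torvalds/linux\n"
    "https://github.com/NetHack/NetHack\n"
)
info_file = io.StringIO("C,96.60%,https://github.com/torvalds/linux\n")
result_file = io.StringIO()

join_urls(urls_file, info_file, result_file)
print(result_file.getvalue())
```

Each file is read exactly once, and the output follows the order of urls.csv. If the same URL occurs more than once in 2.csv, the dict keeps only the last row for it.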

Answer 3

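csv reader objects are iterators (they behave like generators), so once you loop through one it reaches the end of the file, and the next time there are no more items to iterate over. Read each reader into a list first, then do the comparison on the lists:
output_1= [url for url in csv_input]
output_2= [row for row in csv_input2]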

import csv

with open('urls.csv') as f_input, open('2.csv') as f2_input, open('result.csv', 'w', newline="") as f_output:

    csv_input = csv.reader(f_input)
    csv_input2 = csv.reader(f2_input)

    csv_output = csv.writer(f_output,delimiter=",")

    output_1= [url for url in csv_input]
    output_2= [row for row in csv_input2]

    for url in output_1:
        for row in output_2:
            if(url[0]==row[2]):
                Language=row[0]
                Percentage=row[1]
                csv_output.writerow([url[0],Language,Percentage])

Answer 4

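It stops because the inner loop consumes (i.e. reads) the whole second file on the first pass, so on the second iteration of the outer loop (i.e. the second line of the first file) the inner loop has nothing left to run over.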

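Two possible solutions: 1. Use pandas and read the files as DataFrames. 2. Make sure you open and read the second file inside the outer loop, or seek back to the beginning of the file before each inner pass.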
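The "seek back to the beginning" option might look like the sketch below. It assumes in-memory stand-ins for the two files (io.StringIO objects, not part of the original code); real files opened with open() support seek(0) in the same way.

```python
import csv
import io

# Hypothetical in-memory stand-ins for urls.csv and 2.csv.
urls_file = io.StringIO(
    "https://github.com/torvalds/linux\n"
    "https://github.com/Katello/katello\n"
)
info_file = io.StringIO(
    "Ruby,57.50%,https://github.com/Katello/katello\n"
    "C,96.60%,https://github.com/torvalds/linux\n"
)

matches = []
for url in csv.reader(urls_file):
    info_file.seek(0)  # rewind so the inner loop sees every row again
    for row in csv.reader(info_file):
        if url[0] == row[2]:
            matches.append([url[0], row[0], row[1]])

print(matches)
```

This keeps the nested loop working, but it still rereads the second file once per URL, so it scales with the product of the two file sizes.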

4 answers

Answer 1
1 2020-02-27 18:36:28

Answer 2
1 accepted 2020-02-27 18:39:32

Answer 3
1 2020-02-27 18:55:52

Answer 4
0 2020-02-27 18:36:01
