
Merge Two CSV files in Python

I have two CSV files and I want to create a third CSV by merging the two. Here's how my files look:

Num | status
1213 | closed
4223 | open
2311 | open

and another file has this:

Num | code
1002 | 9822
1213 | 1891
4223 | 0011

So, here is the little piece of code I was trying to loop with, but it does not print the output with the third column added and matched to the correct values.

import csv
import time

def links():
    first = open('closed.csv')
    csv_file = csv.reader(first)

    second = open('links.csv')
    csv_file2 = csv.reader(second)

    for row in csv_file:
        for secrow in csv_file2:
            if row[0] == secrow[0]:
                print(row[0] + "," + row[1] + "," + secrow[0])
                time.sleep(1)

So what I want is something like:

Num | status | code
1213 | closed | 1891
4223 | open | 0011
2311 | open | blank no match

This is definitely a job for pandas. You can easily read both CSV files in as DataFrames and use either merge or concat. It'll be way faster and you can do it in just a few lines of code.

If you decide to use pandas, you can do it in only five lines.

import pandas as pd

first = pd.read_csv('closed.csv')
second = pd.read_csv('links.csv')

merged = pd.merge(first, second, how='left', on='Num')
merged.to_csv('merged.csv', index=False)
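One thing to watch with the snippet above: by default pandas will likely parse a code such as 0011 as the integer 11, and unmatched rows end up as empty cells. Here is a minimal sketch of one way around that, assuming comma-delimited input; the inline strings are only stand-ins for closed.csv and links.csv, and the 'blank no match' placeholder is taken from the question.

```python
import io
import pandas as pd

# Inline stand-ins for closed.csv and links.csv (assumed comma-delimited).
closed_csv = io.StringIO("Num,status\n1213,closed\n4223,open\n2311,open\n")
links_csv = io.StringIO("Num,code\n1002,9822\n1213,1891\n4223,0011\n")

# dtype=str keeps every value as text, so 0011 is not turned into 11.
first = pd.read_csv(closed_csv, dtype=str)
second = pd.read_csv(links_csv, dtype=str)

# A left merge keeps every row of the first file; unmatched rows get a missing 'code'.
merged = pd.merge(first, second, how='left', on='Num')

# Replace the missing codes with the placeholder text from the question.
merged['code'] = merged['code'].fillna('blank no match')

print(merged.to_csv(index=False))
```

For the pipe-delimited layout shown in the question you would probably also need to pass a different separator to read_csv and strip the padding spaces from the column names and values.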

You could read the values of the second file into a dictionary and then add them to the first.

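# csv_file and csv_file2 are the csv readers from the question.
# Map each Num in the second file to its code.
code = {}
for row in csv_file2:
    code[row[0]] = row[1]

# Append the matching code (or a placeholder) to every row of the first file.
for row in csv_file:
    row.append(code.get(row[0], "blank no match"))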
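A fuller sketch of the same dictionary idea that also keeps the extended rows and writes them out as CSV might look like the following. It assumes comma-delimited input; the inline strings stand in for closed.csv and links.csv, and `merge_rows` and its `missing` parameter are names made up for this illustration.

```python
import csv
import io

# Inline stand-ins for closed.csv and links.csv so the sketch runs without files on disk.
closed_text = "Num,status\n1213,closed\n4223,open\n2311,open\n"
links_text = "Num,code\n1002,9822\n1213,1891\n4223,0011\n"


def merge_rows(first_file, second_file, missing="blank no match"):
    """Return the rows of first_file with the matching value from second_file appended."""
    # Build the lookup table once: key is column 0, value is column 1.
    code = {row[0]: row[1] for row in csv.reader(second_file) if row}
    merged = []
    for row in csv.reader(first_file):
        if row:  # skip blank lines
            merged.append(row + [code.get(row[0], missing)])
    return merged


rows = merge_rows(io.StringIO(closed_text), io.StringIO(links_text))

# Write the result as CSV text; a real script would pass an open file here instead.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

Because the header line of the second file is stored in the dictionary as well, the header of the first file picks up the new column name through the same lookup.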

The problem is that you can iterate over a csv reader only once, so csv_file2 yields nothing after the first pass of the outer loop. To solve that, save the rows from csv_file2 in a list and iterate over the saved list instead. It could look like this:

import csv
import time


def links():
    # Open both files; the sample data is pipe-delimited.
    with open('closed.csv') as first, open('links.csv') as second:
        csv_file = csv.reader(first, delimiter="|")
        csv_file2 = csv.reader(second, delimiter="|")

        # A csv reader can be consumed only once, so keep the second file's rows in a list.
        second_rows = list(csv_file2)

        for row in csv_file:
            match = False
            for secrow in second_rows:
                # Ignore the padding spaces around the delimiter when comparing keys.
                if row[0].replace(" ", "") == secrow[0].replace(" ", ""):
                    print(row[0] + "," + row[1] + "," + secrow[1])
                    match = True
            if not match:
                print(row[0] + "," + row[1] + ", blank no match")
            time.sleep(1)

Output:

Num , status, code
1213 , closed, 1891
4223 , open, 0011
2311 , open, blank no match
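The one-pass behaviour of a csv reader described above can be seen in isolation with a small sketch. The two-row inline data is made up for the demonstration.

```python
import csv
import io

data = io.StringIO("1002|9822\n1213|1891\n")
reader = csv.reader(data, delimiter="|")

first_pass = list(reader)   # consumes every row
second_pass = list(reader)  # the reader is already exhausted

print(len(first_pass), len(second_pass))

# Rewinding the underlying file object lets a new reader start over.
data.seek(0)
third_pass = list(csv.reader(data, delimiter="|"))
print(len(third_pass))
```

Saving the rows in a list, as in the answer, is usually simpler than rewinding the file for every row of the outer loop.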

This code will do it for you:

import csv

def links():

    # open both files
    with open('closed.csv') as closed, open('links.csv') as links:

        # use DictReader so that columns can be accessed by name
        csv_closed = csv.DictReader(closed)
        csv_links = csv.DictReader(links)

        # create dictionaries out of the two CSV files using dictionary comprehensions
        num_dict = {row['num']: row['status'] for row in csv_closed}
        link_dict = {row['num']: row['code'] for row in csv_links}

    # print header, each column has width of 8 characters
    print("{0:8} | {1:8} | {2:8}".format("Num", "Status", "Code"))

    # print the information
    for num, status in num_dict.items():

        # note this call to link_dict.get() - we are getting values out of the link dictionary,
        # but specifying a default return value of an empty string if num is not found in it
        # to avoid an exception
        print("{0:8} | {1:8} | {2:8}".format(num, status, link_dict.get(num, '')))

links()

In it, I'm taking advantage of dictionaries, which let you access information by keys. I'm also using implicit loops (the dictionary comprehensions), which tend to be faster and require less code.

There are two quirks of this code that you should be aware of, both of which your example suggests are fine:

  1. Order is not guaranteed to be preserved on Python versions before 3.7 (because we're using plain dictionaries)
  2. Num entries that are in links.csv but not in closed.csv are not included in the printout

Last note: I made some assumptions about how your input files are formatted since you called them "CSV" files. This is what my input files looked like for this code:

closed.csv

num,status
1213,closed
4223,open
2311,open

links.csv

num,code
1002,9822
1213,1891
4223,0011

Given those input files, the result looks like this:

Num      | Status   | Code  
1213     | closed   | 1891  
2311     | open     |  
4223     | open     | 0011  
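If your files really are pipe-delimited with padding spaces, as in the question, the same dictionary lookup can still work once the whitespace is stripped. Here is a rough sketch under that assumption; `read_padded` is a helper name invented for this example, the inline strings stand in for the two files, and the column is called Num as in the question rather than num.

```python
import csv
import io


def read_padded(handle, delimiter="|"):
    """Yield dict rows with surrounding whitespace stripped from names and values."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(header, (cell.strip() for cell in row)))


# Inline stand-ins for the pipe-delimited files shown in the question.
closed = io.StringIO("Num | status\n1213 | closed\n4223 | open\n2311 | open\n")
links = io.StringIO("Num | code\n1002 | 9822\n1213 | 1891\n4223 | 0011\n")

link_dict = {row["Num"]: row["code"] for row in read_padded(links)}

# Iterate over the first file directly so its row order is kept.
result = [
    (row["Num"], row["status"], link_dict.get(row["Num"], ""))
    for row in read_padded(closed)
]

for num, status, code in result:
    print("{0:8} | {1:8} | {2:8}".format(num, status, code))
```

Looping over the rows of the first file, instead of over a dictionary built from it, keeps the original order on any Python version.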
