简体   繁体   English

根据列中的数据合并两个CSV文件

[英]Merge two CSV files based on a data from a column

I have two csv files like below. 我有两个csv文件,如下所示。

CSV1 CSV1

data13      data23      d      main_data1;main_data2      data13         data23
data12      data22      d      main_data1;main_data2      data12         data22
data11      data21      d      main_data1;main_data2      data11         data21
data3       data4       d      main_data2;main_data4      data3          data4
data52      data62      d      main_data3                 data51         data62
data51      data61      d      main_data3                 main_data3     data61
data7       data8       d      main_data4                 data7          data8

CSV2 CSV2

id1      main_data1      a1      a2      a3
id2      main_data2      b1      b2      b3
id3      main_data3      c1      c2      c3
id4      main_data4      d1      d2      d3
id5      main_data5      e1      e2      e3

Now my question is, I know how to merge two CSV files when one of the columns is exactly the same in both the files. 现在我的问题是,我知道当两个文件中的一列完全相同时如何合并两个CSV文件。 But my question is a little different. 但我的问题有点不同。 column 4 from CSV1 could contain column 2 from CSV2. CSV1中的第4列可以包含CSV2中的第2列。 I'd like to get a CSV file as below 我想获得一个CSV文件,如下所示

FINAL_CSV FINAL_CSV

id1      main_data1      a1      a2      a3      data13
id2      main_data2      b1      b2      b3      data3
id3      main_data3      c1      c2      c3      main_data3
id4      main_data4      d1      d2      d3      data7
id5      main_data5      e1      e2      e3

where: 哪里:
1. it matches the data from both the columns and gets corresponding rows from the first occurrence and write to the csv file. 1.它匹配来自两列的数据,并从第一次出现获取相应的行并写入csv文件。
2. When there's no match, it can leave the last column in FINAL_CSV blank or write 'NA' or anything of that sort. 2.当没有匹配时,它可以将FINAL_CSV中的最后一列留空或写入'NA'或任何类似的东西。
3. When data in columns 4 and 5 of CSV1 match exactly, it returns that row instead of the first occurrence. 3.当CSV1的第4列和第5列中的数据完全匹配时,它将返回该行而不是第一次出现的行。

I'm totally lost on how to do this. 我完全迷失了如何做到这一点。 Helping with a part of it is fine too. 帮助它的一部分也很好。 Any suggestions are highly appreciated. 任何建议都非常感谢。
PS- I know data from csv file should be separated by a comma, but for the sake of clarity, I preferred tabs, though the actual data is separated by commas. PS-我知道来自csv文件的数据应该用逗号分隔,但为了清楚起见,我更喜欢制表符,尽管实际数据用逗号分隔。

EDIT: Actually, the 'main_data' can be in any column in CSV2, not in just column2. 编辑:实际上,'main_data'可以位于CSV2的任何列中,而不仅仅位于column2中。 The same 'main_data' could also repeat in multiple rows, then I'd like to get all the corresponding rows. 相同的'main_data'也可以在多行中重复,然后我想获得所有相应的行。

A way with (g)awk . (g)awk的一种方式。

 awk -F, 'NR==FNR{a[$2]=$0;next}
         {split($4,b,";");x=b[1]}
         (x in a)&&!c[x]++{d[x]=$5}
         ($5 in a){d[$5]=$5}
         END{n=asorti(a,e);for(i=1;i<=n;i++)print a[e[i]]","d[e[i]]}'  CSV1 CSV2

Output 产量

id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3,

Have you considered using pandas ? 你考虑过使用熊猫吗? If you are familiar with R, then data-frames should be pretty straightforward. 如果您熟悉R,那么数据框应该非常简单。 The following gives you what you want: 以下为您提供所需内容:

from pandas import merge, read_table

csv1 = read_table('CSV1.csv', sep=r"[;,]", header=None)
csv2 = read_table('CSV2.csv', sep=r"[,]",  header=None)

print csv1
print csv2

Note that I replaced the tabs with commas and separated on semi-colons as well. 请注意,我用逗号替换了标签,并在分号上分隔。 The output so far should be: 到目前为止的输出应该是:

        0       1   2           3           4           5       6
0  data13  data23   d  main_data1  main_data2      data13  data23
1  data12  data22   d  main_data1  main_data2      data12  data22
2  data11  data21   d  main_data1  main_data2      data11  data21
3   data3   data4   d  main_data2  main_data4       data3   data4
4  data52  data62   d  main_data3         NaN      data51  data62
5  data51  data61   d  main_data3         NaN  main_data3  data61
6   data7   data8   d  main_data4         NaN       data7   data8

[7 rows x 7 columns]
     0           1   2   3   4
0  id1  main_data1  a1  a2  a3
1  id2  main_data2  b1  b2  b3
2  id3  main_data3  c1  c2  c3
3  id4  main_data4  d1  d2  d3
4  id5  main_data5  e1  e2  e3

[5 rows x 5 columns]

Using a left-join: 使用左连接:

kw1 = dict(how='left', \
          left_on=[3,4], \
          right_on=[1,1], \
          suffixes=('l', 'r'))

df1 = merge(csv1, csv2, **kw1)
df1.drop_duplicates(cols=[3], inplace=True)

print df1[[0,7]]

Gives the zeroth and seventh column of the merge: 给出合并的第0和第7列:

            3       5
0  main_data1  data13
3  main_data2   data3
4  main_data3  data51
6  main_data4   data7

[4 rows x 2 columns]

And to give the output as you want it, do another merge (this time an outer-join) with CSV2 : 要根据需要提供输出,请使用CSV2进行另一次合并(这次是外连接):

kw2 = dict(how='outer', \
           left_on=[3], \
           right_on=[1], \
           suffixes=('l', 'r'))

df2 = merge(df1, csv2, **kw2)

print df2[[15,16,17,18,19,8]]

Output: 输出:

     0           1   2  3r  4r       5
0  id1  main_data1  a1  a2  a3  data13
1  id2  main_data2  b1  b2  b3   data3
2  id3  main_data3  c1  c2  c3  data51
3  id4  main_data4  d1  d2  d3   data7
4  id5  main_data5  e1  e2  e3     NaN

You don't have to use **kw for keyword arguments. 您不必使用**kw作为关键字参数。 I simply used it to make everything fit horizontally. 我只是用它来使一切都水平适合。

I let read_table and merge decide column names. 我让read_tablemerge决定列名。 If you assign column names yourself, you will get better looking output. 如果您自己分配列名,您将获得更好的输出。

Since the condition for merging seems to be complicated it might be worthwhile to load the data into a database and use SQL. 由于合并的条件似乎很复杂,因此将数据加载到数据库并使用SQL可能是值得的。 Using SQLite in-memory you can do this like this (assuming comma separated data) 在内存中使用SQLite你可以这样做(假设逗号分隔数据)

import csv
import sqlite3

def createTable(cursor, rows, tablename):
    tableCreated = False
    for row in rows:
        if not tableCreated:
            sql = "CREATE TABLE %s(ROW INTEGER PRIMARY KEY, " + ", ".join(["c%d" % (i+1) for i in range(len(row))]) + ")"
            cur.execute(sql % tablename)
            tableCreated = True
        sql = "INSERT INTO %s VALUES(NULL, " + ", ".join(["'" + c + "'" for c in row]) + ")"
        cur.execute(sql % tablename)
    conn.commit()


conn = sqlite3.connect(":memory:")
cur = conn.cursor()

for filename, tablename in [(path_to_csv1, "CSV1"), (path_to_csv2, "CSV2")]:
    with open(filename, "r") as f:
        reader = csv.reader(f, delimiter=',')        
        rows = [row for row in reader]
    createTable(cur, rows, tablename)

You can then formulate your join logic in SQL. 然后,您可以在SQL中制定连接逻辑。 You can run queries like this: 您可以运行以下查询:

for row in cur.execute(your_sql_statement):
    print row

The following query gives the desired output: 以下查询提供了所需的输出:

WITH
MATCHES AS( -- get all matches
    SELECT      CSV2.*
                , CSV1.ROW as ROW_1                 
                , CSV1.C4 as C4_1
                , CSV1.C5 as C5_1
    FROM        CSV2 
    LEFT JOIN   CSV1 
    ON          CSV1.C4 LIKE '%' || CSV2.C2 || '%'    
),
EXACT AS( -- matches where CSV1.C4 = CSV1.C5
    SELECT      *
    FROM        MATCHES
    WHERE       C4_1 = C5_1
),
MIN_ROW AS( -- CSV1.ROW of first occurence for each CSV2.C1
    SELECT      C1
                , min(ROW_1) as ROW_1
    FROM        MATCHES
    WHERE       C1 NOT IN (SELECT C1 FROM EXACT)
    GROUP BY    C1, C2, C3, C4, C5                  
)
-- use C4=C5 first
SELECT      *
FROM        EXACT
UNION
-- if match not in exact, use first occurence
SELECT      MATCHES.*
FROM        MIN_ROW
INNER JOIN  MATCHES
ON          MIN_ROW.C1 = MATCHES.C1
AND         (MIN_ROW.ROW_1 = MATCHES.ROW_1 OR MIN_ROW.ROW_1 IS NULL)
ORDER BY    C1

Since you originally asked for a Python solution to this I thought I would provide one. 既然你最初要求Python解决方案,我想我会提供一个。 The simplest solution that occurred was to first load CSV1 and use it generate a mapping dictionary to use when generating the output from CSV2. 最简单的解决方案是首先加载CSV1并使用它生成一个映射字典,以便在从CSV2生成输出时使用。

If I understand the input file correctly, only the values to the left of the ; 如果我正确理解输入文件,只有左边的值; (if there is one) are to be considered. (如果有)将被考虑。 This can be achieved by using split(';') and taking element zero. 这可以通过使用split(';')和取零元素来实现。 If there is no ; 如果没有; then element zero will be the entire string. 那么元素零将是整个字符串。 Assignment to the mapper then just needs to follow the rules you've defined (only add if not already there, except when columns 4 & 5 match). 然后分配给mapper只需要遵循您定义的规则(只有在没有添加时才添加, 除非第 4和第5列匹配)。

The code below produces your requested output: 下面的代码生成您请求的输出:

import csv

mapper = dict()
with open('CSV1', 'r') as f1:
    reader = csv.reader(f1)
    for row in reader:
        # Column 3 contains the match; but we only want the left-most (before semi-colon)
        i = row[3].split(';')[0]
        # Column 4 contains the target value for output
        t = row[4]
        if i not in mapper:
            mapper[i] = t
        elif row[3] == row[4]:
            mapper[i] = t        

with open('CSV2', 'r') as f2:
    with open('FINAL_CSV', 'wb') as fo:
        reader = csv.reader(f2)
        writer = csv.writer(fo)
        for row in reader:
            if row[1] in mapper:
                row.append( mapper[ row[1] ] )
            writer.writerow(row)

The output file: 输出文件:

id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3

To address the 'main_data can be in any column of CSV' modification use the following code: 要解决'main_data可以在CSV的任何列'修改,请使用以下代码:

for row in reader:
    for r in row:
        if r in mapper:
            row.append( mapper[ r ] )
            break

    writer.writerow(row)

This will search each entry in the current row of CSV2 and if there is a match (to the original mapper data) append that mapped data to the row. 这将搜索当前CSV2行中的每个条目,如果匹配(对原始映射器数据),则将该映射数据附加到该行。 The row will then be written as before. 然后将像以前一样写入该行。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM