Merge two CSV files based on data from a column
I have two CSV files like below.
CSV1
data13 data23 d main_data1;main_data2 data13 data23
data12 data22 d main_data1;main_data2 data12 data22
data11 data21 d main_data1;main_data2 data11 data21
data3 data4 d main_data2;main_data4 data3 data4
data52 data62 d main_data3 data51 data62
data51 data61 d main_data3 main_data3 data61
data7 data8 d main_data4 data7 data8
CSV2
id1 main_data1 a1 a2 a3
id2 main_data2 b1 b2 b3
id3 main_data3 c1 c2 c3
id4 main_data4 d1 d2 d3
id5 main_data5 e1 e2 e3
I know how to merge two CSV files when one of the columns is exactly the same in both files. But my question is a little different: column 4 of CSV1 could contain column 2 of CSV2. I'd like to get a CSV file as below.
FINAL_CSV
id1 main_data1 a1 a2 a3 data13
id2 main_data2 b1 b2 b3 data3
id3 main_data3 c1 c2 c3 main_data3
id4 main_data4 d1 d2 d3 data7
id5 main_data5 e1 e2 e3
where:
1. It matches the data from both columns, takes the corresponding row from the first occurrence, and writes it to the CSV file.
2. When there's no match, it can leave the last column in FINAL_CSV blank or write 'NA' or anything of that sort.
3. When the data in columns 4 and 5 of CSV1 match exactly, it returns that row instead of the first occurrence.
I'm totally lost on how to do this. Help with even a part of it is fine too. Any suggestions are highly appreciated.
PS: I know data in a CSV file should be separated by commas, but for the sake of clarity I used tabs here; the actual data is comma-separated.
EDIT: Actually, the 'main_data' can be in any column in CSV2, not just column 2. The same 'main_data' could also repeat in multiple rows, in which case I'd like to get all the corresponding rows.
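One way with (g)awk. Note that asorti is a GNU awk extension, and that CSV2 must be given first so that its rows are indexed before CSV1 is scanned:
awk -F, 'NR==FNR{a[$2]=$0;next}
{split($4,b,";");x=b[1]}
(x in a)&&!c[x]++{d[x]=$5}
($5 in a){d[$5]=$5}
END{n=asorti(a,e);for(i=1;i<=n;i++)print a[e[i]]","d[e[i]]}' CSV2 CSV1
Output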
id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3,
Have you considered using pandas? If you are familiar with R, then data frames should be pretty straightforward. The following gives you what you want:
from pandas import merge, read_table
# Regex separators need the Python parsing engine
csv1 = read_table('CSV1.csv', sep=r"[;,]", header=None, engine='python')
csv2 = read_table('CSV2.csv', sep=r"[,]", header=None, engine='python')
print(csv1)
print(csv2)
Note that I replaced the tabs with commas and split on semicolons as well. The output so far should be:
0 1 2 3 4 5 6
0 data13 data23 d main_data1 main_data2 data13 data23
1 data12 data22 d main_data1 main_data2 data12 data22
2 data11 data21 d main_data1 main_data2 data11 data21
3 data3 data4 d main_data2 main_data4 data3 data4
4 data52 data62 d main_data3 NaN data51 data62
5 data51 data61 d main_data3 NaN main_data3 data61
6 data7 data8 d main_data4 NaN data7 data8
[7 rows x 7 columns]
0 1 2 3 4
0 id1 main_data1 a1 a2 a3
1 id2 main_data2 b1 b2 b3
2 id3 main_data3 c1 c2 c3
3 id4 main_data4 d1 d2 d3
4 id5 main_data5 e1 e2 e3
[5 rows x 5 columns]
Using a left join:
kw1 = dict(how='left',
           left_on=[3,4],
           right_on=[1,1],
           suffixes=('l', 'r'))
df1 = merge(csv1, csv2, **kw1)
# Keep only the first row for each value of column 3
df1.drop_duplicates(subset=[3], inplace=True)
print(df1[[0,7]])
Gives the zeroth and seventh column of the merge:
3 5
0 main_data1 data13
3 main_data2 data3
4 main_data3 data51
6 main_data4 data7
[4 rows x 2 columns]
And to give the output as you want it, do another merge (this time an outer join) with CSV2:
kw2 = dict(how='outer',
           left_on=[3],
           right_on=[1],
           suffixes=('l', 'r'))
df2 = merge(df1, csv2, **kw2)
print(df2[[15,16,17,18,19,8]])
Output:
0 1 2 3r 4r 5
0 id1 main_data1 a1 a2 a3 data13
1 id2 main_data2 b1 b2 b3 data3
2 id3 main_data3 c1 c2 c3 data51
3 id4 main_data4 d1 d2 d3 data7
4 id5 main_data5 e1 e2 e3 NaN
You don't have to use **kw for keyword arguments. I simply used it to make everything fit horizontally.
I let read_table and merge decide the column names. If you assign column names yourself, you will get better-looking output.
Since the condition for merging seems to be complicated, it might be worthwhile to load the data into a database and use SQL. Using an in-memory SQLite database you can do it like this (assuming comma-separated data):
import csv
import sqlite3

def createTable(cursor, rows, tablename):
    tableCreated = False
    for row in rows:
        if not tableCreated:
            # Name the columns c1..cN from the width of the first row
            columns = ", ".join(["c%d" % (i + 1) for i in range(len(row))])
            cursor.execute("CREATE TABLE %s(ROW INTEGER PRIMARY KEY, %s)" % (tablename, columns))
            tableCreated = True
        # Placeholders keep values that contain quotes from breaking the INSERT
        placeholders = ", ".join(["?"] * len(row))
        cursor.execute("INSERT INTO %s VALUES(NULL, %s)" % (tablename, placeholders), row)
    cursor.connection.commit()

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# path_to_csv1 and path_to_csv2 are the paths to your two input files
for filename, tablename in [(path_to_csv1, "CSV1"), (path_to_csv2, "CSV2")]:
    with open(filename, "r", newline="") as f:
        reader = csv.reader(f, delimiter=',')
        rows = [row for row in reader]
    createTable(cur, rows, tablename)
You can then formulate your join logic in SQL. You can run queries like this:
for row in cur.execute(your_sql_statement):
    print(row)
The following query gives the desired output:
WITH
MATCHES AS ( -- get all matches
    SELECT CSV2.*
         , CSV1.ROW as ROW_1
         , CSV1.C4 as C4_1
         , CSV1.C5 as C5_1
    FROM CSV2
    LEFT JOIN CSV1
        ON CSV1.C4 LIKE '%' || CSV2.C2 || '%'
),
EXACT AS ( -- matches where CSV1.C4 = CSV1.C5
    SELECT *
    FROM MATCHES
    WHERE C4_1 = C5_1
),
MIN_ROW AS ( -- CSV1.ROW of first occurrence for each CSV2.C1
    SELECT C1
         , min(ROW_1) as ROW_1
    FROM MATCHES
    WHERE C1 NOT IN (SELECT C1 FROM EXACT)
    GROUP BY C1, C2, C3, C4, C5
)
-- use C4=C5 first
SELECT *
FROM EXACT
UNION
-- if match not in exact, use first occurrence
SELECT MATCHES.*
FROM MIN_ROW
INNER JOIN MATCHES
    ON MIN_ROW.C1 = MATCHES.C1
    AND (MIN_ROW.ROW_1 = MATCHES.ROW_1 OR MIN_ROW.ROW_1 IS NULL)
ORDER BY C1
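The core of the query above is the LIKE-based LEFT JOIN. Here is a minimal sketch of just that step, assuming a cut-down copy of the sample data typed inline and relying on SQLite's implicit rowid rather than the explicit ROW column; the variable names are illustrative only and are not part of the answer.

```python
import sqlite3

# Three rows from each sample file, typed inline (assumption: no files needed)
csv1_rows = [
    ("data13", "data23", "d", "main_data1;main_data2", "data13", "data23"),
    ("data3", "data4", "d", "main_data2;main_data4", "data3", "data4"),
    ("data51", "data61", "d", "main_data3", "main_data3", "data61"),
]
csv2_rows = [
    ("id1", "main_data1", "a1", "a2", "a3"),
    ("id3", "main_data3", "c1", "c2", "c3"),
    ("id5", "main_data5", "e1", "e2", "e3"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CSV1(c1, c2, c3, c4, c5, c6)")
conn.execute("CREATE TABLE CSV2(c1, c2, c3, c4, c5)")
conn.executemany("INSERT INTO CSV1 VALUES(?, ?, ?, ?, ?, ?)", csv1_rows)
conn.executemany("INSERT INTO CSV2 VALUES(?, ?, ?, ?, ?)", csv2_rows)

# LEFT JOIN keeps every CSV2 row; unmatched rows get NULL (None in Python)
query = """
    SELECT CSV2.c1, CSV2.c2, CSV1.c5
    FROM CSV2
    LEFT JOIN CSV1 ON CSV1.c4 LIKE '%' || CSV2.c2 || '%'
    ORDER BY CSV2.c1, CSV1.rowid
"""
matches = conn.execute(query).fetchall()
for row in matches:
    print(row)
```

Keep in mind that LIKE '%main_data1%' would also match a value such as main_data10, so a join like this is only safe when no key is a substring of another key.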
Since you originally asked for a Python solution, I thought I would provide one. The simplest solution that occurred to me was to first load CSV1 and use it to generate a mapping dictionary, which is then used when generating the output from CSV2.
If I understand the input file correctly, only the value to the left of the ; (if there is one) is to be considered. This can be achieved by using split(';') and taking element zero. If there is no ; then element zero will be the entire string. Assignment to the mapper then just needs to follow the rules you've defined (only add if not already there, except when columns 4 and 5 match).
The code below produces your requested output:
import csv

mapper = dict()
with open('CSV1', 'r', newline='') as f1:
    reader = csv.reader(f1)
    for row in reader:
        # Index 3 (the fourth column) contains the match; we only want the left-most value (before the semicolon)
        i = row[3].split(';')[0]
        # Index 4 (the fifth column) contains the target value for output
        t = row[4]
        if i not in mapper:
            mapper[i] = t
        elif row[3] == row[4]:
            mapper[i] = t

with open('CSV2', 'r', newline='') as f2:
    # On Python 3, csv.writer expects a text-mode file opened with newline=''
    with open('FINAL_CSV', 'w', newline='') as fo:
        reader = csv.reader(f2)
        writer = csv.writer(fo)
        for row in reader:
            if row[1] in mapper:
                row.append( mapper[ row[1] ] )
            writer.writerow(row)
The output file:
id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3
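The mapping rules can also be checked without any files. The following is a minimal sketch, assuming a few CSV1 rows typed inline; the helper name build_mapper is hypothetical and not part of the answer above.

```python
def build_mapper(rows):
    """Map the left-most key in the fourth column to the fifth column's value."""
    mapper = {}
    for row in rows:
        key = row[3].split(';')[0]  # 'a;b' -> 'a'; a value without ';' is kept whole
        target = row[4]
        # The first occurrence wins, unless columns 4 and 5 match exactly
        if key not in mapper or row[3] == row[4]:
            mapper[key] = target
    return mapper

csv1_rows = [
    ["data13", "data23", "d", "main_data1;main_data2", "data13", "data23"],
    ["data12", "data22", "d", "main_data1;main_data2", "data12", "data22"],
    ["data52", "data62", "d", "main_data3", "data51", "data62"],
    ["data51", "data61", "d", "main_data3", "main_data3", "data61"],
]
mapper = build_mapper(csv1_rows)
print(mapper)
```

Here main_data1 keeps the value from its first row, while main_data3 is overwritten by the later row whose fourth and fifth columns are identical.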
To address the 'main_data can be in any column of CSV2' modification, replace the final loop with the following code:
for row in reader:
    for r in row:
        if r in mapper:
            row.append( mapper[ r ] )
            break
    writer.writerow(row)
This searches each entry in the current row of CSV2 and, if there is a match (against the original mapper data), appends the mapped value to the row. The row is then written as before.
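The any-column lookup can be tried on in-memory text as well. This is a minimal sketch, assuming the mapper has already been built and using io.StringIO in place of real files; the function name append_mapped and the sample rows are hypothetical.

```python
import csv
import io

def append_mapped(csv2_text, mapper):
    """Return CSV2 text with the mapped value appended to each matching row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv2_text)):
        for cell in row:
            if cell in mapper:
                row.append(mapper[cell])
                break  # stop at the first matching cell
        writer.writerow(row)
    return out.getvalue()

mapper = {"main_data1": "data13", "main_data3": "main_data3"}
# The key sits in a different column in each of the first two rows
csv2_text = (
    "id1,main_data1,a1,a2,a3\n"
    "c1,c2,main_data3,id3,c3\n"
    "id5,main_data5,e1,e2,e3\n"
)
result = append_mapped(csv2_text, mapper)
print(result)
```

The break matters: the row is being extended while it is iterated, so leaving the inner loop right after the append avoids scanning the newly added value.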
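Notice: The technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.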