
Linux - join 2 CSV files

I have 2 CSV files:

file_1 columns: id,user_id,message_id,rate
file_2 columns: id,type,timestamp

The relation between the files is that file_1.message_id = file_2.id.

I want to create a 3rd file that will have the following columns:

file_1.id,file_1.user_id,file_1.message_id,file_1.rate,file_2.timestamp

Any ideas on how to do this in Linux?

You can use the join command like this:

join -t, -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.3 <(sort -t, -k 3,3 file1) <(sort -t, -k 1,1 file2)

It first sorts the files (file1 is sorted by the 3rd field) and then joins them using the 3rd field of file1 and the 1st field of file2. It then outputs the fields you need. The <( ... ) process substitution requires bash or a similar shell.
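
For a concrete run, here is a minimal sketch of the same join on made-up sample data. It assumes the inputs are named file_1 and file_2 as in the question and that each starts with a header line, which tail -n +2 drops before sorting. The sample rows, the .sorted file names, and the LC_ALL=C setting (so that sort and join agree on ordering) are illustrative assumptions, not part of the original answer. Temporary files stand in for the process substitution so the sketch also runs in a plain POSIX shell.

```shell
# Hypothetical sample data, with header lines as in the question.
printf '%s\n' 'id,user_id,message_id,rate' \
    '1,3334,424,44' '2,3335,101,50' '3,3336,424,10' > file_1
printf '%s\n' 'id,type,timestamp' \
    '101,aa,11111' '424,rr,22222' > file_2

# Use one collation order for both sort and join.
export LC_ALL=C

# Drop the header line, then sort each file on its join field.
tail -n +2 file_1 | sort -t, -k 3,3 > file_1.sorted
tail -n +2 file_2 | sort -t, -k 1,1 > file_2.sorted

# Join field 3 of the first file with field 1 of the second,
# printing the four file_1 columns followed by the timestamp.
join -t, -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.3 \
    file_1.sorted file_2.sorted > file_3

cat file_3
```

The result is ordered by message_id rather than by the original row order of file_1, because join needs its inputs sorted on the join field.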

This seems to be a job for SQLite. Using the SQLite shell:

 create table f1(id,user_id,message_id,rate);
 create table f2(id,type,timestamp);

 .separator ,
 .import 'file_1.txt' f1
 .import 'file_2.txt' f2

 CREATE INDEX i1 ON f1(message_id ASC); -- optional
 CREATE INDEX i2 ON f2(id ASC);         -- optional

 .output 'output.txt'
 .separator ,

 SELECT f1.id, f1.user_id, f1.message_id, f1.rate, f2.timestamp
   FROM f1
   JOIN f2 ON f2.id = f1.message_id;

 .output stdout
 .q

Note that if even a single line has the wrong number of commas, the import stage will fail. You can prevent the rest of the script from running by putting .bail on at the beginning of the script.

If you want the unmatched ids, you can try:

SELECT f1.* FROM f1 LEFT JOIN f2 on f2.id = f1.message_id WHERE f2.id IS NULL

This selects every row from f1 for which no corresponding row in f2 has been found.

With awk you can try something like this:

awk -F, 'NR==FNR{a[$3]=$0;next} ($1 in a){print a[$1]","$3 > "file_3"}' file_1 file_2

Test:

[jaypal:~/Temp] cat file_1     # Contents of File_1
id,user_id,message_id,rate
1,3334,424,44

[jaypal:~/Temp] cat file_2     # Contents of File_2
id,type,timestamp
424,rr,22222

[jaypal:~/Temp] awk -F, 'NR==FNR{a[$3]=$0;next} ($1 in a){print a[$1]","$3 > "file_3"}' file_1 file_2

[jaypal:~/Temp] cat file_3     # Contents of File_3 made by the script
1,3334,424,44,22222
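
One limitation of the one-liner above is that a[$3] keeps only the last file_1 row for each message_id, so earlier rows that share a message are lost. A possible variant, sketched here under the assumption that ids in file_2 are unique, reads file_2 first and then streams file_1. The sample rows and the ts array name are made up for illustration.

```shell
# Hypothetical sample data: two file_1 rows share message_id 424.
printf '%s\n' 'id,user_id,message_id,rate' \
    '1,3334,424,44' '2,3335,424,10' '3,3336,999,5' > file_1
printf '%s\n' 'id,type,timestamp' '424,rr,22222' > file_2

awk -F, '
    # First file (file_2): remember the timestamp for each id.
    NR == FNR { ts[$1] = $3; next }
    # Second file (file_1): skip the header, then print each row
    # whose message_id has a known timestamp.
    FNR > 1 && ($3 in ts) { print $0 "," ts[$3] }
' file_2 file_1 > file_3

cat file_3
```

This keeps the row order of file_1 and drops rows with no match in file_2 (the row with message_id 999 in this sample).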

You can try this:
1. Change all lines to start with the key:

awk -F',' '{ print $3 " file1 " $1 " " $2 " " $4 }' < file1 >  temp
awk -F',' '{ print $1 " file2 " $2 " " $3 }'        < file2 >> temp

Now the lines look like:

message_id file1 id user_id rate
id file2 type timestamp
2. Sort temp by the first two columns. Now related lines are adjacent, with file1 first:

    sort -k 1,1 -k 2,2 < temp > temp2

3. Run awk to read the lines. On file1 lines save the fields; on file2 lines print them.
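
The answer does not show the final awk step, so the following is only one possible sketch of all three steps together. It assumes the inputs have no header lines and that no field contains a space. The sample rows, the output name file3, and the variable names (key, n, id, user, rate) are hypothetical.

```shell
# Hypothetical sample data without header lines.
printf '%s\n' '1,3334,424,44' '2,3335,424,10' > file1
printf '%s\n' '424,rr,22222' '500,zz,33333' > file2

# Step 1: put the join key first and tag each line with its source file.
awk -F',' '{ print $3 " file1 " $1 " " $2 " " $4 }' < file1 >  temp
awk -F',' '{ print $1 " file2 " $2 " " $3 }'        < file2 >> temp

# Step 2: sort by key, then by tag, so file1 lines precede file2 lines.
LC_ALL=C sort -k 1,1 -k 2,2 < temp > temp2

# Step 3: buffer the file1 rows of the current key, and print them
# with the timestamp when the file2 line for the same key arrives.
awk -v OFS=',' '
    $2 == "file1" {
        if ($1 != key) { key = $1; n = 0 }   # new key: discard old rows
        id[++n] = $3; user[n] = $4; rate[n] = $5
        next
    }
    $2 == "file2" && $1 == key {
        for (i = 1; i <= n; i++)
            print id[i], user[i], key, rate[i], $4
    }
' temp2 > file3

cat file3
```

Buffering is needed because several file1 rows can share one message_id, and with this sort order they all arrive before the file2 line that carries the timestamp.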
