Linux - join 2 CSV files
I have 2 CSV files:
file_1 columns: id,user_id,message_id,rate
file_2 columns: id,type,timestamp
The relation between the files is that file_1.message_id = file_2.id.
I want to create a 3rd file that will have the following columns:
file_1.id,file_1.user_id,file_1.message_id,file_1.rate,file_2.timestamp
Any ideas on how to do this in Linux?
You can use the join command like this:
join -t, -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.3 <(sort -t, -k 3,3 file1) <(sort -t, -k 1,1 file2)
It first sorts the files (file1 is sorted by the 3rd field) and then joins them using the 3rd field of file1 and the 1st field of file2. It then outputs the fields you need. Note that the <( ... ) process substitution needs a shell that supports it, such as bash.
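The question's files start with a header line, which join would otherwise treat as data. The following is a minimal sketch of one way to handle that, assuming the files are named file_1 and file_2 as in the question and that the header is exactly one line. It sorts into intermediate files (the .sorted names are made up for this example) so it also runs in shells without process substitution.

```shell
# Sketch: join file_1 and file_2 while skipping their header lines.
export LC_ALL=C                      # sort and join must agree on collation
workdir=$(mktemp -d) && cd "$workdir"

# Sample rows copied from the awk answer in this thread.
printf 'id,user_id,message_id,rate\n1,3334,424,44\n' > file_1
printf 'id,type,timestamp\n424,rr,22222\n' > file_2

# Drop each header (tail -n +2), then sort on the join field.
tail -n +2 file_1 | sort -t, -k 3,3 > file_1.sorted
tail -n +2 file_2 | sort -t, -k 1,1 > file_2.sorted

# Write a new header, then the joined rows.
{
  echo 'id,user_id,message_id,rate,timestamp'
  join -t, -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.3 file_1.sorted file_2.sorted
} > file_3

cat file_3
```

Setting LC_ALL=C is a design choice here: join expects its inputs in the same collation order that sort produced, and a fixed locale keeps the two consistent.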
Seems to be a job for SQLite. Using the SQLite shell:
create table f1(id,user_id,message_id,rate);
create table f2(id,type,timestamp);
.separator ,
.import 'file_1.txt' f1
.import 'file_2.txt' f2
CREATE INDEX i1 ON f1(message_id ASC); -- optional
CREATE INDEX i2 ON f2(id ASC); -- optional
.output 'output.txt'
.separator ,
SELECT f1.id, f1.user_id, f1.message_id, f1.rate, f2.timestamp
FROM f1
JOIN f2 ON f2.id = f1.message_id;
.output stdout
.q
Note that if even a single line has the wrong number of commas, the import stage will fail. You can prevent the rest of the script from running after such an error by putting
.bail on
at the beginning of the script.
If you want the unmatched ids, you can try:
SELECT f1.* FROM f1 LEFT JOIN f2 on f2.id = f1.message_id WHERE f2.id IS NULL
This selects every row from f1 for which no corresponding row in f2 has been found.
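For comparison, the same "rows of file_1 without a match" check can be sketched without SQLite, using join's -v option. This is only an illustration: the file names follow the question, the second row of file_1 is hypothetical sample data added so that something is unmatched, and the .sorted and unmatched file names are made up for the example.

```shell
# Sketch: list file_1 rows whose message_id has no matching id in file_2.
export LC_ALL=C                      # sort and join must agree on collation
workdir=$(mktemp -d) && cd "$workdir"

# Row 2 (message_id 425) is hypothetical and has no counterpart in file_2.
printf 'id,user_id,message_id,rate\n1,3334,424,44\n2,3335,425,10\n' > file_1
printf 'id,type,timestamp\n424,rr,22222\n' > file_2

# Drop the headers and sort each file on its join field.
tail -n +2 file_1 | sort -t, -k 3,3 > file_1.sorted
tail -n +2 file_2 | sort -t, -k 1,1 > file_2.sorted

# -v 1 prints only the unpairable lines from the first file;
# -o keeps the original column order of file_1.
join -t, -1 3 -2 1 -v 1 -o 1.1,1.2,1.3,1.4 file_1.sorted file_2.sorted > unmatched

cat unmatched
```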
With awk you can try something like this:
awk -F, 'NR==FNR{a[$3]=$0;next} ($1 in a){print a[$1]","$3 > "file_3"}' file_1 file_2
[jaypal:~/Temp] cat file_1 # Contents of File_1
id,user_id,message_id,rate
1,3334,424,44
[jaypal:~/Temp] cat file_2 # Contents of File_2
id,type,timestamp
424,rr,22222
[jaypal:~/Temp] awk -F, 'NR==FNR{a[$3]=$0;next} ($1 in a){print a[$1]","$3 > "file_3"}' file_1 file_2
[jaypal:~/Temp] cat file_3 # Contents of File_3 made by the script
1,3334,424,44,22222
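One thing to be aware of: the one-liner above stores file_1 in an array keyed by message_id, so if several rows of file_1 share a message_id, only the last one is kept. A possible variant, sketched here under the assumption that id is unique in file_2, loads file_2 into memory instead and streams file_1. The array name ts and the second sample row are made up for this illustration.

```shell
# Sketch: keep every file_1 row, even when message_id repeats.
workdir=$(mktemp -d) && cd "$workdir"

# First row is from the example above; the second row is hypothetical
# and reuses message_id 424 on purpose.
printf 'id,user_id,message_id,rate\n1,3334,424,44\n2,3335,424,50\n' > file_1
printf 'id,type,timestamp\n424,rr,22222\n' > file_2

awk -F, '
  NR == FNR  { ts[$1] = $3; next }       # file_2: remember the timestamp for each id
  ($3 in ts) { print $0 "," ts[$3] }     # file_1: append the timestamp when message_id matches
' file_2 file_1 > file_3

cat file_3
```

The header lines drop out on their own here, because the header of file_1 has message_id in its third field and that string never appears as an id key from file_2.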
You can try this:
1. Change all lines to start with the key:
awk -F',' '{ print $3 " file1 " $1 " " $2 " " $4 }' < file1 > temp
awk -F',' '{ print $1 " file2 " $2 " " $3 }' < file2 >> temp
Now the lines look like:
message_id file1 id user_id rate
id file2 type timestamp
2. Sort temp by the first two columns. Now related lines are adjacent, with file1 first:
sort -k 1,1 -k 2,2 < temp > temp2
3. Run awk to read the lines. In file1 lines save the fields, in file2 lines print them.
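The answer does not show the awk program for the last step, so the following is only a sketch of how it might look, not the author's own script. It repeats the first two steps so that it runs on its own, skips the header lines with FNR > 1 (an addition not in the steps above), and assumes no field contains a space. The names key, rows, n and file3 are made up for this illustration.

```shell
# Sketch of all three steps, ending with the awk pass that merges adjacent lines.
export LC_ALL=C
workdir=$(mktemp -d) && cd "$workdir"

# Sample rows copied from the awk answer in this thread.
printf 'id,user_id,message_id,rate\n1,3334,424,44\n' > file1
printf 'id,type,timestamp\n424,rr,22222\n' > file2

# Step 1: put the key first and tag each line with its source file.
awk -F',' 'FNR > 1 { print $3 " file1 " $1 " " $2 " " $4 }' < file1 > temp
awk -F',' 'FNR > 1 { print $1 " file2 " $2 " " $3 }' < file2 >> temp

# Step 2: sort by key, then by tag, so file1 lines precede the file2 line.
sort -k 1,1 -k 2,2 < temp > temp2

# Step 3: save file1 lines per key; on the matching file2 line, print them
# with the timestamp appended.
awk '
  $2 == "file1" {
    if ($1 != key) { n = 0; key = $1 }      # new key: forget rows of the previous key
    rows[++n] = $3 "," $4 "," $1 "," $5     # id,user_id,message_id,rate
    next
  }
  $2 == "file2" && $1 == key {
    for (i = 1; i <= n; i++) print rows[i] "," $4   # $4 is the timestamp
  }
' temp2 > file3

cat file3
```

Collecting the file1 lines in an array rather than a single variable means several rows with the same message_id are all written out when the matching file2 line arrives.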