How to extract lines from two text files linked by a heading number in the first 10 characters?
I have two files:
file1.txt:
0000001435 XYZ 与 ABC
0000001438warlaugh 世界
file2.txt:
0000001435 XYZ with abc
0000001436 DFC whatever
0000001437 FBFBBBF
0000001438 world of warlaugh
The lines in the two files are linked by the number in the first 10 characters. The desired output is a tab-separated file containing the lines that exist in file1.txt together with the corresponding lines from file2.txt:
file3.txt:
XYZ 与 ABC XYZ with abc
warlaugh 世界 world of warlaugh
How do I get the corresponding lines and then create the tab-separated file3.txt from the lines that exist in file1.txt?
Note that only the first 10 characters constitute the ID. There are cases like
0000001438warlaugh 世界
or even
0000001432231hahaha lol
where only 0000001438 and 0000001432 are the IDs.
I tried with Python, getfile3.py:
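import io

# Map each 10-character ID to the rest of its line.
with io.open('file1.txt', 'r', encoding='utf8') as fin:
    f1 = {line[:10]: line[10:].strip() for line in fin}
with io.open('file2.txt', 'r', encoding='utf8') as fin:
    f2 = {line[:10]: line[10:].strip() for line in fin}

with io.open('file3.txt', 'w', encoding='utf8') as f3:
    for i in f1:
        if i in f2:  # skip IDs that have no match in file2.txt
            f3.write(u"{}\t{}\n".format(f1[i], f2[i]))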
But is there a bash/awk/grep/perl command-line way to get file3.txt?
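awk '
# Split every line into the 10-character ID and the rest, minus leading blanks.
{ key = substr($0,1,10); data = substr($0,11); sub(/^[ \t]+/, "", data) }
# First file: remember its data by ID.
NR==FNR { file1[key] = data; next }
# Second file: print a tab-separated pair for every ID seen in the first file.
key in file1 { print file1[key] "\t" data }
' file1.txt file2.txt > file3.txt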
You could use FIELDWIDTHS with GNU awk rather than substr() if you prefer.
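For readers less familiar with the NR==FNR idiom, the same two-pass lookup can be sketched in Python. This is only an illustration, not part of the awk answer: the function name join_by_id and the id_width parameter are made up here, and the sketch assumes the data should be stripped of surrounding whitespace and joined with a tab, as in the question's file3.txt.

```python
def join_by_id(lines1, lines2, id_width=10):
    """Join two line sequences on their first id_width characters."""
    # First pass (the NR==FNR branch in awk): remember file1's data by ID.
    first = {}
    for line in lines1:
        line = line.rstrip("\n")
        first[line[:id_width]] = line[id_width:].strip()

    # Second pass: build a tab-separated row for every file2 line whose ID was seen.
    rows = []
    for line in lines2:
        line = line.rstrip("\n")
        key = line[:id_width]
        if key in first:
            rows.append(first[key] + "\t" + line[id_width:].strip())
    return rows


if __name__ == "__main__":
    # Sample data copied from the question.
    file1 = ["0000001435 XYZ 与 ABC\n", "0000001438warlaugh 世界\n"]
    file2 = [
        "0000001435 XYZ with abc\n",
        "0000001436 DFC whatever\n",
        "0000001437 FBFBBBF\n",
        "0000001438 world of warlaugh\n",
    ]
    for row in join_by_id(file1, file2):
        print(row)
```

As in the awk version, the rows come out in file2 order, and IDs that occur only in file2 are dropped.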
Super long Perl answer:
use warnings;
use strict;

# add files here as needed
my @input_files = qw(file1.txt file2.txt);
my $output_file = 'output.txt';

# don't touch anything below this line
my @output_lines = parse_files(@input_files);

open (my $output_fh, ">", $output_file) or die;
foreach (@output_lines) {
    print $output_fh "$_\n"; #print to output file
    print "$_\n";            #print to console
}
close $output_fh;

sub parse_files {
    my @input_files = @_; #list of text files to read.
    my %data;             #will store $data{$index} = datum1 datum2 datum3

    foreach my $file (@input_files) {
        open (my $fh, "<", $file) or die;
        while (<$fh>) {
            chomp;
            if (/^(\d{10})\s?(.*)$/) {
                my $index = $1;
                my $datum = $2;
                if (exists $data{$index}) {
                    $data{$index} .= "\t$datum";
                } else {
                    $data{$index} = $datum;
                } #/else
            } #/if regex found
        } #/while reading current file
        close $fh;
    } #/foreach file

    # Create output array
    my @output_lines;
    foreach my $key (sort keys %data) {
        push (@output_lines, "$data{$key}");
    } #/foreach

    return @output_lines;
} #/sub parse_files
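Note that the Perl script keeps every ID it encounters, so an ID that appears in only one file (such as 0000001436 in the sample data) still produces an output line. If only IDs present in every input are wanted, as in the question's file3.txt, one possible variant is sketched below in Python. The name merge_files and its parameters are hypothetical, and the handling of duplicate IDs described in the comments is an assumption, since the question does not say what should happen in that case.

```python
def merge_files(line_groups, id_width=10):
    """Merge a list of line sequences by ID, keeping IDs found in all of them."""
    merged = {}  # ID -> list of data fields, one per input processed so far
    for n, lines in enumerate(line_groups):
        for line in lines:
            line = line.rstrip("\n")
            key, datum = line[:id_width], line[id_width:].strip()
            if n == 0:
                # First input defines the candidate IDs (a repeated ID overwrites).
                merged[key] = [datum]
            elif key in merged and len(merged[key]) == n:
                # Later inputs only extend IDs matched in every earlier input;
                # a repeated ID within the same input is ignored.
                merged[key].append(datum)
    total = len(line_groups)
    # Sort by ID, as the Perl script does, and drop IDs missing from any input.
    return [
        "\t".join(fields)
        for key, fields in sorted(merged.items())
        if len(fields) == total
    ]


if __name__ == "__main__":
    file1 = ["0000001435 XYZ 与 ABC\n", "0000001438warlaugh 世界\n"]
    file2 = [
        "0000001435 XYZ with abc\n",
        "0000001436 DFC whatever\n",
        "0000001437 FBFBBBF\n",
        "0000001438 world of warlaugh\n",
    ]
    for row in merge_files([file1, file2]):
        print(row)
```

Passing a list of sequences keeps the Perl script's "add files here as needed" flexibility: a third input simply adds a third tab-separated column for the IDs that all three share.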