How to extract lines from two textfiles linked by heading number from the 1st 10 characters?

I have two files:

file1.txt :

0000001435 XYZ 与 ABC
0000001438warlaugh 世界

file2.txt :

0000001435 XYZ with abc
0000001436 DFC whatever
0000001437 FBFBBBF
0000001438 world of warlaugh

The lines in the two files are linked by the number (the first 10 characters). The desired output is a tab-separated file containing the lines that exist in file1.txt together with the corresponding lines from file2.txt :

file3.txt :

XYZ 与 ABC	XYZ with abc
warlaugh 世界	world of warlaugh

How do I get the corresponding lines from file2.txt and create a tab-separated file3.txt containing only the lines that exist in file1.txt ?

Note that only the first 10 characters constitute the ID. There are cases like 0000001438warlaugh 世界 or even 0000001432231hahaha lol, where only 0000001438 and 0000001432 are the IDs.
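To make the ID rule concrete, here is a minimal Python sketch of the split. The names split_id and id_width are illustrative only and are not part of the script below. It assumes the ID is always exactly the first 10 characters, whether or not a space follows it.

```python
def split_id(line, id_width=10):
    """Split a line into (ID, text); the ID is the first id_width characters."""
    line = line.rstrip("\n")
    # Everything after the ID is the text; strip the optional separating space.
    return line[:id_width], line[id_width:].strip()


print(split_id("0000001438warlaugh 世界"))    # ('0000001438', 'warlaugh 世界')
print(split_id("0000001432231hahaha lol"))   # ('0000001432', '231hahaha lol')
```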

I tried with python, getfile3.py :

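import io

# Map each 10-character ID to the rest of its line.
with io.open('file1.txt', 'r', encoding='utf8') as fin:
    f1 = {line[:10]: line[10:].strip() for line in fin}
with io.open('file2.txt', 'r', encoding='utf8') as fin:
    f2 = {line[:10]: line[10:].strip() for line in fin}

with io.open('file3.txt', 'w', encoding='utf8') as f3:
    for i in f1:
        if i in f2:  # skip IDs that have no match in file2.txt
            f3.write(u"{}\t{}\n".format(f1[i], f2[i]))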

But is there a bash/awk/grep/perl command-line way to get file3.txt ?

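awk '
# The key is the first 10 characters; the data is the rest, minus leading blanks.
{ key = substr($0,1,10); data = substr($0,11); sub(/^[ \t]+/, "", data) }
# First file: remember the data for each key.
NR==FNR { file1[key] = data; next }
# Second file: print tab-separated pairs for keys seen in the first file.
key in file1 { print file1[key] "\t" data }
' file1.txt file2.txt > file3.txt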

You could use FIELDWIDTHS with GNU awk rather than substr() if you prefer.

Super long Perl answer:

use warnings;
use strict;

# add files here as needed
my @input_files = qw(file1.txt file2.txt);
my $output_file = 'output.txt';

# don't touch anything below this line
my @output_lines = parse_files(@input_files);

open (my $output_fh, ">", $output_file) or die;
foreach (@output_lines) {
    print $output_fh "$_\n";                    #print to output file
    print "$_\n";                               #print to console
}
close $output_fh;

sub parse_files {
    my @input_files = @_;                       #list of text files to read.
    my %data;                                   #will store $data{$index} = datum1 datum2 datum3
    my %count;                                  #lines seen per index (assumes one line per index per file)

    foreach my $file (@input_files) {
        open (my $fh, "<", $file) or die "Cannot open $file: $!";
        while (<$fh>) {
            chomp;
            if (/^(\d{10})\s?(.*)$/) {
                my $index = $1;
                my $datum = $2;
                if (exists $data{$index}) {
                    $data{$index} .= "\t$datum";
                } else {
                    $data{$index} = $datum;
                } #/else
                $count{$index}++;
            } #/if regex found
        } #/while reading current file
        close $fh;
    } #/foreach file

    # Create output array, keeping only indexes found in every input file
    my @output_lines;
    foreach my $key (sort keys %data) {
        next unless $count{$key} == @input_files;
        push (@output_lines, $data{$key});
    } #/foreach

    return @output_lines;
} #/sub parse_files
