How to extract lines from two text files that are linked by a header number in the first 10 characters?

Question

I have two files:

file1.txt:

0000001435 XYZ 与 ABC
0000001438warlaugh 世界

file2.txt:

0000001435 XYZ with abc
0000001436 DFC whatever
0000001437 FBFBBBF
0000001438 world of warlaugh

The lines in the two files are linked by the number (the first 10 characters). The desired output is a tab-separated file containing the lines that exist in file1.txt together with the corresponding lines from file2.txt:

file3.txt:

XYZ 与 ABC	XYZ with abc
warlaugh 世界	world of warlaugh

How do I get the corresponding lines for the IDs that exist in file1.txt and write them to a tab-separated file3.txt?

Note that only the first 10 characters constitute the ID. There are cases like 0000001438warlaugh 世界 or even 0000001432231hahaha lol, where only 0000001438 and 0000001432 are the IDs.
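To make the ID rule concrete, here is a minimal Python sketch. It assumes the ID is always exactly the first 10 characters and that any whitespace right after it is only a separator. The helper name split_id is hypothetical, made up for illustration.

```python
def split_id(line):
    """Return (id, text): the first 10 characters and the stripped remainder."""
    return line[:10], line[10:].strip()


# Characters after position 10 belong to the text, even if they are digits.
print(split_id("0000001438warlaugh 世界"))
print(split_id("0000001432231hahaha lol"))
```

For the second sample line the ID is 0000001432 and the text is 231hahaha lol.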

I tried with Python, getfile3.py:

import io

def load(path):
    # Map the 10-character ID to the rest of the line.
    with io.open(path, 'r', encoding='utf8') as f:
        return {line[:10]: line[10:].strip() for line in f}

f1 = load('file1.txt')
f2 = load('file2.txt')

with io.open('file3.txt', 'w', encoding='utf8') as f3:
    for i in f1:
        if i in f2:  # skip IDs that are missing from file2.txt
            f3.write(u"{}\t{}\n".format(f1[i], f2[i]))

But is there a bash/awk/grep/perl command-line way to get file3.txt?

Answer 1

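awk '
# Split every line into the 10-character ID and the remaining text.
{ key = substr($0,1,10); data = substr($0,11); sub(/^[ \t]+/, "", data) }
# First file: remember the text for each ID.
NR==FNR { file1[key] = data; next }
# Second file: print tab-joined pairs for IDs seen in the first file.
key in file1 { print file1[key] "\t" data }
' file1.txt file2.txt > file3.txt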

If you prefer, you could use FIELDWIDTHS with GNU awk rather than substr().
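The lookup behind this answer (remember the text for each ID in the first file, then emit a row for every line of the second file whose ID was seen) can be traced in a short Python sketch. It is an illustration only, not a replacement for the awk command. The function name join_by_id is hypothetical, the sample lines are the ones from the question, and stripping the text and joining with a tab is an assumption taken from the desired output.

```python
def join_by_id(first_lines, second_lines, width=10):
    """Pair up lines that share the same leading ID of `width` characters."""
    seen = {}
    for line in first_lines:
        seen[line[:width]] = line[width:].strip()

    rows = []
    for line in second_lines:
        key = line[:width]
        if key in seen:  # skip IDs that only occur in the second input
            rows.append(seen[key] + "\t" + line[width:].strip())
    return rows


file1 = ["0000001435 XYZ 与 ABC", "0000001438warlaugh 世界"]
file2 = [
    "0000001435 XYZ with abc",
    "0000001436 DFC whatever",
    "0000001437 FBFBBBF",
    "0000001438 world of warlaugh",
]
for row in join_by_id(file1, file2):
    print(row)
```

Rows come out in the order of the second input, and only the first input is held in memory as a dictionary.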

Answer 2

Super long Perl answer:

use warnings;
use strict;

# add files here as needed
my @input_files = qw(file1.txt file2.txt);
my $output_file = 'output.txt';

# don't touch anything below this line
my @output_lines = parse_files(@input_files);

open (my $output_fh, ">", $output_file) or die "Cannot open $output_file: $!";
foreach (@output_lines) {
    print $output_fh "$_\n";                    #print to output file
    print "$_\n";                               #print to console
}
close $output_fh;

sub parse_files {
    my @input_files = @_;                       #list of text files to read.
    my %data;                                   #will store $data{$index} = datum1 datum2 datum3

    foreach my $file (@input_files) {           
        open (my $fh, "<", $file) or die "Cannot open $file: $!";
        while (<$fh>) { 
            chomp;                              
            if (/^(\d{10})\s?(.*)$/) {
                my $index = $1;
                my $datum = $2;
                if (exists $data{$index}) {
                    $data{$index} .= "\t$datum";
                } elsif ($file eq $input_files[0]) {
                    $data{$index} = $datum;     # only IDs from the first file start a row
                } #/elsif
            } #/if regex found
        } #/while reading current file
        close $fh;
    } #/foreach file

    # Create output array
    my @output_lines;
    foreach my $key (sort keys %data) {
        push (@output_lines, "$data{$key}");
    } #/foreach

    return @output_lines;
} #/sub parse_files
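
The grouping that parse_files does (match a 10-digit ID at the start of each line, then append each file's text to that ID's row with a tab) can be sketched in Python as follows. The function name merge_by_id is hypothetical. Restricting the result to IDs that occur in the first source is an assumption taken from the question's desired output.

```python
import re

ID_LINE = re.compile(r"^(\d{10})\s?(.*)$")


def merge_by_id(sources):
    """Merge several lists of lines into tab-joined rows keyed by 10-digit ID."""
    rows = {}
    for position, lines in enumerate(sources):
        for line in lines:
            match = ID_LINE.match(line.rstrip("\n"))
            if not match:
                continue  # ignore lines that do not start with a 10-digit ID
            key, text = match.groups()
            if key in rows:
                rows[key].append(text)
            elif position == 0:  # assumption: only IDs from the first source count
                rows[key] = [text]
    return ["\t".join(rows[key]) for key in sorted(rows)]


file1 = ["0000001435 XYZ 与 ABC\n", "0000001438warlaugh 世界\n"]
file2 = ["0000001435 XYZ with abc\n", "0000001436 DFC whatever\n",
         "0000001438 world of warlaugh\n"]
print(merge_by_id([file1, file2]))
```

Because the rows are keyed by ID and sorted at the end, the sketch accepts any number of sources, not just two.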

Answer 1: 3 votes, accepted, 2015-02-02 19:48:59

Answer 2: 0 votes, 2015-02-04 22:20:24
