How to extract lines from two text files linked by the heading number in the first 10 characters?

I have two files:

file1.txt:

0000001435 XYZ 与 ABC
0000001438warlaugh 世界

file2.txt:

0000001435 XYZ with abc
0000001436 DFC whatever
0000001437 FBFBBBF
0000001438 world of warlaugh

The lines in the two files are linked by the number in the first 10 characters. The desired output is a tab-separated file containing the lines that exist in file1.txt together with the corresponding lines from file2.txt:

file3.txt:

XYZ 与 ABC	XYZ with abc
warlaugh 世界	world of warlaugh

How do I look up the corresponding lines in file2.txt for each line that exists in file1.txt and write them as a tab-separated file, file3.txt?

Note that only the first 10 characters constitute the ID. There are cases like 0000001438warlaugh 世界 or even 0000001432231hahaha lol, where only 0000001438 and 0000001432 are the IDs.
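A minimal sketch of that rule, assuming the ID is always exactly the first 10 characters and that whitespace around the remainder can be discarded. `split_id` is a hypothetical helper name used only for illustration:

```python
def split_id(line, width=10):
    """Split a line into (ID, text): the first `width` characters and the stripped rest."""
    return line[:width], line[width:].strip()


# The ID is cut by position, so digits or letters right after it stay in the text.
print(split_id(u"0000001438warlaugh 世界"))
print(split_id(u"0000001432231hahaha lol"))
```

Slicing by position rather than splitting on whitespace is what keeps a line such as 0000001432231hahaha lol from being read as having the ID 0000001432231.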

I tried with Python, getfile3.py:

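import io

# Map each 10-character ID to the rest of its line.
with io.open('file1.txt', 'r', encoding='utf8') as fin:
    f1 = {line[:10]: line[10:].strip() for line in fin}
with io.open('file2.txt', 'r', encoding='utf8') as fin:
    f2 = {line[:10]: line[10:].strip() for line in fin}

# Write one tab-separated line per ID found in file1.txt.
with io.open('file3.txt', 'w', encoding='utf8') as f3:
    for i in f1:
        f3.write(u"{}\t{}\n".format(f1[i], f2[i]))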

But is there a bash/awk/grep/perl command-line way to get file3.txt?

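awk '
# Split every line into the 10-character key and the remaining text, trimmed.
{ key = substr($0,1,10); data = substr($0,11); gsub(/^[ \t]+|[ \t]+$/, "", data) }
# First file: remember the text for each key.
NR==FNR { file1[key] = data; next }
# Second file: print tab-separated pairs for keys seen in the first file.
key in file1 { print file1[key] "\t" data }
' file1.txt file2.txt > file3.txt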

You could use FIELDWIDTHS with GNU awk rather than substr() if you prefer.
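For readers more comfortable with Python, the two-pass idea of the awk script (load the first file into a map keyed by ID, then stream the second file) can be sketched as follows. This is an illustration under assumptions, not a replacement for the awk answer: `join_by_id` is a hypothetical name, the inputs are in-memory lists standing in for the two files, and the sketch trims whitespace and joins with a tab.

```python
def join_by_id(lines1, lines2, width=10):
    """Yield 'text1<TAB>text2' for every line of lines2 whose ID also occurs in lines1."""
    # First pass: remember the text that follows each ID in the first input.
    first = {line[:width]: line[width:].strip() for line in lines1}
    # Second pass: stream the second input and emit only the matching IDs.
    for line in lines2:
        key, data = line[:width], line[width:].strip()
        if key in first:
            yield first[key] + "\t" + data


# Sample data copied from the question, standing in for file1.txt and file2.txt.
file1 = [u"0000001435 XYZ 与 ABC\n", u"0000001438warlaugh 世界\n"]
file2 = [u"0000001435 XYZ with abc\n", u"0000001436 DFC whatever\n",
         u"0000001437 FBFBBBF\n", u"0000001438 world of warlaugh\n"]
for row in join_by_id(file1, file2):
    print(row)
```

As in the awk version, rows come out in the order of the second input, and IDs that appear only in the second input are skipped.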

Super long Perl answer:

use warnings;
use strict;

# add files here as needed
my @input_files = qw(file1.txt file2.txt);
my $output_file = 'output.txt';

# don't touch anything below this line
my @output_lines = parse_files(@input_files);

open (my $output_fh, ">", $output_file) or die;
foreach (@output_lines) {
    print $output_fh "$_\n";                    #print to output file
    print "$_\n";                               #print to console
}
close $output_fh;

sub parse_files {
    my @input_files = @_;                       #list of text files to read.
    my %data;                                   #will store $data{$index} = datum1 datum2 datum3

    foreach my $file (@input_files) {           
        open (my $fh, "<", $file) or die;       
        while (<$fh>) { 
            chomp;                              
            if (/^(\d{10})\s?(.*)$/) {
                my $index = $1;
                my $datum = $2;
                if (exists $data{$index}) {
                    $data{$index} .= "\t$datum";
                } else {
                    $data{$index} = $datum;
                } #/else
            } #/if regex found
        } #/while reading current file
        close $fh;
    } #/foreach file

    # Create output array
    my @output_lines;
    foreach my $key (sort keys %data) {
        push (@output_lines, "$data{$key}");
    } #/foreach

    return @output_lines;
} #/sub parse_files
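
For comparison, here is a compact Python sketch of the same merge-by-ID idea for any number of inputs. It is not a translation of the Perl script: it keeps only IDs that occur in every input, which is an assumption based on the desired file3.txt, and it assumes each ID occurs at most once per input. `merge_by_id` and `ID_LINE` are hypothetical names.

```python
import re

# Same pattern as the Perl script: 10 digits, an optional space, then the text.
ID_LINE = re.compile(r"^(\d{10})\s?(.*)$")


def merge_by_id(*inputs):
    """Merge line iterables by 10-digit ID; return tab-joined rows sorted by ID."""
    merged = {}
    for lines in inputs:
        for line in lines:
            match = ID_LINE.match(line.rstrip("\n"))
            if match:
                merged.setdefault(match.group(1), []).append(match.group(2))
    # Keep only IDs seen in every input (assumes one occurrence per input).
    return ["\t".join(parts) for key, parts in sorted(merged.items())
            if len(parts) == len(inputs)]
```

Calling `merge_by_id(open_file1, open_file2)` with two open file objects (hypothetical variable names) would return the rows to write to the output file.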
