
Extracting data from a dictionary

I have two tab-delimited files: file 1 contains identifiers, and file 2 has the values related to these identifiers (think of it as a very big dictionary).

file 1

Ronny
Rubby
Suzie
Paul

File 1 has only one column.

file 2

Alistar Barm Cathy Paul Ronny Rubby Suzie Tom Uma Vai Zai
12      13    14   12     11   11   12    23 30  0.34 0.65
1       4     56   23     12   8.9  5.1   1  4    25  3

File 2 contains n rows.

What I want: if the identifiers in file 1 are present in file 2, all the values related to them should be written to another tab-delimited file.

Something like this:

Paul Ronny Rubby Suzie
12     11   11   12
23     12   8.9  5.1

Thank you in advance.

NOTE

Your example output is NOT correct: it contains "Ruby", but your file 1 example has "Rubby", and Ruby != Rubby.

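kent$  awk 'NR==FNR{t[$0]++;next}   # first file: remember every identifier
{if(FNR==1){
        # header row of file2: mark the wanted column numbers and print their names
        for(i=1;i<=NF;i++)
                if($i in t){
                        v[i]++;
                        printf "%s\t", $i;
                }
        print "";
        }else{
        # data rows: walk the columns in order (for-in order is unspecified in awk)
        for(i=1;i<=NF;i++)
                if(i in v)
                        printf "%s\t", $i
        print "";
}

}' file1 file2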

output

Paul    Ronny   Suzie
12      11      12
23      12      5.1
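
The mismatch pointed out in the note above (Ruby vs. Rubby) can be caught before any extraction by listing the identifiers that have no matching column. Here is a minimal Python sketch, assuming tab-delimited input with the names in the first row of file 2. The function name `missing_identifiers` and the in-memory sample data are made up for illustration:

```python
import io


def missing_identifiers(keys_file, data_file):
    """Return the identifiers in keys_file that are absent from data_file's header row."""
    # One identifier per line; blank lines are ignored.
    keys = [line.strip() for line in keys_file if line.strip()]
    # Only the first line of the data file is read, so this stays cheap for big files.
    header = data_file.readline().rstrip("\n").split("\t")
    present = set(header)
    return [k for k in keys if k not in present]


# Hypothetical in-memory stand-ins for file1 and file2 (tab-delimited).
file1 = io.StringIO("Ronny\nRubby\nSuzie\nPaul\n")
file2 = io.StringIO("Paul\tRonny\tRuby\tSuzie\n12\t11\t11\t12\n")

print(missing_identifiers(file1, file2))  # ['Rubby']
```

With real files, the two `io.StringIO` objects would be replaced by `open('file1')` and `open('file2')`.
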
$ awk 'FILENAME~1{a[$0];next};FNR==1{for(i=1;i<=NF;i++)if($i in a)b[i]};{for(j in b)printf("%s\t",$j);print ""}' file{1,2}.txt
Paul    Ronny   Suzie
12      11      12
23      12      5.1

Broken into multiple lines, with whitespace added:

$ awk '
> FILENAME~1 { a[$0]; next }
> FNR==1 { for(i=1; i<=NF; i++) if($i in a) b[i] }
> { for(j in b) printf("%s\t",$j); print ""}
> ' file{1,2}.txt

Paul    Ronny   Suzie
12      11      12
23      12      5.1

You can do this with only bash and standard command-line tools:

FIELDS=$(head -1 f2.txt | tr "\t" "\n" | nl -ba | grep -Fwf f1.txt | cut -f1 | tr -d " " | tr "\n" ","); FIELDS=${FIELDS/%,/}
cut -f"$FIELDS" f2.txt
Paul    Ronny   Ruby    Suzie
12  11  11  12
23  12  8.9 5.1

An example in Python that works as a stream (i.e., it does not need to load the full data file before starting the output):

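# read keys
with open('file1', 'r') as fd:
    keys = fd.read().splitlines()

# read data file, with header line and content
with open('file2', 'r') as fd:
    headers = fd.readline().split()
    # keep only the keys found in the header, paired with their column index
    wanted = [(key, headers.index(key)) for key in keys if key in headers]

    # output keys
    print('\t'.join(key for key, _ in wanted))

    # stream the remaining lines one at a time
    for raw in fd:
        line = raw.split()
        if not line:
            continue  # skip blank lines
        print('\t'.join(line[index] for _, index in wanted))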

Output:

$ python test.py 
Ronny   Ruby    Suzie   Paul
11      11      12      12
12      8.9     5.1     23

Perl solution:

#!/usr/bin/perl
use warnings;
use strict;

open my $KEYS, '<', 'file1' or die $!;
my @keys = <$KEYS>;
close $KEYS;
chomp @keys;
my %is_key;
undef @is_key{@keys};

open my $TAB, '<', 'file2' or die $!;
$_ = <$TAB>;
my $i = 0;    # column index; start at 0 so the first column is not pushed as undef
my @columns;
for (split) {
    push @columns, $i if exists $is_key{$_};
    $i++;
}
do {{
    my @values = split;
    print join("\t", @values[@columns]), "\n";
}} while <$TAB>;

Something like this could probably work, depending on what you want.

use strict;
use warnings;

# Assumption: both paths come from the command line (the original left them undefined).
my ( $name_file_path, $data_file_path ) = @ARGV;

my %names;
open ( my $nh, '<', $name_file_path ) or die "Could not open '$name_file_path'!";
while ( <$nh> ) { 
    # store each identifier with surrounding whitespace trimmed
    m/^\s*(.*?\S)\s*$/ and $names{ $1 } = 1; 
}
close $nh;

open ( my $dh, '<', $data_file_path ) or die "Could not open '$data_file_path'!";

my ( @name_list, @col_list );
chomp( my $header = <$dh> );
my @names = split /\t/, $header;
foreach my $col ( 0..$#names ) {
    next unless exists $names{ $names[ $col ] };
    push @name_list, $names[ $col ];    # matching column name
    push @col_list, $col;               # and its index
}
local $" = "\t";
print "@name_list\n";
while ( <$dh> ) {
    chomp;
    print "@{[ split /\t/ ]}[ @col_list ]\n";
}
close $dh;

This might work for you:

 sed '1{s/\t/\n/gp};d' file2 |
 nl |
 grep -f file1 |
 cut -f1 |
 paste -sd, |
 sed 's/ //g;s,.*,cut -f& file2,' |
 sh

Explanation:

  1. Pivot the column names so that there is one name per line.
  2. Number the column names.
  3. Match the column names against the identifiers in file1.
  4. Drop the column names, keeping only the column numbers.
  5. Join the column numbers into a single comma-separated line.
  6. Build a cut command from the comma-separated list of column numbers.
  7. Run the cut command against the data file.
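
The steps above can also be sketched in Python. This is an illustration under stated assumptions, not a translation of the pipeline: the helper names `field_spec` and `cut_fields` are hypothetical, the sample data is made up, and names are compared exactly, whereas `grep -f` also accepts substring matches.

```python
def field_spec(keys, header_line):
    """Steps 1-5: return the wanted 1-based column numbers as a comma-separated string."""
    wanted = set(keys)
    names = header_line.rstrip("\n").split("\t")   # step 1: one column name per item
    numbered = enumerate(names, start=1)           # step 2: number them, as nl does
    # steps 3-4: keep the number of every column whose name is a wanted identifier
    hits = [str(number) for number, name in numbered if name in wanted]
    return ",".join(hits)                          # step 5: join with commas


def cut_fields(lines, spec):
    """Steps 6-7: emulate `cut -f SPEC` on tab-delimited lines (no subprocess is run)."""
    indexes = [int(number) - 1 for number in spec.split(",")] if spec else []
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        yield "\t".join(cells[i] for i in indexes if i < len(cells))


# Hypothetical sample data standing in for file1 and file2.
keys = ["Ronny", "Suzie", "Paul"]
lines = [
    "Alistar\tPaul\tRonny\tSuzie\n",
    "12\t12\t11\t12\n",
    "1\t23\t12\t5.1\n",
]

spec = field_spec(keys, lines[0])
print(spec)  # 2,3,4
for row in cut_fields(lines, spec):
    print(row)
```

As with `cut`, the columns come out in the order they appear in the data file, not in the order of the identifier list.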
