
Use Bash scripting to select columns and rows with specific name

I'm working with a very large text file (4GB) and I want to make a smaller file containing only the data I need. It is a tab-delimited file with row and column headers. I basically want to select the subset of the data that has a given column and/or row name.

     colname_1    colname_2    colname_3    colname_4
row_1    1            2             3            5
row_2    4            6             9            1
row_3    2            3             4            2

I'm planning to have a file with a list of the columns I want.

colname_1    colname_3

I'm a newbie to bash scripting and I really don't know how to do this. I saw other examples, but they all knew in advance which column numbers they wanted, and I don't. Sorry if this is a repeat question; I tried to search.

I would want the result to be

     colname_1     colname_3
row_1    1             3
row_2    4             9
row_3    2             4 

You can do this by keeping track of the array indexes of the columns in the data file whose names match the names in your column list file. Once you have found those indexes, you simply read the data file (beginning at the second line) and output the row label plus the data found at each recorded index.

There are probably several ways to approach this; the following assumes that the data in each column does not contain any whitespace. The use of arrays presumes bash (or another advanced shell that supports arrays), not POSIX shell.

The script takes two file names as input. The first is your original data file. The second is your column list file. An approach could be:

#!/bin/bash

declare -a cols  ## array holding original columns from original data file
declare -a csel  ## array holding columns to select (from file 2)
declare -a cpos  ## array holding array indexes of matching columns

cols=( $(head -n 1 "$1") )  ## fill cols from 1st line of data file
csel=( $(< "$2") )          ## read select columns from file 2

## fill column position array
for ((i = 0; i < ${#csel[@]}; i++)); do
    for ((j = 0; j < ${#cols[@]}; j++)); do
        [ "${csel[i]}" = "${cols[j]}" ] && cpos+=( $j )
    done
done

printf " " 
for ((i = 0; i < ${#csel[@]}; i++)); do   ## output header row
    printf "    %s" "${csel[i]}"
done

printf "\n"     ## output newline
unset cols      ## unset cols to reuse in reading lines below

while read -r line; do        ## read each data line in data file 
    cols=( $line )            ## separate into cols array
    printf "%s" "${cols[0]}"  ## output row label
    for ((j = 0; j < ${#cpos[@]}; j++)); do
        [ "$j" -eq "0" ] && { ## handle format for first column
            printf "%5s" "${cols[$((${cpos[j]}+1))]}"
            continue
        }                     ## output remaining columns
        printf "%13s" "${cols[$((${cpos[j]}+1))]}"
    done
    printf "\n"
done < <( tail -n+2 "$1" )

Using your example data as follows:

Data File

$ cat dat/col+data.txt
     colname_1    colname_2    colname_3    colname_4
row_1    1            2             3            5
row_2    4            6             9            1
row_3    2            3             4            2

Column Select File

$ cat dat/col.txt
colname_1    colname_3

Example Use/Output

$ bash colnum.sh dat/col+data.txt dat/col.txt
     colname_1    colname_3
row_1    1            3
row_2    4            9
row_3    2            4

Give it a try and let me know if you have any questions. Note that bash isn't known for blinding speed when handling large files, but as long as the column list isn't horrendously long, the script should be reasonably fast.
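For a file of this size, the same index-matching idea can also be sketched in awk, which avoids the per-line bash loop. This is a swapped-in technique rather than the script above: the function name select_with_awk and the sample files are hypothetical, and the sketch assumes the data file is tab-separated with an empty first header field above the row labels.

```shell
#!/bin/bash
# select_with_awk DATA_FILE COLUMN_FILE  (hypothetical helper)
# Assumes DATA_FILE is tab-separated and its header line starts with an
# empty field above the row labels. COLUMN_FILE may list names separated
# by spaces, tabs or newlines.
select_with_awk() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                        # first file: wanted column names
            m = split($0, name, /[ \t]+/)
            for (i = 1; i <= m; i++)
                if (name[i] != "") want[name[i]] = 1
            next
        }
        FNR == 1 {                         # header line of the data file
            n = 0
            for (i = 2; i <= NF; i++)
                if ($i in want) keep[++n] = i
        }
        {                                  # every data-file line, header included
            out = $1
            for (k = 1; k <= n; k++)
                out = out OFS $(keep[k])
            print out
        }
    ' "$2" "$1"
}

# Tiny stand-in for the real data, written with literal tab characters.
demo=$(mktemp -d)
printf '\tcolname_1\tcolname_2\tcolname_3\tcolname_4\n' >  "$demo/data.tsv"
printf 'row_1\t1\t2\t3\t5\n' >> "$demo/data.tsv"
printf 'row_2\t4\t6\t9\t1\n' >> "$demo/data.tsv"
printf 'row_3\t2\t3\t4\t2\n' >> "$demo/data.tsv"
printf 'colname_1    colname_3\n' > "$demo/col.txt"

select_with_awk "$demo/data.tsv" "$demo/col.txt"
```

The column file is read first so that the header line of the data file can be checked against it in a single pass; an empty column file is not handled by this sketch.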

Bash works best as "glue" between standard command-line utilities. You can write loops that read each line of a massive file, but they are painfully slow because bash is not optimized for speed. So let's see how to use a few standard utilities (grep, tr, cut and paste) to achieve this goal.

For simplicity, let's put the desired column headings into a file, one per line. (You can always convert a tab-separated line of column headings to this format; we're going to do just that with the data file's column headings. But one thing at a time.)

$ printf '%s\n' colname_{1,3} > columns
$ cat columns
colname_1
colname_3

An important feature of the printf command-line utility is that it repeats its format until it runs out of arguments.

Now, we want to know which column in the data file each of these column headings corresponds to. We could try to write this as a loop in awk or even in bash, but if we convert the header line of the data file into a file with one header per line, we can use grep with the -n option (which prefixes each output line with the line number of the match) to tell us.

Since the column headers are tab-separated, we can turn them into separate lines just by converting tabs to newlines using tr:
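A minimal sketch of that repetition, using the column names from the example; the two-placeholder format and the numbers fed to it are only illustrative.

```shell
# printf reuses its format string until every argument has been consumed.
printf '%s\n' colname_1 colname_3

# A format with two placeholders consumes the arguments in pairs.
printf '%s -> %s\n' colname_1 2 colname_3 4
```

The first command prints one name per line, which is exactly the layout wanted for the columns file.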

$ head -n1 giga.dat | tr '\t' '\n'

colname_1
colname_2
colname_3
colname_4

Note the blank line at the beginning. That's important, because colname_1 actually corresponds to column 2, since the row headers are in column 1.

So let's look up the column names. Here, we will use several grep options:
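To make the offset visible, the lines can be numbered. This is a small sketch: number_headers is a hypothetical helper name, and the two-column header line stands in for the first line of giga.dat.

```shell
# number_headers: read one tab-separated header line on stdin and print
# each field on its own line, prefixed with its field number.
number_headers() {
    tr '\t' '\n' | awk '{ print NR ": " $0 }'
}

# The leading tab is the empty field that sits above the row labels,
# so it becomes line 1 and colname_1 becomes line 2.
printf '\tcolname_1\tcolname_2\n' | number_headers
```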

  • -F The pattern argument consists of several patterns, one per line, which are interpreted as ordinary strings instead of regexes.
  • -x The pattern must match the complete line.
  • -n The output should be prefixed by the line number of the match.

If we have GNU grep, we could also use -f columns to read the patterns from the file named columns. Or, if we're using bash, we could use the bashism "$(<columns)" to insert the contents of the file as a single argument to grep. But for now, we'll stay POSIX compliant:
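A short sketch of why -x matters. The header colname_10 is a made-up addition to the example, included only because it contains colname_1 as a substring.

```shell
# Header names one per line, starting with the empty row-label field.
# colname_10 is hypothetical and not part of the original data.
headers=$(printf '%s\n' '' colname_1 colname_2 colname_3 colname_10)
wanted=$(printf '%s\n' colname_1 colname_3)

# Without -x, colname_1 also matches inside colname_10 (line 5).
printf '%s\n' "$headers" | grep -Fn "$wanted"

# With -x, only whole-line matches are reported (lines 2 and 4).
printf '%s\n' "$headers" | grep -Fxn "$wanted"
```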

$ head -n1 giga.dat | tr '\t' '\n' | grep -Fxn "$(cat columns)"
2:colname_1
4:colname_3

OK, that's pretty close. We just need to get rid of everything other than the line number, comma-separate the numbers, and put a 1 at the beginning.

$ { echo 1
>   grep -Fxn "$(<columns)" < <(head -n1 giga.dat | tr '\t' '\n')
> } | cut -f1 -d: | paste -sd,
1,2,4

  • cut -f1 Select field 1. The argument could be a comma-separated list, as in cut -f1,2,4.
  • cut -d: Use : instead of tab as a field separator ("delimiter").
  • paste -s Concatenate the lines of a single file instead of corresponding lines of several files.
  • paste -d, Use a comma instead of tab as a field separator.
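The two steps can be tried in isolation on the grep output shown earlier. In this sketch, to_field_list is a hypothetical helper name, and the explicit - operand tells paste to read standard input, which some implementations require.

```shell
# to_field_list: keep the text before the first colon on each line,
# then join all lines with commas.
to_field_list() {
    cut -f1 -d: | paste -sd, -
}

# The bare "1" contains no colon; cut passes such lines through unchanged,
# which is why the leading "echo 1" survives the pipeline.
printf '%s\n' 1 2:colname_1 4:colname_3 | to_field_list
```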

So now we have the argument we need to pass to cut in order to select the desired columns:

$ cut -f"$({ echo 1
>   head -n1 giga.dat | tr '\t' '\n' | grep -Fxn -f columns 
> } | cut -f1 -d: | paste -sd,)" giga.dat
        colname_1       colname_3
row_1   1       3
row_2   4       9
row_3   2       4
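The pipeline above could be wrapped in a small function so that the file names are not hard-coded. This is only a sketch: select_columns and the temporary sample files are hypothetical, and it assumes a tab-separated data file and a column file with one name per line.

```shell
#!/bin/bash
# select_columns DATA_FILE COLUMN_FILE  (hypothetical helper)
# Prints the row-label column plus every column whose header appears,
# one name per line, in COLUMN_FILE.
select_columns() {
    local data=$1 wanted=$2 fields
    fields=$(
        {
            echo 1                         # always keep the row labels
            head -n1 "$data" | tr '\t' '\n' | grep -Fxn -f "$wanted"
        } | cut -f1 -d: | paste -sd, -
    )
    cut -f"$fields" "$data"
}

# Small stand-in for giga.dat, built with real tab characters.
tmp=$(mktemp -d)
printf '\tcolname_1\tcolname_2\tcolname_3\tcolname_4\n' >  "$tmp/data.tsv"
printf 'row_1\t1\t2\t3\t5\n' >> "$tmp/data.tsv"
printf 'row_2\t4\t6\t9\t1\n' >> "$tmp/data.tsv"
printf 'row_3\t2\t3\t4\t2\n' >> "$tmp/data.tsv"
printf '%s\n' colname_1 colname_3 > "$tmp/columns"

select_columns "$tmp/data.tsv" "$tmp/columns"
```

Note that cut emits fields in the order they appear in the data file, so the order of the names in the column file does not affect the output.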
