Can we use AWK and gsub() to process data with multiple colons ":"? How?

Here is an example of the data:

Col_01:14 .... Col_20:25    Col_21:23432    Col_22:639142
Col_01:8  .... Col_20:25    Col_22:25134    Col_23:243344
Col_01:17 .... Col_21:75    Col_23:79876    Col_25:634534    Col_22:5    Col_24:73453
Col_01:19 .... Col_20:25    Col_21:32425    Col_23:989423
Col_01:12 .... Col_20:25    Col_21:23424    Col_22:342421    Col_23:7    Col_24:13424    Col_25:67
Col_01:3  .... Col_20:95    Col_21:32121    Col_25:111231

As you can see, some of these columns are not in the correct order...

Now, I think the correct way to import this file into a dataframe is to preprocess the data so that you can output a dataframe with NaN values, e.g.

Col_01 .... Col_20    Col_21    Col_22    Col_23    Col_24    Col_25
14     .... 25        23432     639142    NaN       NaN       NaN
8      .... 25        NaN       25134     243344    NaN       NaN
17     .... NaN       75        5         79876     73453     634534
19     .... 25        32425     NaN       989423    NaN       NaN
12     .... 25        23424     342421    7         13424     67
3      .... 95        32121     NaN       NaN       NaN       111231

The solution was shown by @JamesBrown here: How to preprocess and load a "big data" tsv file into a python dataframe?

Using said awk script:

BEGIN {
    PROCINFO["sorted_in"]="@ind_str_asc" # traversal order for for(i in a)                  
}
NR==1 {       # the header cols is in the beginning of data file
              # FORGET THIS: header cols from another file replace NR==1 with NR==FNR and see * below
    split($0,a," ")                  # mkheader a[1]=first_col ...
    for(i in a) {                    # replace with a[first_col]="" ...
        a[a[i]]
        printf "%6s%s", a[i], OFS    # output the header
        delete a[i]                  # remove a[1], a[2], ...
    }
    # next                           # FORGET THIS * next here if cols from another file UNTESTED
}
{
    gsub(/: /,"=")                   # replace key-value separator ": " with "="
    split($0,b,FS)                   # split record on FS (whitespace by default)
    for(i in b) {
        split(b[i],c,"=")            # split key=value to c[1]=key, c[2]=value
        b[c[1]]=c[2]                 # b[key]=value
    }
    for(i in a)                      # go thru headers in a[] and printf from b[]
        printf "%6s%s", (i in b?b[i]:"NaN"), OFS; print ""
}

And put the headers into a text file cols.txt

Col_01 Col_20 Col_21 Col_22 Col_23 Col_25

My question now: how do we use awk if the data is not column: value but column: value1: value2: value3?

We would want the database entry to be value1: value2: value3

Here's the new data:

Col_01:14:a:47 .... Col_20:25:i:z    Col_21:23432:6:b    Col_22:639142:4:x
Col_01:8:z .... Col_20:25:i:4    Col_22:25134:u:0    Col_23:243344:5:6
Col_01:17:7:z .... Col_21:75:u:q    Col_23:79876:u:0    Col_25:634534:8:1   

We still provide the columns beforehand with cols.txt

How can we create a similar database structure? Is it possible to use gsub() so that it only affects the first ":", the one that follows the column name matching the header?

EDIT: This doesn't have to be awk based. Any language will do, naturally.

Here is another alternative...
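
Since any language is acceptable, here is a minimal Python sketch of the idea of splitting on the first ":" only, so that everything after the column name stays together as one value. The function names parse_line and to_table are made up for illustration, the column list is typed in rather than read from cols.txt, and the input is assumed to be whitespace-separated name:value tokens.

```python
def parse_line(line):
    """Map column name -> value, splitting each token on its first ':' only."""
    row = {}
    for token in line.split():
        name, sep, value = token.partition(":")
        if sep:                      # ignore tokens with no colon, e.g. the "...." filler
            row[name] = value
    return row


def to_table(lines, columns, missing="NaN"):
    """One list per input line, ordered by `columns`, `missing` where a key is absent."""
    return [[row.get(col, missing) for col in columns]
            for row in map(parse_line, lines)]


# Assumption: same names as cols.txt, written inline to keep the sketch self-contained.
columns = "Col_01 Col_20 Col_21 Col_22 Col_23 Col_25".split()
data = """\
Col_01:14:a:47 Col_20:25:i:z Col_21:23432:6:b Col_22:639142:4:x
Col_01:8:z Col_20:25:i:4 Col_22:25134:u:0 Col_23:243344:5:6
Col_01:17:7:z Col_21:75:u:q Col_23:79876:u:0 Col_25:634534:8:1
"""

table = to_table(data.splitlines(), columns)
print("\t".join(columns))
for cells in table:
    print("\t".join(cells))
```

The tab-separated output could then be loaded with something like pandas.read_csv(..., sep="\t"), which should treat the literal NaN as a missing value by default.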

$ awk -v OFS='\t' '{for(i=1;i<NF;i+=2)                  # iterate over name: value pairs
                     {c=$i;                             # copy name in c to modify
                      sub(/:/,"",c);                    # remove colon
                      a[NR,c]=$(i+1);                   # collect data by row number, name
                      cols[c]}}                         # save name
                END{n=asorti(cols,icols);               # sort names
                    for(j=1;j<=n;j++) printf "%s", icols[j] OFS;   # print header 
                    print ""; 
                    for(i=1;i<=NR;i++)                  # print data
                      {for(j=1;j<=n;j++) 
                         {k=i SUBSEP icols[j];          # same key that a[i,icols[j]] uses
                          printf "%s", ((k in a)?a[k]:"NaN") OFS} # NaN only when the cell is absent, so a literal 0 survives
                       print ""}}' file | column -t     # pipe to column for pretty print

Col_01   Col_20  Col_21     Col_22      Col_23      Col_25
14:a:47  25:i:z  23432:6:b  639142:4:x  NaN         NaN
8:z      25:i:4  NaN        25134:u:0   243344:5:6  NaN
17:7:z   NaN     75:u:q     NaN         79876:u:0   634534:8:1

I had karakfa's answer as well. If the column name is not separated by whitespace from the value (e.g. if you have Col_01:14:a:47) then you can do this (using GNU awk for the extended match function)

  {
      for (i=1; i<=NF; i++) {
          match($i, /^([^:]+):(.*)/, m)
          a[NR,m[1]] = m[2]
          cols[m[1]]
      }
  }

The END block is the same as in the previous answer.

Using TXR's Lisp macro implementation of the Awk paradigm:
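
For readers who would rather do this step outside awk, the same capture-group split can be sketched in Python with the re module. This is only an illustration: the collect function and the pattern name PAIR are hypothetical, and the header is discovered from the data and sorted, much as asorti does, rather than read from cols.txt.

```python
import re

# Group 1: the name up to the first colon. Group 2: everything after it.
PAIR = re.compile(r"([^:]+):(.*)")


def collect(lines):
    """Return (rows, header): one dict per line plus the sorted set of names seen."""
    rows, names = [], set()
    for line in lines:
        row = {}
        for token in line.split():
            m = PAIR.fullmatch(token)
            if m:                          # tokens without a colon are skipped
                row[m.group(1)] = m.group(2)
        names.update(row)
        rows.append(row)
    return rows, sorted(names)


rows, header = collect([
    "Col_01:14:a:47 Col_20:25:i:z",
    "Col_21:75:u:q Col_01:0",
])
print("\t".join(header))
for row in rows:
    # dict.get checks for presence, so a value of "0" is not mistaken for missing
    print("\t".join(row.get(name, "NaN") for name in header))
```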

(awk (:set ft #/-?\d+/)  ;; ft is "field tokenize" (no counterpart in Awk)
     (:let (tab (hash :equal-based)) (max-col 1) (width 8))
     ((ff (mapcar toint) (tuples 2))  ;; filter fields to int and shore up into pairs
      (set max-col (max max-col [find-max [mapcar first f]]))
      (mapdo (ado set [tab ^(,nr ,@1)] @2) f)) ;; stuff data into table
     (:end (let ((headings (mapcar (opip (format nil "Col~,02a")
                                         `@{@1 width}`)
                                   (range 1 max-col))))
             (put-line `@{headings " "}`))
           (each ((row (range 1 nr)))
             (let ((cols (mapcar (opip (or [tab ^(,row ,@1)] "NaN")
                                       `@{@1 width}`)
                                 (range 1 max-col))))
               (put-line `@{cols " "}`)))))

Smaller sample data:

Col_01: 14  Col_04: 25    Col_06: 23432    Col_07: 639142
Col_02: 8   Col_03: 25    Col_05: 25134    Col_06: 243344
Col_01: 17
Col_06: 19  Col_07: 32425

Run:

$ txr reformat.tl data-small
Col01    Col02    Col03    Col04    Col05    Col06    Col07
14       NaN      NaN      25       NaN      23432    639142
NaN      8        25       NaN      25134    243344   NaN
17       NaN      NaN      NaN      NaN      NaN      NaN
NaN      NaN      NaN      NaN      NaN      19       32425

PS: opip is a macro which bootstraps from the op macro for partial function application; opip implicitly distributes op into its argument expressions, and then chains the resulting functions together into a functional pipeline: hence "op-pipe". In each pipeline element, its own numbered implicit arguments can be referenced: @1, @2, ... If they are absent, then the partially applied function implicitly receives the piped object as its rightmost argument.
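
As a rough analogy only (this is Python, not TXR Lisp, and it does not reproduce the implicit @1, @2 arguments), a left-to-right pipeline of single-argument functions might look like the sketch below. The helper name pipe is made up for illustration; the two stages mirror the heading formatter in the code above: build the column name, then pad it to a fixed width.

```python
from functools import reduce


def pipe(*funcs):
    """Compose functions left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


width = 8
cell = pipe(lambda n: f"Col{n:02d}",      # 1 -> "Col01"
            lambda s: s.ljust(width))     # pad on the right to 8 characters

headings = [cell(n) for n in range(1, 4)]
print(" ".join(headings))
```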

The ^(,row ,@1) syntax is TXR Lisp's backquote. The backtick that mainstream Lisp dialects use for backquote is already employed for string quasiquotes. This is equivalent to (list row @1): make a list consisting of the value of row and of the implicit, op/do-generated function argument @1. Lists of two elements are being used as the hash keys, which simulates a 2D array. For that, the hash must be :equal-based. The lists (1 2) (1 2) are not eql if they are separate instances rather than one and the same object; they compare equal under the equal function.

Just for fun, some incomprehensible perl
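
The same distinction between identity and structural equality exists in Python, which may make the point easier to see. In this sketch (Python, not TXR Lisp; the variable names are illustrative) a dict keyed by (row, column) tuples plays the role of the :equal-based hash, filled with the first two rows of the smaller sample data.

```python
a = [1, 2]
b = [1, 2]
print(a == b, a is b)   # True False: equal contents, but two separate objects

# (row, column) -> value; tuples are hashed and compared by content,
# so any freshly built (1, 4) finds the entry stored under (1, 4).
table = {(1, 1): "14", (1, 4): "25", (2, 2): "8", (2, 3): "25"}

grid = [[table.get((row, col), "NaN") for col in range(1, 5)]
        for row in range(1, 3)]
for cells in grid:
    print(" ".join(f"{c:<8}" for c in cells))
```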

perl -aE'%l=%{{@F}};while(($k,$v)=each%l){$c{$k}=1;$a[$.]{$k}=$v}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$a[$i]{$_}//"NaN"}@c}}' input

(community wiki to hide my shame ...)

Golfed a few chars:

perl -aE'while(@F){$c{$k=shift@F}=1;$data[$.]{$k}=shift@F}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$data[$i]{$_}//"NaN"}@c}}' input
