
Bash scripting: compiling specific CSV rows

I'm another bash scripting newbie (having just discovered it, it blew my mind! It's so exciting). What I want to do is have a script that compiles a LOT of .csv files into just one bigfile.csv, removing the headers and inserting my own header. I discovered the following solution:

awk 'FNR > 1' *.csv > bigfile.csv
sed -i 1i"Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..." bigfile.csv

Great! But when I try to use this file for analysis, I get errors because of bad lines. I had a look at it and, indeed, there are a few crazy entries in there.

Luckily, every row that I want from the original .csv files has the entry "MD" in the first column. So does anyone know how I can tell awk to only take the lines from the .csv files that have "MD" in their first cell?

EDIT: Thanks for your help guys, it worked like a charm! Unfortunately, there's still some weird data in there:

CParserError: Error tokenizing data. C error: Expected 51 fields in line 6589, saw 54

With a simple adjustment, is there a way to also take only lines with 51 fields?

I'm going to go out on a limb here and assume that the line you're adding with sed is actually the same header that you're stripping off.

If that's the case, I'd suggest you skip the sed line and just tell awk to strip the first line of every file except the first one.

Next, if you only want lines containing the text MD in the first field, you can test that with a simple regex.

awk -F, '
    FNR==1 && NR > 1 { next }  # skip the header on all but the first file
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv
  • The -F, option tells awk to split fields using a comma as the field separator.
  • NR is the total number of records processed so far, while FNR is the record number within the current file.
  • A condition with no commands assumes print as the command (printing the current line).

You can of course put this entire awk script on one line if you like. I split it out for easier reading.

If your outputfile.csv is in the same directory that your *.csv glob reads from, be aware that the output file is created by the shell, not by awk, and might also be processed as an input file. This is a particular concern if you plan to append to an existing file with >>.
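To see the three rules above act on real input, here is a minimal sketch on made-up data. The file names, the temporary directory, and the 3-field layout (instead of the 51 fields in the question) are assumptions chosen only to keep the example short.

```shell
# Hypothetical sample data: 3 fields per row instead of 51.
tmpdir=$(mktemp -d)
printf 'Ident - MD,Node,Value\nMD,1,10\nXX,2,20\nMD,3,30,extra\n' > "$tmpdir/a.csv"
printf 'Ident - MD,Node,Value\nMD,4,40\n' > "$tmpdir/b.csv"

# Same rules as the script above, with the expected field count lowered to 3.
result=$(awk -F, '
    FNR==1 && NR > 1 { next }  # skip the header on all but the first file
    NF != 3 { next }           # skip rows with the wrong field count
    $1 ~ /MD/                  # print rows whose first field contains MD
' "$tmpdir"/*.csv)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The row starting with XX and the row with an extra field are dropped, and the header of b.csv is skipped. Note that the header of a.csv is kept only because it happens to pass both tests itself: it has the right number of fields and its first field contains MD.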
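One way to sidestep the output-as-input problem is to keep the merged file out of reach of the glob, for example in a separate directory. The sketch below assumes two hypothetical directories (indir and outdir) and made-up two-column data; it is only an illustration of the idea, not the only possible layout.

```shell
indir=$(mktemp -d)    # hypothetical input directory
outdir=$(mktemp -d)   # hypothetical output directory, separate from the inputs
printf 'h1,h2\nMD,1\n' > "$indir/a.csv"
printf 'h1,h2\nMD,2\n' > "$indir/b.csv"

merge() {
    # The glob only ever sees the input directory, never the merged file.
    awk -F, 'FNR > 1 && $1 == "MD"' "$indir"/*.csv > "$outdir/bigfile.csv"
}

merge
merge   # running it again reads the same two inputs, nothing more
result=$(cat "$outdir/bigfile.csv")
printf '%s\n' "$result"
rm -rf "$indir" "$outdir"
```

Writing the output with an extension that the glob does not match would work as well, as long as the file is not renamed to .csv inside the input directory afterwards.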

UPDATE

As you've mentioned that the headers you're adding are different from the ones you strip off, you can still avoid a separate command like sed by changing the awk script to something like this:

awk -F, '
    BEGIN {
      print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."
    }
    FNR==1 { next }            # skip the header on all files
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv

Commands within awk's BEGIN block are executed before any input lines are processed, so if you print new headers there, they will appear at the beginning of your (redirected) output. (Note that there is a similar END block if you want to generate a footer, summary, etc. after all input has been processed.)
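As an illustration of the END block, the following sketch counts kept and rejected rows and reports the totals on standard error, so the summary does not end up inside the merged CSV. The data, the 2-field layout, the header text, and the counter names kept and skipped are all made up for the example; writing to "/dev/stderr" is assumed to work, as it does with the common awk implementations on Linux and macOS.

```shell
tmpdir=$(mktemp -d)
printf 'h1,h2\nMD,1\nXX,2\n' > "$tmpdir/a.csv"
printf 'h1,h2\nMD,3\nMD,4,extra\n' > "$tmpdir/b.csv"

# stdout (the merged data) goes to out.txt; stderr (the summary) is captured.
summary=$(awk -F, '
    BEGIN { print "Ident,Value" }                 # hypothetical replacement header
    FNR == 1 { next }                             # skip the header of every file
    NF != 2 || $1 !~ /MD/ { skipped++; next }     # count and drop bad rows
    { kept++; print }                             # count and print good rows
    END { printf "kept=%d skipped=%d\n", kept, skipped > "/dev/stderr" }
' "$tmpdir"/*.csv 2>&1 >"$tmpdir/out.txt")

merged=$(cat "$tmpdir/out.txt")
printf '%s\n%s\n' "$summary" "$merged"
rm -rf "$tmpdir"
```

A count like this makes it easier to judge how many rows the filter silently discards before trusting the merged file.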

awk 'BEGIN{print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1 {print}' *.csv > bigfile.csv

FNR resets for each file that awk processes, but NR doesn't, so NR == FNR holds only for the first file.
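The difference between the two counters is easy to see by printing both next to each line. This is a small sketch with two made-up files, f1 and f2, created in a temporary directory.

```shell
tmpdir=$(mktemp -d)
printf 'h\na\nb\n' > "$tmpdir/f1"
printf 'h\nc\n' > "$tmpdir/f2"

# Print the global record number, the per-file record number, and the line.
counters=$(awk '{ print NR, FNR, $0 }' "$tmpdir/f1" "$tmpdir/f2")
printf '%s\n' "$counters"
rm -rf "$tmpdir"
```

NR keeps climbing from 1 to 5 across both files, while FNR goes back to 1 on the first line of f2, which is why FNR > 1 skips every header and NR > 1 skips only the first.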


A small illustration (with my own test data, of course)

$ cat f1
Name,Roll
A,10
B,5
$ cat f2
Name,Roll
C,56
D,44
$ awk 'BEGIN{print "Naam,RollNo"}FNR > 1{print}' f*>final
$ cat final 
Naam,RollNo
A,10
B,5
C,56
D,44

Note

As you can see, the new header for the final file goes in awk's BEGIN section, which is executed only once, at the beginning.


Coming to your objective

Every row that I want from the original .csv files has the entry "MD" in the first column

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1 && $1 == "MD" && NF == 51 {print}' *.csv > bigfile.csv

Notes

This one differs from the first, general case in a few ways.

  • It sets , as the field separator.
  • FNR > 1 && $1 == "MD" && NF == 51 means: skip the header (FNR == 1) and print only when the first field is MD ($1 == "MD") and the number of fields is 51 (NF == 51).

The idiomatic way

As [ @ghoti ] mentioned in his comment:

awk's "default" command is already {print}

So the above script can be rewritten as:

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
         (FNR > 1 && NF == 51 && $1 == "MD")' *.csv > bigfile.csv
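The answers on this page test the first field in three slightly different ways: $1 ~ /MD/ (contains MD), $1 ~ /^MD/ (starts with MD), and $1 == "MD" (is exactly MD). The sketch below runs all three tests on a few made-up first-field values to show where they disagree.

```shell
# Each comparison evaluates to 1 (true) or 0 (false) in awk.
matches=$(printf 'MD,1\nMDX,2\nIdent - MD,3\nmd,4\n' | awk -F, '
    { printf "%s: contains=%d prefix=%d exact=%d\n",
             $1, ($1 ~ /MD/), ($1 ~ /^MD/), ($1 == "MD") }')
printf '%s\n' "$matches"
```

If the wanted rows always have exactly MD in the first column, as the question states, the exact comparison is the strictest choice; the regex forms also let through values such as MDX or a header like "Ident - MD".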

A fancy one-liner would look like:

awk -F',' 'FNR > 1 && $1 ~ /^MD/ && NF == 51 { print }' *.csv > /someotherpath/bigfile.csv

Instead of a fancy one-liner, a more complete bash script would be something like:

#!/bin/bash

# Assumes the .csv files use a single ',' as the field separator

for i in *.csv; do
    [ -e "$i" ] || continue    # skip the unexpanded pattern when no *.csv files are present
    awk -F',' 'NR > 1 && $1 ~ /^MD/ && NF == 51  { print }' "$i"
done > /someotherpath/bigfile.csv    # redirect once, so each file does not overwrite the previous output

The crux of the solution is awk's NR and NF variables: NR holds the number of the current record (row), and NF holds the number of fields in that record. NR > 1 skips the header line (in the loop, awk runs once per file, so NR restarts for each file; when several files are passed to a single awk call, FNR > 1 is the per-file equivalent). $1 ~ /^MD/ keeps only the lines whose first column starts with MD, and NF == 51 keeps only the lines that contain exactly 51 fields.
