Splitting CSV file and excluding column in output using bash, sed or awk

I have a CSV file which contains data like the following:

1,275,,,275,17.3,0,"2011-05-09 20:21:45"
2,279,,,279,17.3,0,"2011-05-10 20:21:52"
3,276,,,276,17.3,0,"2011-05-11 20:21:58"
4,272,,,272,17.3,0,"2011-05-12 20:22:04"
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"

This file contains 4432986 rows of data.

I want to split the file, naming each new file after the date in the last column.

So, based on the data above, I would want 6 new files, each containing the rows for one day.

I would like the files named in YYYY_MM_DD format.

I would also like to leave the first column out of the output data.

So file 2011_05_13 would contain the following rows, with the first column excluded:

272,,,272,17.3,0,"2011-05-13 20:22:10"
278,,,278,17.3,0,"2011-05-13 20:24:08"
270,,,270,17.3,0,"2011-05-13 20:24:14"

I am planning to do this on a Linux box, so anything using standard Linux utilities (sed, awk, etc.) would be cool.

Here's a one-liner for you in awk:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

Desired output achieved, although perhaps some of this code could be made more succinct. HTH.

EDIT:

Read the code from left to right:

  • -F ","
    This sets the field delimiter to a comma.

  • split ($8,array," ")
    This splits the eighth column on the space and puts the pieces in an array called array.

  • sub ("\"","",array[1])
    We take the first array element (this is the piece that will become our output file name) and substitute out the leading " symbol. We need to escape the " symbol, so we put the \ character in front of it.

  • sub (NR,"",$0)
    This conveniently removes the line number from the beginning of each line (NR is the row number and $0 is of course the whole line of input before delimitation). Note that it relies on the first column being equal to the row number, as it is in the sample data.

  • sub (",","",$0)
    This removes the comma after the row number.

  • Now that we have a clean filename and a clean row of data, we can write $0 to the file named by array[1]: print $0 > array[1].
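The sub (NR,"",$0) step above depends on the first column matching awk's record number. A small check, assuming a made-up row whose first column is 7 but which arrives as the first line of input, compares that step with an anchored pattern that strips everything up to the first comma (the variable names are only for illustration):

```shell
# Hypothetical row: first column is 7, but awk sees it as record 1 (NR == 1).
row='7,270,,,270,17.3,0,"2011-05-13 20:24:14"'

# Approach from the answer: NR is 1, so the first "1" in the row (inside 17.3) is removed.
by_nr=$(printf '%s\n' "$row" | awk '{ sub(NR, "", $0); sub(",", "", $0); print }')

# Anchored alternative: remove everything up to and including the first comma.
anchored=$(printf '%s\n' "$row" | awk '{ sub(/^[^,]*,/, ""); print }')

printf '%s\n' "$by_nr" "$anchored"
```

For this row the first result comes out as 7270,,,270,7.3,0,"2011-05-13 20:24:14", while the anchored version gives 270,,,270,17.3,0,"2011-05-13 20:24:14". If the first column always equals the line number, both approaches give the same output.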

FIX:

So if you'd prefer an underscore instead of a hyphen, all we need to fix is array[1]. I've just added a global substitution: gsub ("-","_",array[1]).

The updated code is:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); gsub ("-","_",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

HTH.
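With millions of rows spread over many days, some awk implementations may run into a limit on simultaneously open output files. A possible variation, sketched here under the assumption that rows for the same day are mostly adjacent (as in the sample data), closes the previous day's file whenever the date changes. The names sample.csv, parts, name and prev are invented for this sketch:

```shell
# Work in a scratch directory; sample.csv is a made-up name for this sketch.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > sample.csv <<'EOF'
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"
EOF

awk -F, '
{
    split($8, parts, " ")          # parts[1] holds the date with a leading quote
    name = parts[1]
    sub(/^"/, "", name)            # drop the leading double quote
    gsub(/-/, "_", name)           # 2011-05-13 becomes 2011_05_13
    if (name != prev) {            # the date changed since the previous row
        if (prev != "") close(prev)
        prev = name
    }
    sub(/^[^,]*,/, "")             # strip the first column and its comma
    print >> name                  # append, so a reopened file is not truncated
}' sample.csv

wc -l 2011_05_13 2011_05_14
```

Because this appends with >>, output files left over from an earlier run would need to be removed first, otherwise rows get duplicated.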

You can use this awk command:

awk -F, 'BEGIN{OFS=","} {dt=$8; gsub(/^"| .*"$/,"", dt);
gsub(/-/,"_", dt);   # name files YYYY_MM_DD, as the question asks
$1=""; sub(/^,/, "", $0); print $0 > dt}' input.txt

A scripting language (perl/python) is likely your best choice here, but I liked the challenge of doing this in bash, so here it is.

 cat bigfile.txt | while IFS= read -r LINE;
  do echo "$LINE" >> "$(echo "$LINE" | cut -d, -f8 | cut -c2-11).txt" ;
 done

Basically, this reads the file line by line in the while loop, then appends each line to a file named after its date.

The date is pulled out with a combination of two cut commands. The first cut extracts the last column (column 8) using a comma delimiter ( -d, ); the second cut keeps characters 2 through 11 of that field, which skips the leading " and stops at the end of the date.
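To see what each cut contributes, here is a small check on a single sample row (the variable names line, field and day are only for illustration):

```shell
line='5,272,,,272,17.3,0,"2011-05-13 20:22:10"'

# First cut: field 8, still wrapped in double quotes.
field=$(printf '%s\n' "$line" | cut -d, -f8)

# Second cut: characters 2-11 skip the leading quote and keep the 10-character date.
day=$(printf '%s\n' "$field" | cut -c2-11)

printf '%s\n' "$field" "$day"
```

This prints "2011-05-13 20:22:10" (with its quotes) followed by 2011-05-13.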


Now, to tackle the removal of the first column:

cat bigfile.txt | sed 's/^[^,]*,//'

This regular expression removes everything up to and including the first comma. sed has no non-greedy .*? operator, so [^,]* is used to stop at the first comma.
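Since sed's basic regular expressions have no non-greedy quantifier, it can help to compare a few candidate patterns on one row before running them over the whole file. A quick check (the variable names are invented for this sketch):

```shell
line='5,272,,,272,17.3,0,"2011-05-13 20:22:10"'

# In a basic regular expression "?" is a literal character, so this only
# matches rows that contain "?," and leaves this row unchanged.
lazy=$(printf '%s\n' "$line" | sed 's/^.*?,//')

# A plain .* is greedy: it runs to the LAST comma, leaving only the final field.
greedy=$(printf '%s\n' "$line" | sed 's/^.*,//')

# [^,]* cannot cross a comma, so only the first column is removed.
first=$(printf '%s\n' "$line" | sed 's/^[^,]*,//')

printf '%s\n' "$lazy" "$greedy" "$first"
```

Only the third form strips just the first column.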

So, we'll replace the beginning of our while loop with this. With the first column gone, the date moves from field 8 to field 7, so the first cut becomes -f7, leaving us with:

 cat bigfile.txt | sed 's/^[^,]*,//' | while IFS= read -r LINE;
  do echo "$LINE" >> "$(echo "$LINE" | cut -d, -f7 | cut -c2-11).txt" ;
 done
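The loop above starts several processes for every input line, which adds up over millions of rows. One possible alternative, sketched here with shell parameter expansion only, does the same trimming and date extraction inside the shell. The names line, rest, stamp and day are assumptions for this sketch, and the output files keep the hyphenated YYYY-MM-DD.txt names used by the loop above:

```shell
# Work in a scratch directory with a small sample standing in for bigfile.txt.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > bigfile.txt <<'EOF'
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"
EOF

while IFS= read -r line; do
    rest=${line#*,}          # drop the first column and its comma
    stamp=${rest##*,\"}      # text after the last ," is the timestamp
    day=${stamp%% *}         # keep only the date before the space
    printf '%s\n' "$rest" >> "$day.txt"
done < bigfile.txt
```

This still opens an output file for every line, so for a file of this size the awk solutions are likely to be considerably faster.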

This monstrosity grabs all the unique dates and then greps for those keys in the original file, saving the matching rows to files named by that key. Yes, it is a useless use of cat, but I was trying to atomize the actions.

cat records.txt \
| cut -f8 -d, \
| cut -f1 -d ' ' \
| tr -d '"' \
| sort -u \
| while read -r DATE ; do
    cat records.txt \
    | cut -f2- -d, \
    | grep -E ",\"${DATE} [0-9]{2}:[0-9]{2}:[0-9]{2}\"" \
    > "${DATE}.txt"
done

Stripping the first column should be very simple:

$ sed 's/^[0-9]*,//' your_gigantic_data.csv

This might work for you (the row is single-quoted in the generated echo command so that the shell keeps its double quotes):

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo '\''\1'\'' >>\2_\3_\4.csv/' file | sh

or with GNU sed, whose e flag runs each generated command directly:

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo '\''\1'\'' >>\2_\3_\4.csv/e' file
