
Uniqing a delimited file based on a subset of fields

I have data such as below:

1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

Due to the nature of the last two columns, their values change throughout the day and are repeated regularly. By grouping in the way outlined in my desired output (below), I can see each time their values changed (with the epoch time in the first column). Is there a way to achieve the desired output shown below:

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

So I consolidate the data by the last two columns. However, the consolidation is not completely unique (as can be seen by 207.55,207.5 being repeated).

I have tried:

uniq -f 1

However, the output gives only the first line and does not continue through the list.

The awk solution below does not allow a pair of values that occurred previously to be output again, and so gives the output shown below the awk code:

awk '!x[$2 $3]++'

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55

I do not wish to sort the data by the last two columns. However, since the first column is epoch time, the data may be sorted by the first column.

You can use an Awk statement as below,

awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file

which produces the output you need.

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

The idea is to store the second and third column values in the variables s and t respectively, and to print a line only when its values differ from those stored from the previous line.
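A slightly more general sketch of the same idea, not taken from the answer above: join the two compared fields into a single key and print a line whenever that key differs from the key of the previous line. The helper name changes_only and the variables key and prev are illustrative, and comma-separated input with at least three fields is assumed.

```shell
# Print only the lines whose (field 2, field 3) pair differs from the previous line.
# changes_only is a hypothetical helper name; input is read from stdin.
changes_only() {
  awk -F, '
    { key = $2 FS $3 }          # join the compared fields, keeping the separator
    NR == 1 || key != prev      # first line, or the pair changed: print the line
    { prev = key }              # remember this pair for the next line
  '
}

result=$(printf '%s\n' \
  '1493992429103289,207.55,207.5' \
  '1493992429103559,207.55,207.5' \
  '1493992429104353,207.55,207.5' \
  '1493992429104491,207.6,207.55' \
  '1493992429110551,207.55,207.5' | changes_only)
printf '%s\n' "$result"
```

Because the key is built by concatenation, the comparison is a string comparison, so 207.50 and 207.5 would count as different values here.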

You can't set delimiters with uniq; fields have to be separated by white space. With the help of tr you can convert the commas to spaces, run uniq, and convert the spaces back:

tr ',' ' ' <file | uniq -f1 | tr ' ' ','

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
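To see why a plain uniq -f 1 prints only one line for comma-separated data: with no blanks in a line, the whole line is a single field, so skipping one field leaves nothing to compare and every line looks identical. A small sketch of that behaviour, using made-up sample values:

```shell
# three lines that differ in every column (illustrative data)
sample=$(printf '%s\n' '1,a,b' '2,c,d' '3,e,f')

# comma-separated: each line is one field, so -f 1 skips all of it
raw=$(printf '%s\n' "$sample" | uniq -f 1)

# blank-separated: only the first column is skipped before comparing
fixed=$(printf '%s\n' "$sample" | tr ',' ' ' | uniq -f 1 | tr ' ' ',')

printf 'without tr:\n%s\n' "$raw"
printf 'with tr:\n%s\n' "$fixed"
```

Note that the round trip through tr also turns any spaces already present in the data into commas, so it only suits files whose fields contain no blanks.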

I found an answer which is not as elegant as Inian's but satisfies my purpose. Since my first column is always epoch time in microseconds and its number of characters never changes, I can use the following uniq command:

uniq -s 17
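The count given to -s is the width of the timestamp plus its trailing comma. As a sketch, assuming every line starts with a timestamp of the same width, that count can be derived from the first line instead of being hardcoded. The names data and skip are illustrative.

```shell
# write the sample rows to a temporary file
data=$(mktemp)
printf '%s\n' \
  '1493992429103289,207.55,207.5' \
  '1493992429103559,207.55,207.5' \
  '1493992429104353,207.55,207.5' \
  '1493992429104491,207.6,207.55' \
  '1493992429110551,207.55,207.5' > "$data"

# width of the first field, plus one for the comma that follows it
skip=$(awk -F, 'NR == 1 { print length($1) + 1; exit }' "$data")

# skip that many characters on every line before comparing adjacent lines
result=$(uniq -s "$skip" "$data")
printf 'skip=%s\n%s\n' "$skip" "$result"

rm -f "$data"
```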

You can try to manually (with a loop) compare the current line with the previous line.

prev_line=""
# start at the first line
i=1

# strip the first column, which is not part of the comparison
sed 's@^[0-9][0-9]*,@@' ./data_file > ./transform_data_file

# start with an empty list of line numbers to delete
: > ./line_to_be_suppress

# read each line of the file without its first column
while IFS= read -r current_line
do
  # if the current line is the same as the previous one
  if [ "x$prev_line" = "x$current_line" ]
  then
    # record its line number for deletion later
    echo "$i" >> ./line_to_be_suppress
  fi

  # remember the current line as the previous line
  prev_line=$current_line

  # advance the line counter
  i=$(( i + 1 ))
done < ./transform_data_file

# delete the recorded lines, last first so earlier line numbers stay valid
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i "${line_to_suppress}d" ./data_file ; done

rm ./line_to_be_suppress
rm ./transform_data_file

Since your first field seems to have a fixed length of 17 characters (16 digits plus the , delimiter), you could use the -s option of uniq, which would be more efficient for larger files:

uniq -s 17 file

Gives this output:

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

From man uniq:

-f num

Ignore the first num fields in each input line when doing comparisons. A field is a string of non-blank characters separated from adjacent fields by blanks. Field numbers are one based, i.e., the first field is field one.

-s chars

Ignore the first chars characters in each input line when doing comparisons. If specified in conjunction with the -f option, the first chars characters after the first num fields will be ignored. Character numbers are one based, i.e., the first character is character one.
