
Bash string comparison between two CSV files

I have two strings (parsed from CSV files), each with ~200 columns. I need to compare them and identify which column differs. Example:

str1file1="a,b,c,d,e,f,pp,qq"
str2file2="a,b,c,d,x,f,pp,qq"

I need the column number (5 here) and the corresponding values as my output. Example: 5 e x. Since I need to compare millions of such strings, speed is key. An actual record:

0x0009aeef,xyz,wert,57116,192.168.17.1,45320,192.168.17.2,45320,ctty,lkipop,1408477403,1408477403,,1408477722,1408477403,1408477718,2,0,5,98,0,3055925732,0,0,0,0,15756,15732,24,0,0,0,0,0,0,0.68,23,0,1,23,15776,0.00,15270,459,1,0,0,0,0,0,0,0,0,0,5.755,1408477403,1408477718,2,0,7,98,0,112988428,0,0,0,0,15776,15742,34,0,0,0,0,0,0,8.32,33,0,1,33,15756,0.01,15555,185,0,0,0,0,0,0,0,0,0,0,3.077,-0,-0,-12,-11,-23,-36,-31,-39,22,35,19,28,,,,,1.8,2.4,2.2,2.6,1.8,2.4,2.2,2.5,37,49,45,52,36,48,44,51,15625,107,891,5.60,12528,3204,14430,1312,723,2.65,13291,2451

0x0009aeef is the primary key (1st column); however, it is not guaranteed that both files have the same number of entries (rows). I sort both files on the primary key and extract the required columns (~135) with cut, creating temp files. A 'while read' loop then reads the first temp file and uses grep to find the matching line in the second temp file. If grep fails, either the key or the values differ, so I use awk to check the key and then the values. Any better approach is much appreciated. Here is the present code:

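# Sort both files on the primary key (column 1) and keep only the required columns.
sort --field-separator=',' --key=1 "$csv1" | cut -d "," -f1,...134 | tr -d '\t' > file1
sort --field-separator=',' --key=1 "$csv2" | cut -d "," -f1,...134 | tr -d '\t' > file2
while IFS= read -r line; do
      # -F treats the record as a fixed string, so the dots in IP addresses are not regex wildcards.
      sl=$(grep -F -- "$line" file2)
      if [ "$line" != "$sl" ]; then
         # No identical record in file2: check whether the primary key exists there at all.
         rec=$(printf '%s\n' "$line" | awk -F, '{ print $1 }')
         slId=$(grep -F -- "$rec" file2 | awk -F, '{ print $1 }')
         if [ "$rec" = "$slId" ]; then
               : # validation failed, primary key found
         else
               : # primary key not found
         fi
      else
         : # all is well
      fi
done < file1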

If speed is key, I'd consider parsing the CSV files with mawk. Alternatively, update the post with sample files so we can offer a better solution.
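To make the awk suggestion concrete, here is a rough sketch rather than a benchmarked solution. It reads both files in a single awk pass, matches rows on column 1, and prints the key, the column number, and both values for every differing column. The function name `csv_diff` and the output format are assumptions made for illustration. It also assumes that fields contain no quoted commas, that keys are unique, that the first file is non-empty, and that the first file fits in memory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report differing columns between two CSV files,
# matching rows on column 1 (the primary key).
csv_diff() {
    # $1, $2: paths to the two CSV files
    awk -F, '
        # First file: remember each full record under its key.
        NR == FNR { rec[$1] = $0; next }
        # Second file: compare against the stored record, if there is one.
        {
            seen[$1] = 1
            if (!($1 in rec)) { print $1 " only in second file"; next }
            if ((rec[$1] "") == ($0 "")) next      # identical record
            n = split(rec[$1], old, ",")
            max = (n > NF) ? n : NF
            for (i = 2; i <= max; i++)
                # Appending "" forces a string comparison, so that
                # values such as 0 and 0.00 are reported as different.
                if ((old[i] "") != ($i ""))
                    print $1, i, old[i], $i
        }
        # Keys that never appeared in the second file.
        END {
            for (k in rec)
                if (!(k in seen)) print k " only in first file"
        }
    ' "$1" "$2"
}
```

Called as `csv_diff file1 file2`, this needs no prior sort and no per-line grep, which is where the original loop spends most of its time. Because the script sticks to POSIX awk features, replacing `awk` with `mawk` should work unchanged where mawk is installed.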

Using Bash: 使用Bash:

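# str1 and str2 hold the two comma-separated records to compare.
IFS=, read -r -a line <<<"$str1"
IFS=, read -r -a line2 <<<"$str2"
for i in "${!line[@]}"; do
    # Quote the right-hand side so it is compared literally, not as a glob pattern.
    if [[ ${line[i]} != "${line2[i]}" ]]; then
        printf '%s\n%s\n' "${line[i]}" "${line2[i]}"
    fi
done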

Output:

e
x
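The loop above prints only the differing values. A minimal sketch that also reports the column number, as the question asks, might look like the following. The function name `diff_columns` and the "column value1 value2" output format are assumptions for illustration, and the split assumes fields contain no quoted commas.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<column> <value1> <value2>" for each differing column.
diff_columns() {
    local -a a b
    local i n
    IFS=, read -r -a a <<<"$1"
    IFS=, read -r -a b <<<"$2"
    # Walk the longer of the two arrays so extra trailing columns are reported too.
    n=$(( ${#a[@]} > ${#b[@]} ? ${#a[@]} : ${#b[@]} ))
    for (( i = 0; i < n; i++ )); do
        if [[ "${a[i]}" != "${b[i]}" ]]; then
            # Bash arrays are 0-based; add 1 to get the column number.
            printf '%s %s %s\n' "$((i + 1))" "${a[i]}" "${b[i]}"
        fi
    done
}

# Usage with the strings from the question; prints: 5 e x
diff_columns "a,b,c,d,e,f,pp,qq" "a,b,c,d,x,f,pp,qq"
```

A pure Bash loop like this is convenient for a handful of records, but for millions of rows an awk-based pass is likely to be considerably faster.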
