
How to remove a specific string common to multiple lines in a CSV file using a shell script?

I have a CSV file that contains 65,000 lines (approximately 28 MB). Each line begins with a path, e.g. "c:\abc\bcd\def\123\456". Now let's say the path "c:\abc\bcd\" is common to all the lines and the rest of the content differs. I have to remove the common part (in this case "c:\abc\bcd\") from every line using a shell script. For example, the content of the CSV file is as follows:

C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag                   0   0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.vert                   0   0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.frag       16  24  3
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert       87  116 69
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert.bin   75  95  61
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0            0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-6            0   0   0 

For the above example I need the following output:

FILE0.frag                  0   0   0
FILE0.vert                  0   0   0
FILE0.link-link-0.frag      17  25  2
FILE0.link-link-0.vert      85  111 68
FILE0.link-link-0.vert.bin  77  97  60
FILE0.link-link-0               0   0
FILE0.link                  0   0   0

Can anyone please help me with this?

You could use sed:

$ cat test.csv 
"c:\abc\bcd\def\123\456", 1, 2
"c:\abc\bcd\def\234\456", 1, 2
"c:\abc\bcd\def\432\456", 3, 4

$ sed -i.bak -e 's/c\:\\abc\\bcd\\//1' test.csv

$ cat test.csv
"def\123\456", 1, 2
"def\234\456", 1, 2
"def\432\456", 3, 4

I am using sed here in this way:

sed -e 's/<SEARCH_TERM>/<REPLACE_TERM>/<OCCURRENCE>' FILE

where

  • <SEARCH_TERM> is what we are looking for (in this case c:\abc\bcd\ , with each backslash escaped as \\ in the pattern),
  • <REPLACE_TERM> is what we want to replace it with, in this case nothing, and
  • <OCCURRENCE> is which occurrence of the match we want to replace, in this case the first one on each line.
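
To see the effect of the trailing number on its own, here is a small sketch on made-up input. The string a-b-c-d and the helper name replace_nth_dash are illustrative assumptions and do not come from the question.

```shell
# Hypothetical helper: replace only the N-th '-' on each input line with '+'.
replace_nth_dash() {
  # $1 = which occurrence to replace (1, 2, ...)
  sed "s/-/+/$1"
}

printf 'a-b-c-d\n' | replace_nth_dash 1   # prints a+b-c-d
printf 'a-b-c-d\n' | replace_nth_dash 2   # prints a-b+c-d
printf 'a-b-c-d\n' | sed 's/-/+/g'        # 'g' replaces every match: a+b+c+d
```

Since a path prefix can only appear once at the start of a line, the 1 in the command above behaves the same as leaving the flag out.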

(-i.bak means: do not print the result, edit the file in place, but first save a backup of the original with the suffix .bak.)
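
A minimal sketch of what the in-place edit leaves on disk, run in a throwaway directory so that no real file is touched. The directory variable demo_dir is an assumption made for the demonstration. Note that -i is an extension found in GNU and BSD sed, not part of POSIX.

```shell
# Work in a temporary directory (demo_dir is a made-up name for this sketch).
demo_dir=$(mktemp -d)
printf '%s\n' '"c:\abc\bcd\def\123\456", 1, 2' > "$demo_dir/test.csv"

# Edit in place; the untouched original is saved as test.csv.bak.
sed -i.bak -e 's/c:\\abc\\bcd\\//' "$demo_dir/test.csv"

cat "$demo_dir/test.csv"       # edited file:   "def\123\456", 1, 2
cat "$demo_dir/test.csv.bak"   # backup copy:   "c:\abc\bcd\def\123\456", 1, 2
```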

Updated according to @david-c-rankin's comment. He is right: make a backup before editing files in case you make a mistake.
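
The sample in the question mixes forward slashes and backslashes, so the same substitution might look like the sketch below. The helper name strip_known_prefix is an assumption, and the prefix is copied from the sample lines in the question. Using | as the delimiter avoids having to escape the forward slashes.

```shell
# Hypothetical helper: remove the fixed prefix shown in the question sample.
strip_known_prefix() {
  # '^' anchors the match at the start of the line.
  # Each literal backslash is written '\\' and the literal dot is written '\.'.
  sed 's|^C:/Abc/Def/Test/temp\\\.\\test\\GLNext\\||'
}

# printf with %s passes the backslashes through unchanged.
printf '%s\n' 'C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag   0   0   0' |
  strip_known_prefix
# prints: FILE0.frag   0   0   0
```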

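# Initial candidate: the first field (everything before the first comma) of line 1
MaxPath="$( sed -n 's/,.*//p;1q' YourFile )"
# Escape backslashes and anchor the candidate at the start of the line
GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"

# Search for the longest prefix to remove: drop one trailing character
# at a time until no line fails to match
while [ ${#MaxPath} -gt 0 ] && [ "$( grep -c -v -E "${GrepPath}" YourFile )" -gt 0 ]
 do
   MaxPath="${MaxPath%%?}"
   GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"
 done

# Adapt your file (the result is written to standard output)
if [ ${#MaxPath} -gt 0 ]
 then
   sed "s#${GrepPath}##" YourFile
 fi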
  • This assumes, for the sample, that MaxPath contains no special regex character and no # (only backslashes are escaped).
  • The grep -c -v -E call is not optimized for performance: it reads the whole file on every iteration, although it could stop at the first line that does not match.
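
Both caveats can be sidestepped by comparing literal characters instead of building a regex. The following is only a sketch of that alternative, not the method of the answer above: the function name strip_common_prefix is hypothetical, it reads the file twice with awk, and it makes the extra assumption that the prefix should be cut back to the last / or \ so that a shared start of the file names (such as FILE0. in the question) is kept.

```shell
# Hypothetical helper: print the file given as $1 with the longest directory
# prefix shared by all lines removed. Characters are compared literally, so
# backslashes, dots and '#' need no escaping.
strip_common_prefix() {
  awk '
    NR == FNR {
      # First pass: shrink the candidate until the current line starts with it.
      if (FNR == 1) prefix = $0
      while (prefix != "" && substr($0, 1, length(prefix)) != prefix)
        prefix = substr(prefix, 1, length(prefix) - 1)
      next
    }
    FNR == 1 {
      # Before the second pass: cut the prefix back to the last separator.
      while (prefix != "") {
        c = substr(prefix, length(prefix), 1)
        if (c == "\\" || c == "/") break
        prefix = substr(prefix, 1, length(prefix) - 1)
      }
    }
    # Second pass: print each line without the prefix.
    { print substr($0, length(prefix) + 1) }
  ' "$1" "$1"
}

# Usage (YourFile as in the script above):
#   strip_common_prefix YourFile > YourFile.new
```

Each line is examined once per pass, so the cost no longer grows with the length of the prefix. The file must be a regular file, because it is opened twice.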
