How can I remove a specific string common to multiple lines of a CSV file using a shell script?

Question

I have a CSV file containing 65,000 lines (approximately 28 MB). Each line begins with a path, e.g. "c:\abc\bcd\def\123\456". Let's say the path "c:\abc\bcd\" is common to all the lines and the rest of the content differs. I have to remove the common part (in this case "c:\abc\bcd\") from every line using a shell script. For example, the CSV file contains the following:

C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag                   0   0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.vert                   0   0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.frag       16  24  3
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert       87  116 69
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert.bin   75  95  61
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0            0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-6            0   0   0

In the above example, I need the output to be as follows:

FILE0.frag                  0   0   0
FILE0.vert                  0   0   0
FILE0.link-link-0.frag      16  24  3
FILE0.link-link-0.vert      87  116 69
FILE0.link-link-0.vert.bin  75  95  61
FILE0.link-link-0           0   0
FILE0.link-link-6           0   0   0

Can anyone please help me with this?

Answer 1

You could use sed:

$ cat test.csv 
"c:\abc\bcd\def\123\456", 1, 2
"c:\abc\bcd\def\234\456", 1, 2
"c:\abc\bcd\def\432\456", 3, 4

$ sed -i.bak -e 's/c:\\abc\\bcd\\//1' test.csv

$ cat test.csv
"def\123\456", 1, 2
"def\234\456", 1, 2
"def\432\456", 3, 4

I am using sed here in this way:

sed -e 's/<SEARCH TERM>/<REPLACE TERM>/<OCCURRENCE>' FILE

where

<SEARCH TERM> is what we are looking for (in this case c:\abc\bcd\, with each backslash escaped as \\),
<REPLACE TERM> is what we want to replace it with (in this case nothing), and
<OCCURRENCE> is which occurrence of the match we want to replace (in this case the first one on each line).
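
Escaping a longer prefix by hand is error-prone, so the escaped form can be generated instead. The following is a minimal sketch and not part of the original answer: escape_sed_pattern is a hypothetical helper name, and the prefix is the one from the sample data in the question.

```shell
#!/bin/sh
# escape_sed_pattern (hypothetical helper): backslash-escape the characters
# that are special in a basic regular expression (\ . * [ ] ^ $) and the
# "/" delimiter, so the argument can be used literally as <SEARCH TERM>.
escape_sed_pattern() {
    printf '%s\n' "$1" | sed 's,[][\\.*^$/],\\&,g'
}

# Common prefix taken from the sample data in the question.
prefix='C:/Abc/Def/Test/temp\.\test\GLNext\'
pattern=$(escape_sed_pattern "$prefix")

# Anchor with ^ so that only the start of each line is touched.
printf '%s\n' 'C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag   0   0   0' |
    sed "s/^$pattern//"
```

For the sample line this prints the file name followed by the three counters, with the directory part removed. The result goes to standard output, so nothing is modified until the command is combined with -i.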

(-i.bak means: do not print the result, edit the file in place, but first save a backup copy of the original with the .bak suffix.)

Updated according to @david-c-rankin's comment. He is right: make a backup before editing files in case you make a mistake.
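
To see what the backup buys you, here is a small sketch that runs the in-place edit on a throwaway copy and then compares the result with the saved original. The temporary directory and the two-line test file are assumptions made for illustration only.

```shell
#!/bin/sh
# Work in a throwaway directory so that no real data is touched.
workdir=$(mktemp -d)
cat > "$workdir/test.csv" <<'EOF'
"c:\abc\bcd\def\123\456", 1, 2
"c:\abc\bcd\def\234\456", 1, 2
EOF

# Edit in place; sed keeps the original as test.csv.bak.
sed -i.bak -e 's/c:\\abc\\bcd\\//' "$workdir/test.csv"

# Review the change. diff exits non-zero when the files differ, which is
# expected here, so its status is ignored.
diff "$workdir/test.csv.bak" "$workdir/test.csv" || true

# To undo the edit: mv "$workdir/test.csv.bak" "$workdir/test.csv"
```

If the diff shows something unexpected, moving the .bak file back restores the original content.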

Answer 2

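# Start from the first field (everything before the first comma) of the first line
MaxPath="$( sed -n 's/,.*//p;1q' YourFile )"
# Build an anchored regex from it, doubling each backslash so it matches literally
GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"

# Find the longest prefix shared by all lines: drop the last character
# until no line fails to match
while [ ${#MaxPath} -gt 0 ] && [ "$( grep -c -v -E "${GrepPath}" YourFile )" -gt 0 ]
 do
   MaxPath="${MaxPath%?}"
   GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"
 done

# Remove that prefix from every line (the result goes to standard output)
if [ ${#MaxPath} -gt 0 ]
 then
   sed "s#${GrepPath}##" YourFile
 fi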

This assumes that MaxPath contains no special regex characters and no # character.
Note that grep -c -v -E is not optimized for performance: it scans the whole file on every iteration, whereas it could stop at the first non-matching line.
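
One way around both caveats is to compare strings literally and read the file only once to find the prefix. The sketch below swaps the grep loop for awk's substr, so it is a different technique from the script above, not a correction of it. The function names are hypothetical, and it assumes each line starts directly with the path (no leading quote) and that the prefix to remove should end at a "/" or "\" so file names are not cut.

```shell
#!/bin/sh
# common_dir_prefix_len FILE (hypothetical helper)
# Print the length of the longest prefix shared by every line of FILE,
# cut back so that it ends at a path separator ("/" or "\").
common_dir_prefix_len() {
    awk '
        NR == 1 { prefix = $0; next }
        {
            # Shorten the candidate until the current line starts with it.
            while (substr($0, 1, length(prefix)) != prefix)
                prefix = substr(prefix, 1, length(prefix) - 1)
        }
        END {
            # Do not cut into a file name: back up to the last separator.
            while (length(prefix) > 0) {
                last = substr(prefix, length(prefix))
                if (last == "/" || last == "\\") break
                prefix = substr(prefix, 1, length(prefix) - 1)
            }
            print length(prefix)
        }
    ' "$1"
}

# strip_common_prefix FILE (hypothetical helper)
# Write FILE to standard output with the shared directory prefix removed.
strip_common_prefix() {
    n=$(common_dir_prefix_len "$1")
    awk -v n="$n" '{ print substr($0, n + 1) }' "$1"
}

# Demo on two lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag   0   0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.vert   0   0   0
EOF
strip_common_prefix "$sample"
rm -f "$sample"
```

Only the prefix length is passed between the two awk calls, which avoids handing a string full of backslashes to awk -v, where escape sequences would be interpreted.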
