sed command to delete text until match is found for each line of a csv

I have a CSV file, and I am trying to delete all characters from the beginning of each line up to the first occurrence of "2015". I want to do this for every line in the file.

My CSV file is structured as follows:

Field1 , Field2 , Field3 , Field4
sometext1 , 2015-07-15 , sometext2, sometext3
sometext1 , 2015-07-14 , sometext2, sometext3
sometext1 , 2015-07-13 , sometext2, sometext3

I cannot use cut, or sed keyed on the first comma, because the text in Field1 sometimes contains commas too, which complicates parsing. I figured that if I search each line for the first occurrence of the text 2015 and replace everything before it with nothing, that should work.

FYI, I want to do this for the FIRST occurrence of 2015 only. Another column also contains a text field with 2015 in it, and I don't want any text before that one to be affected.

For example, if my original line is:

sometext1,#015,2015-07-10,sometext2,2015,sometext3

I want it to return:

2015-07-10,sometext2,2015,sometext3

Does anyone know the sed command to do this?

Any help will be appreciated!

Thanks

Here is a way to do it with sed, assuming "#####" never occurs in a line:

sed -e 's/2015/#####&/'|sed -e 's/.*#####//'

For example:

> echo sometext1,#015,2015-07-10,sometext2,2015,sometext3\
  |sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
2015-07-10,sometext2,2015,sometext3

The first sed command prefixes "#####" to the first occurrence of 2015, and the second sed command removes everything from the beginning of the line through the end of the "#####" prefix.

The basic reason for using this two-stage method is that sed's regular expression matcher has only greedy wildcards, which always pick the longest match; it does not support lazy matching, which picks the shortest match.
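To see why a single greedy pattern is not enough, here is a small sketch using the sample line from the question (the variable names are only for illustration):

```shell
line='sometext1,#015,2015-07-10,sometext2,2015,sometext3'

# Greedy: .* extends to the LAST 2015 on the line, so too much is removed.
greedy=$(printf '%s\n' "$line" | sed 's/.*2015//')
printf '%s\n' "$greedy"
# prints: ,sometext3
```

The date and the second field are lost, which is why the first 2015 has to be marked before anything is deleted.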

If "#####" may occur in a line, a more unlikely string could be substituted for it, such as "7z#dNjm_wG8a3!esu@Rhv=".
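Before relying on a marker, it may be worth checking that it really is absent from the data. A minimal sketch, assuming a throwaway sample file created with mktemp (the file contents and variable names are made up for the demonstration):

```shell
marker='#####'        # assumed marker; any string absent from the data will do
tmp=$(mktemp)         # throwaway sample file, for the demo only
printf '%s\n' 'a,#015,2015-07-10,b,2015,c' 'x,2015-07-11,y #####,z' > "$tmp"

# Count the lines that already contain the marker (-F: fixed string, not a regex).
collisions=$(grep -cF -- "$marker" "$tmp")
echo "lines containing marker: $collisions"
# prints: lines containing marker: 1

rm -f "$tmp"
```

A non-zero count means the two-stage command could cut such a line in the wrong place, so a different marker should be chosen.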

To do this with sed without Perl-style non-greedy operators, you need to mark the first instance with something you know won't be in the line, as Tris describes. However, that solution requires knowing what won't be in the file. Fortunately, you can guarantee that a newline won't be in the line, because that's what terminated the line. Thus you can do something like:

sed 's/2015/\n&/;s/.*\n//' input.txt > output.txt
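One caveat: GNU sed understands `\n` on the replacement side of `s`, but POSIX leaves that escape unspecified, so other sed implementations may behave differently. If that is a concern, a different technique avoids markers altogether: awk's `index` and `substr`. This is a sketch of an alternative, not the sed approach above, and the function name is hypothetical:

```shell
# Hypothetical helper: keep each line from its first "2015" onward.
# Lines without "2015" (such as the header) pass through unchanged.
strip_before_2015() {
  awk '{ i = index($0, "2015"); if (i > 0) $0 = substr($0, i); print }'
}

printf '%s\n' 'sometext1,#015,2015-07-10,sometext2,2015,sometext3' | strip_before_2015
# prints: 2015-07-10,sometext2,2015,sometext3
```

Because `index` returns the position of the first match, no greedy-versus-lazy issue arises.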

NOTE: this won't modify the header row, which you would have to treat specially.
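One possible way to treat the header, sketched under several assumptions: the header is line 1, its first column name contains no comma, the goal is simply to drop that first column name, and the sed in use is GNU sed (for `\n` in the replacement). The function name is hypothetical:

```shell
# Hypothetical helper: reads CSV on stdin, writes the trimmed CSV to stdout.
# Line 1 (header): drop the first field name and the whitespace after its comma.
# Lines 2..$: mark the first 2015 with a newline, then delete through the marker.
strip_with_header() {
  sed -e '1s/^[^,]*,[[:space:]]*//' \
      -e '2,$s/2015/\n&/' \
      -e '2,$s/.*\n//'
}

printf '%s\n' 'Field1 , Field2 , Field3 , Field4' \
              'some,text1 , 2015-07-15 , sometext2, sometext3' | strip_with_header
# prints:
# Field2 , Field3 , Field4
# 2015-07-15 , sometext2, sometext3
```

Restricting the marker commands to `2,$` keeps the header rule and the data rule from interfering with each other.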
