
sed to find and replace characters between two strings

I have a pipe-delimited file where some values in one of the columns contain pipes in the value itself, making it appear as though there are more columns than there actually are. Notice how "column 8" (bolded) has pipes in the middle. It should actually display as "|col u lm n8|", with spaces in place of the pipes.

column1|column2|column3|column4|column5|column6|column7|**col|u|lm|n8**|2016|column10|column11|column12|column13|column14|

I need to replace these pipes within column8 with spaces.

The good thing is that the data in column7 and column9 (|2016) is the same across the file, so I'm able to use a sed command such as this:

sed 's/|/ /7g;s/.\(|2016\)/|\1/' 

However, that changes every pipe from the 7th pipe to the end of the line. My question is: how can I get it to change pipes to spaces after the 7th pipe, but only up to the "|2016" column?

Thank you.

With your sample input this works for me with GNU sed 4.2.2:

sed -r ':start s/(column7.)([^\|]*?)\|(.*?.2016)/\1\2 \3/; t start' file

It replaces pipes between column7. and .2016, one pipe at a time. After a successful substitution, the t command jumps back to the :start label for another substitution attempt.
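The one-pipe-per-pass loop can also be sketched outside sed. The Python below is only an illustration and not part of the original answer: the function name join_between and the simplified pattern are assumptions, and re.subn with count=1 stands in for the s command followed by t.

```python
import re

# Assumed, simplified form of the sed pattern: the anchor field, a pipe-free
# run, one pipe, then the shortest remainder that ends in "|2016".
STEP = re.compile(r"(column7\|)([^|]*)\|(.*?\|2016)")


def join_between(line):
    """Replace one embedded pipe per pass until the pattern stops matching."""
    while True:
        line, count = STEP.subn(r"\1\2 \3", line, count=1)
        if count == 0:  # like sed's t: loop only after a successful substitution
            return line


sample = "column1|column2|column3|column4|column5|column6|column7|col|u|lm|n8|2016|column10|"
print(join_between(sample))
```

Each successful pass removes exactly one pipe, so the loop is guaranteed to terminate.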

Building on what Lars provided, the following should work in all versions of sed:

sed -e ':b' -e 's/\(|column7|\)\(.*\)|\(.*|2016|\)/\1\2 \3/' -e 'tb' inputfile

This works by repeatedly replacing embedded separators until the substitute pattern can't be found. Sed's t command branches to the :b label only if the previous substitution was successful.

We use the more classic BRE both for compatibility and to avoid sed interpreting the vertical bars as "or" separators in ERE.

The sed script is separated into individual -e options because some varieties of sed require label references to be "at the end of the line", and the termination of -e's argument is considered to be equivalent to the end of the line. (GNU sed doesn't require this, but a number of other seds do.)

But as anubhava points out in comments, this is an inferior approach because it will fail if the input data includes a second 2016| somewhere to the right of column 9.
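To make that failure mode concrete, here is a rough Python translation of the greedy BRE, run against a made-up line that contains a second 2016| field. The function name join_greedy and the shortened input are assumptions for illustration only.

```python
import re

# Translation of the BRE above into Python syntax; both (.*) groups are greedy.
GREEDY = re.compile(r"(\|column7\|)(.*)\|(.*\|2016\|)")


def join_greedy(line):
    """Apply the substitution repeatedly, as the sed t loop does."""
    while True:
        line, count = GREEDY.subn(r"\1\2 \3", line, count=1)
        if count == 0:
            return line


# Hypothetical input with a second 2016| to the right of column 9.
tricky = "c1|column7|col|u|lm|n8|2016|column10|2016|"
print(join_greedy(tricky))
```

Because the greedy groups anchor on the last |2016| in the line, every pipe between column7 and that last occurrence is turned into a space, including the real column separators.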

An alternate solution, if you're running bash, could be to place the fields into an array, then merge elements:

#!/usr/bin/env bash

input="column1|column2|column3|column4|column5|column6|column7|**col|u|lm|n8**|2016|column10|column11|column12|column13|column14|"

IFS=\| read -r -a a <<< "$input"   # -r keeps any backslashes in the data literal

while [ "${a[8]}" != "2016" ]; do
  a[7]="${a[7]} ${a[8]}"   # merge elements
  unset 'a[8]'             # delete merged element (quoted to avoid globbing)
  a=( "${a[@]}" )          # renumber array
done

printf "%s|" "${a[@]}"

Note that bash arrays start at index 0 by default. The readarray builtin allows you to specify an alternate start point for your index (-O), but that builtin was introduced in bash version 4, and there's still a lot of version 3 in the wild. So for portability, read -a it is.

Note also that without further error checking, the above script goes into an endless loop if for some reason you don't have a "2016" field in your input data. :-)
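For comparison, the same merge-into-one-element idea can be sketched in Python with a guard against the endless loop. This is a sketch rather than a port of the script: the function name merge_until, its parameters, and the choice to return the line unchanged when no "2016" field follows are all assumptions.

```python
def merge_until(line, index=7, stop="2016", sep="|"):
    """Merge fields after `index` into it until the next field equals `stop`."""
    fields = line.split(sep)
    # Assumed guard: if the stop value never follows, leave the line alone
    # instead of looping forever.
    if stop not in fields[index + 1:]:
        return line
    while fields[index + 1] != stop:
        fields[index] += " " + fields.pop(index + 1)  # merge and renumber
    return sep.join(fields)


row = "column1|column2|column3|column4|column5|column6|column7|col|u|lm|n8|2016|column10|"
print(merge_until(row))
```

Splitting on the separator keeps the trailing empty field, so joining the list again restores the final pipe.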

Here is a perl solution that works even when |2016 appears again in the line:

cat file
column1|column2|column3|column4|column5|column6|en|col|u|lm|n8|2016|column10|column11|2016|

perl -pe 's/(en\|[^|]*|(?<!^)\G[^|]*)\|(?!2016)/$1 /g' file

column1|column2|column3|column4|column5|column6|en|col u lm n8|2016|column10|column11|2016|

This regex uses the PCRE construct \G, which asserts position at the end of the previous match, or at the start of the string for the first match.
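Python's standard re module has no \G, but a compiled pattern's match(string, pos) is anchored at pos, which gives a similar chaining effect. The sketch below is a rough emulation under that assumption, not a translation of the perl one-liner: join_fragments is a made-up name, and unlike the /g substitution it only processes the first "en|" anchor in the line.

```python
import re

START = re.compile(r"en\|[^|]*")  # anchor field plus the first fragment
CONT = re.compile(r"[^|]*")       # next fragment, matched exactly at a given position


def join_fragments(line, stop="2016"):
    """Join fragments after the anchor until a pipe is followed by `stop`."""
    first = START.search(line)
    if first is None:
        return line
    out = [line[:first.end()]]
    pos = first.end()
    # Consume one pipe per iteration, unless it is followed by the stop value
    # (this plays the role of the (?!2016) lookahead).
    while line.startswith("|", pos) and not line.startswith(stop, pos + 1):
        fragment = CONT.match(line, pos + 1)  # anchored at pos + 1, like \G
        out.append(" " + fragment.group())
        pos = fragment.end()
    out.append(line[pos:])
    return "".join(out)


row = "column1|column2|column3|column4|column5|column6|en|col|u|lm|n8|2016|column10|column11|2016|"
print(join_fragments(row))
```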


This question really interested me. I upvoted it, but failed to solve it in sed or awk.

I tried it in Python and got it working. I am not providing an official answer, just some ideas :)

$cat sample.csv
column1|column2|column3|column4|column5|column6|column7|col|u|lm|n8|2016|column10|column11|column12|column13|column14|

My code:

$ cat test.py
import re

# Lazily capture everything between "column7|" and the first "|2016|".
REGEX = re.compile(r"column7\|(.+?)\|2016\|")

with open("sample.csv") as inputs:
    for line in inputs:
        match = REGEX.search(line)
        if match is None:
            print(line, end="")  # no marker found: leave the line untouched
            continue
        column8 = match.group(1)
        new_column8 = column8.replace("|", " ")  # pipes become spaces
        print(line.replace(column8, new_column8, 1), end="")

Result:

$ python3 test.py
column1|column2|column3|column4|column5|column6|column7|col u lm n8|2016|column10|column11|column12|column13|column14|

With GNU awk for the 3rd arg to match():

$ awk 'match($0,/(([^|]*[|]){7})(.*)(\|2016\|.*)/,a){gsub(/\|/," ",a[3]); $0=a[1] a[3] a[4]} 1' file
column1|column2|column3|column4|column5|column6|column7|**col u lm n8**|2016|column10|column11|column12|column13|column14|
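The three-part split used in that awk command (seven leading fields, the damaged field, everything from |2016| onward) maps fairly directly onto regex groups in other languages. A minimal Python sketch, assuming the same layout; the name fix_eighth_field is hypothetical.

```python
import re

# Seven leading fields, a greedy middle, then "|2016|" and the rest of the line.
ROW = re.compile(r"((?:[^|]*\|){7})(.*)(\|2016\|.*)", re.DOTALL)


def fix_eighth_field(line):
    """Replace pipes with spaces only inside the middle capture group."""
    parts = ROW.match(line)
    if parts is None:
        return line  # fewer than seven fields, or no |2016| marker
    head, middle, tail = parts.groups()
    return head + middle.replace("|", " ") + tail


row = "column1|column2|column3|column4|column5|column6|column7|col|u|lm|n8|2016|column10|column11|"
print(fix_eighth_field(row))
```

As in the awk version, the middle group is greedy, so a later |2016| in the line would be taken as the boundary instead of the first one.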

If the file has only one line, you could do:

col8=$(sed 's/\([^|]*|\)\{7\}\(.*\)|2016.*/\2/' file)
echo "Debug line: col8=${col8}, fixed ${col8//|/ }"
sed 's/^\(\([^|]*|\)\{7\}\).*|2016/\1'"${col8//|/ }"'|2016/' file

If you know a unique character or string, you can do much the same for a file with more lines. I will use mk97 as the unique string:

This might work for you (GNU sed):

sed 's/|/&\n/7;:a;ta;s/\n\(|2016|\)/\1/;s/\n|/ \n/;ta;s/\n\(.\)/\1\n/;ta' file

Insert a newline at the start of field eight. If the newline appears directly before field nine, delete it. If the newline is followed by a |, replace the | with a space and shuffle the newline along one character. If the newline is not followed by a |, shuffle the newline along one character.

NB: On any successful substitution, loop back to the placeholder :a.
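The marker walk described above can be pictured as a cursor moving through the line. The Python below is a sketch of that idea, not a translation of the sed script: an integer position plays the role of the newline marker, and walk_marker and its parameters are assumed names.

```python
def walk_marker(line, nth=7, stop="|2016|"):
    """Replace pipes with spaces between the nth pipe and the stop string."""
    pos = -1
    for _ in range(nth):  # locate the nth pipe
        pos = line.find("|", pos + 1)
        if pos == -1:
            return line  # fewer than nth pipes: nothing to do
    pos += 1  # the cursor now sits at the start of field nth + 1
    chars = list(line)
    while pos < len(chars):
        if line.startswith(stop, pos):  # cursor reached field nine: stop
            break
        if chars[pos] == "|":  # cursor is on a pipe: replace it
            chars[pos] = " "
        pos += 1  # shuffle the cursor along one character
    return "".join(chars)


row = "column1|column2|column3|column4|column5|column6|en|col|u|lm|n8|2016|column10|column11|2016|"
print(walk_marker(row))
```

Because the walk stops at the first |2016| after the 7th pipe, a later 2016 field in the same line is left alone.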
