
Use sed to replace text only inside quotes

I have this test file:

[root@localhost ~]# cat f.txt 
"a aa"  MM  "bbb  b"
MM    MM
MM"b b "
[root@localhost ~]#

I want to replace every space character inside the quotes, and only inside the quotes. Characters outside the quotes should not be touched. That is to say, what I want is something similar to:

"a_aa"  MM  "bbb__b"
MM    MM
MM"b_b_"

Can this be implemented using sed?

Thanks,

This is an entirely non-trivial question.

This replaces the first space inside each pair of quotes with an underscore:

$ sed 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa"  MM  "bbb_ b"
MM    MM
MM"b_b "
$

For this example, where there are no more than two spaces inside any pair of quotes, it is tempting to simply repeat the command, but that gives an incorrect result, because the pattern cannot tell an opening quote from a closing one and the second pass starts its match at the closing quote of "a_aa":

$ sed -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' \
>     -e 's/\("[^ "]*\) \([^"]*"\)/\1_\2/g' f.txt
"a_aa"_ MM  "bbb_ b"
MM    MM
MM"b_b_"
$

If your version of sed supports 'extended regular expressions', then this works for the sample data:

$ sed -E \
>    -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
>    -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
>    -e 's/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/' \
>    f.txt
"a_aa"  MM  "bbb__b"
MM    MM
MM"b_b_"
$

You have to repeat that ghastly regex once for every space within double quotes, hence three times for the first line of data.

The regex can be explained as:

  • Starting at the beginning of a line,
  • Look for sequences of 'zero or more non-quotes, optionally followed by a quoted string that contains no spaces (a quote, zero or more characters that are neither spaces nor quotes, and a quote)', the whole assembly repeated zero or more times,
  • Followed by a quote, zero or more characters that are neither quotes nor spaces, a space, zero or more non-quotes, and a quote.
  • Replace the matched material with the leading part, the material at the start of the current quoted passage, an underscore, and the trailing material of the current quoted passage.

Because of the start anchor, this has to be repeated once per blank... but sed has a looping construct, so we can do it with:

$ sed -E -e ':redo
>            s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/
>            t redo' f.txt
"a_aa"  MM  "bbb__b"
MM    MM
MM"b_b_"
$

The :redo defines a label; the s/// command is as before; the t redo command jumps to the label if any substitution has been made since the last line was read or the last conditional jump was taken.
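
To watch the loop at work, here is a small sketch that prints the pattern space after every successful substitution. It assumes a sed that accepts -E (use -r on older GNU sed); the function name trace_loop, the extra label show, and the sample line fed in by printf are only illustrative and not part of the answer above. Because of -n, a line that needs no substitution prints nothing in this trace.

```shell
# Trace the t-loop: print the pattern space after each successful substitution.
# Assumption: this sed accepts -E for extended regular expressions.
trace_loop() {
    sed -n -E -e ':redo
        s/^(([^"]*("[^ "]*")?)*)("[^ "]*) ([^"]*")/\1\4_\5/
        t show
        b
        :show
        p
        b redo'
}

# One line is printed per space replaced inside quotes.
printf '%s\n' '"a aa"  MM  "bbb  b"' | trace_loop
```

For the sample line this should print three lines, one per replaced space, ending with "a_aa"  MM  "bbb__b".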


Given the discussion in the comments, there are a couple of points worth mentioning:

  1. The -E option applies to sed on Mac OS X (tested on 10.7.2). The corresponding option for the GNU version of sed is -r (or --regexp-extended). The -E option is consistent with grep -E, which also uses extended regular expressions. The 'classic Unix systems' (Solaris 10, AIX 6, HP-UX 11) do not support EREs in sed.

  2. You can replace the ? I used (which is the only character that forces the use of an ERE instead of a BRE) with *, and then deal with the parentheses (which need backslashes in front of them in a BRE to make them capturing parentheses), leaving the script:

     sed -e ':redo
             s/^\(\([^"]*\("[^ "]*"\)*\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
             t redo' f.txt

    This produces the same output on the same input. I also tried some slightly more complex patterns in the input:

     "a aa"  MM  "bbb  b"
     MM    MM
     MM"b b "
     "c c""d d""e  e" X " f "" g " "C C" "D D" "E  E" x " F " " G "

    This gives the output:

     "a_aa"  MM  "bbb__b"
     MM    MM
     MM"b_b_"
     "c_c""d_d""e__e" X "_f_""_g_" "C_C" "D_D" "E__E" x "_F_" "_G_"

  3. Even with BRE notation, sed supports the \{0,1\} notation to specify 0 or 1 occurrences of the previous RE term, so the ? version can be translated to a BRE using:

     sed -e ':redo
             s/^\(\([^"]*\("[^ "]*"\)\{0,1\}\)*\)\("[^ "]*\) \([^"]*"\)/\1\4_\5/g
             t redo' f.txt

    This produces the same output as the other alternatives.

A somewhat unusual answer in XSLT 2.0:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    <xsl:output method="text"></xsl:output>
    <xsl:template name="init">
        <xsl:for-each select="tokenize(unparsed-text('f.txt'),'&#10;')">
            <xsl:for-each select="tokenize(.,'&quot;')">
                <xsl:value-of select="if (position() mod 2 = 0) 
                  then concat('&quot;',translate(.,' ','_'),'&quot;') else ."></xsl:value-of>
            </xsl:for-each>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>    
</xsl:stylesheet>

To test it, just get saxon9.jar from SourceForge and use the following command line:

java -jar saxon9.jar -it:init regexp.xsl

The XSLT file includes the reference to f.txt, so the text file must be in the same directory as the XSLT file. That can easily be changed by passing a parameter to the stylesheet.

It works in one pass.
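
The same idea, splitting each line on the quote character and translating only the even-numbered pieces, can be sketched in awk for readers without an XSLT processor. This is not part of the answer above: the function name underscore_quoted_awk is made up for illustration, and the loop deliberately stops before the last field so that text following an unmatched quote is left alone.

```shell
# Split each line on '"'; fields 2, 4, ... are the quoted pieces.
# Hypothetical helper name; reads stdin and writes stdout.
underscore_quoted_awk() {
    awk -F'"' -v OFS='"' '{
        # i < NF skips a trailing piece that has no closing quote
        for (i = 2; i < NF; i += 2)
            gsub(/ /, "_", $i)
        print
    }'
}

printf '%s\n' '"a aa"  MM  "bbb  b"' 'MM    MM' 'MM"b b "' | underscore_quoted_awk
```

Setting OFS to the quote character means that any line awk rebuilds after a field is changed gets its quotes back in the same places.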

This would be really easy if the quoted text was all on separate lines. So one approach is to split the text so that it is, do the easy transform, then rebuild the lines.

Splitting the text is easy, but we'll need to distinguish between newlines that were

  1. already present in the file
  2. added by us

To do that, we can end each line with a symbol indicating which class it belongs to. I'll just use 1 and 2, corresponding directly to the list above. In sed, we have:

sed -e 's/$/1/' -e 's/"[^"]*"/2\n&2\n/g'

This produces:

2
"a aa"2
  MM  2
"bbb  b"2
1
MM    MM1
MM2
"b b "2
1

That's easy to transform; just use

sed -e '/".*"/ s/ /_/g' 

giving

2
"a_aa"2
  MM  2
"bbb__b"2
1
MM    MM1
MM2
"b_b_"2
1

Finally, we need to put it back together. This is actually pretty horrible in sed, but feasible using the hold space:

sed -e '/1$/ {s/1$//;H;s/.*//;x;s/\n//g}' -e '/2$/ {s/2$//;H;d}'

(This would be a lot clearer in, e.g., awk.)

Pipe those three steps together and you're done.
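
Put together, the three steps might look like the sketch below. It relies on \n in the replacement text producing a newline, which GNU sed does but not every sed; the function name underscore_quoted is only illustrative. Since 1 and 2 serve as in-band markers, an input line whose own text ends in 2 would trip the final step, so treat this as a sketch for data like f.txt.

```shell
# Hypothetical wrapper around the three sed steps; reads stdin, writes stdout.
# Assumption: GNU sed (newline via \n in the replacement of s///).
underscore_quoted() {
    # 1. Mark original line ends with 1; put each quoted string on its own
    #    line, marking the newlines we add with 2.
    sed -e 's/$/1/' -e 's/"[^"]*"/2\n&2\n/g' |
    # 2. Replace spaces only on the lines that hold a quoted string.
    sed -e '/".*"/ s/ /_/g' |
    # 3. Collect pieces in the hold space and rejoin them at each 1 marker.
    sed -e '/1$/ {s/1$//;H;s/.*//;x;s/\n//g}' -e '/2$/ {s/2$//;H;d}'
}

printf '%s\n' '"a aa"  MM  "bbb  b"' 'MM    MM' 'MM"b b "' | underscore_quoted
```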

These might work for you:

 sed 's/^/\n/;:a;s/\(\n[^"]*"[^ "]*\) \([^"]*"\)\n*/\1_\2\n/;ta;s/\n//;ta;s/\n//' file

Explanation:

Prepend a \n to the start of the line; this will be used to bump along the substitutions. Replace a single space with a _ within the "'s and, whilst it's there, place a \n ready for the next round of substitutions. Having replaced all the spaces, delete the \n and repeat. When all substitutions have occurred, delete the \n delimiter.

or this:

sed -r ':a;s/"/\n/;s/"/\n/;:b;s/(\n[^\n ]*) ([^\n]*\n)/\1_\2/g;tb;s/\n/%%%/g;ta;s/%%%/"/g' file

Explanation:

Replace the first set of ""'s with \n's. Replace the first space between newlines with _, repeat. Replace \n's with a unique delimiter (%%%), repeat from the beginning. Tidy up at the end by replacing all %%% with "'s.

A third way:

sed 's/"[^"]*"/\n&\n/g;$!s/$/@@@/' file |
sed '/"/y/ /_/;1{h;d};H;${x;s/\n//g;s/@@@/\n/g;p};d'

Explanation:

Surround all quoted expressions ("...") with newlines (\n's). Insert an end-of-line delimiter @@@ on all but the last line. Pipe the result to a second sed command. Translate all spaces to _'s for lines with a " in them. Store every line in the hold space (HS). At end of file, swap to the HS, delete all \n's, and replace the end-of-line delimiters with \n's.

Lastly:

sed 's/\("[^"]*"\)/$(tr '"' ' '_'"'<<<'"'"'\1'"'"')/g;s/^/echo /' file | sh

or GNU sed:

sed 's/\("[^"]*"\)/$(tr '"' ' '_'"'<<<'"'"'\1'"'"')/g;s/^/echo /e' file

left for the reader to work out.
