
Find and copy to new file using sed

I have multiple lines in a file. Each line has a common start tag and end tag. I want to get the contents between the tags and put them in a new file, separated by /r.

1) I tried the following, but it copies the entire line into the new file:

#!/bin/sh

startline="<Mytag>"
endline="<Nexttag>"

echo $startline
echo $endline

sed "/$startline/,/$endline/!d" input.txt > test.txt

2) Ideally the end tag should be </Mytag>, but sed does not handle the '/' well. How can I overcome this? Should I use '//'?

Thanks


Update


input.txt has the following lines:

<?xml version="1.0" encoding="UTF-8" ?><InputRecord xmlns:xsi= "http://www.w3.org/2001/XMLSchema-instance" <tag1>blah</tag1><mytag>myinfo</mytag><tag2>blah</tag2></InputRecord>

<?xml version="1.0" encoding="UTF-8" ?><InputRecord xmlns:xsi= "http://www.w3.org/2001/XMLSchema-instance" <tag1>blah1</tag1><mytag>myinfo1</mytag><tag2>blah2</tag2></InputRecord>

Expected output:

myinfo
myinfo1

Answer for revised question

Given input:

<?xml version="1.0" encoding="UTF-8" ?><InputRecord xmlns:xsi= "http://www.w3.org/2001/XMLSchema-instance" <tag1>blah</tag1><mytag>myinfo</mytag><tag2>blah</tag2></InputRecord>
<?xml version="1.0" encoding="UTF-8" ?><InputRecord xmlns:xsi= "http://www.w3.org/2001/XMLSchema-instance" <tag1>blah1</tag1><mytag>myinfo1</mytag><tag2>blah2</tag2></InputRecord>

the output should be:

myinfo
myinfo1

Temporarily ignoring the fact that parsing XML with regular expressions is generally not sensible, this can be treated as a request to find the text between a start tag and an end tag on a single line. This translates to:

starttag="<mytag>"
endtag="</mytag>"
sed -n "\%.*$starttag\(.*\)$endtag.*% s//\1/p"

The \% notation is required by POSIX sed to allow the use of something other than a slash as the delimiter for a regular expression. POSIX sed says:

... a context address (which consists of a BRE, as described in Regular Expressions in sed, preceded and followed by a delimiter, usually a <slash>) ...

and:

In a context address, the construction "\cBREc", where c is any character other than <backslash> or <newline>, shall be identical to "/BRE/". If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the BRE. For example, in the context address "\xabc\xdefx", the second x stands for itself, so that the BRE is "abcxdef".
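As a quick check, the command can be run against the two sample records from the question. This is only a sketch: the file name input.txt and the variable result are assumptions made for the demonstration.

```sh
#!/bin/sh
# Recreate the sample input from the question (file name assumed).
cat > input.txt <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?><InputRecord xmlns:xsi= "http://www.w3.org/2001/XMLSchema-instance" <tag1>blah</tag1><mytag>myinfo</mytag><tag2>blah</tag2></InputRecord>

<?xml version="1.0" encoding="UTF-8" ?><InputRecord xmlns:xsi= "http://www.w3.org/2001/XMLSchema-instance" <tag1>blah1</tag1><mytag>myinfo1</mytag><tag2>blah2</tag2></InputRecord>
EOF

starttag="<mytag>"
endtag="</mytag>"

# \%...% selects lines matching the pattern, using % as the delimiter so the
# slash in </mytag> needs no escaping. The empty regex in s//\1/ reuses that
# pattern, and \1 is the text captured between the two tags.
result=$(sed -n "\%.*$starttag\(.*\)$endtag.*% s//\1/p" input.txt)
printf '%s\n' "$result"
```

Because .* is greedy, a line that contained several <mytag> elements would yield only the last one; the sample data has exactly one per line, so that does not arise here.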

Answer for original version of question

Your script should work as written if you get the $endline value correct. However, IMNSHO, it is simpler to be positive about the range to print:

sed -n "/$startline/,/$endline/p" input.txt > test.txt

The -n means 'do not print unless I tell you to', and the script says 'print between the line matching the start line and the line matching the end line'.

For the end tag with the slash in it, you need to escape the slash with a backslash:

endline="<\/Nexttag>"

Or you could use a . in place of the slash, which could in theory match the start of <XNexttag> but probably won't. The absence of the backslash would account for why you got everything from the start line to the end of file.
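If the tag value arrives unescaped, the backslash can be added mechanically instead of by hand. The following is a minimal sketch under assumptions: the file name tags.txt, its contents, and the variable names raw_endline and result are made up for illustration.

```sh
#!/bin/sh
# Hypothetical data with the start and end tags on lines of their own.
cat > tags.txt <<'EOF'
header
<Mytag>
payload
</Mytag>
trailer
EOF

startline="<Mytag>"
raw_endline="</Mytag>"

# Put a backslash before every slash so the value is safe inside /.../.
# The inner sed uses : as the delimiter of its own s command.
endline=$(printf '%s\n' "$raw_endline" | sed 's:/:\\/:g')

# Print only the lines from the start tag through the end tag.
result=$(sed -n "/$startline/,/$endline/p" tags.txt)
printf '%s\n' "$result"
```

This only handles the slash; other characters that are special in a basic regular expression (such as . or *) would need similar treatment if they could occur in the tags.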


On the benefits of positivity

Consider the data file:

line1
line2 start1
line3
line4 end1
line5
line6 start2
line7
line8 end2
line9

And consider the shell and sed commands:

echo Positive Single
sed -n -e '/start1/,/end1/p'  data
echo Negative Single
sed    -e '/start1/,/end1/!d' data

echo Positive Double
sed -n -e '/start1/,/end1/p'  -e '/start2/,/end2/p'  data
echo Negative Double
sed    -e '/start1/,/end1/!d' -e '/start2/,/end2/!d' data

The output from running that script is:

$ sh sed.scripts
Positive Single
line2 start1
line3
line4 end1
Negative Single
line2 start1
line3
line4 end1
Positive Double
line2 start1
line3
line4 end1
line6 start2
line7
line8 end2
Negative Double
$

For the case of a single pattern range to match, the !d formulation and the -n plus p formulation produce the same output.

However, the 'positive double' pattern works fine, producing the answer I'd expect for 'print the lines between start1 and end1 and also the lines between start2 and end2', whereas the 'negative double' pattern no longer works correctly. I'd rather use the extensible version than the version that has to be rewritten when the requirement changes.
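The 'negative double' prints nothing because each !d deletes every line outside its own range, so a line would have to lie inside both ranges to survive, and these two ranges do not overlap. If a delete-based formulation is still wanted, one possible sketch (not taken from the answer above) uses the b command, which with no label branches to the end of the script, to skip a final d for lines inside either range. The file name data matches the example; the variable result is an assumption.

```sh
#!/bin/sh
# Recreate the data file from the example above.
cat > data <<'EOF'
line1
line2 start1
line3
line4 end1
line5
line6 start2
line7
line8 end2
line9
EOF

# Lines inside either range branch past the final d and are auto-printed;
# every other line reaches d and is deleted.
result=$(sed -e '/start1/,/end1/b' -e '/start2/,/end2/b' -e 'd' data)
printf '%s\n' "$result"
```

Adding a third range means adding one more -e '/start3/,/end3/b' before the d, which is about as much work as adding another p command to the positive version.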

To escape the slashes, precede them with a backslash, like this:

<\/Nexttag>

But you only need that because you've chosen to use a slash as your delimiter. You can use any character you want (slash is conventionally chosen because many other languages use it to delimit regexes). So choose a character that won't appear in tags, like a hash #. In an address, the opening occurrence of a non-slash delimiter must be preceded by a backslash; otherwise sed treats a leading # as the start of a comment:

sed "\#$startline#,\#$endline#!d" input.txt > test.txt
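A minimal sketch of a hash-delimited range on made-up data follows. In a context address POSIX sed requires the \cBREc form, so each address opens with \# rather than a bare #. The file name data.txt, its contents, and the variable result are assumptions for the demonstration.

```sh
#!/bin/sh
# Hypothetical data with the start and end tags on lines of their own.
cat > data.txt <<'EOF'
before
<Mytag>
content
</Mytag>
after
EOF

startline="<Mytag>"
endline="</Mytag>"   # no escaping needed: / is not the delimiter here

# \#...# delimits each address with a hash; !d deletes everything
# outside the range from the start tag to the end tag.
result=$(sed "\#$startline#,\#$endline#!d" data.txt)
printf '%s\n' "$result"
```

The same caveat applies in reverse: with # as the delimiter, a tag containing a literal # would need that character escaped.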

This is probably not the best solution, but it produces the expected output for your sample input:

#!/bin/sh

startline="<mytag>"
endline="<\/mytag>"

awk '{ gsub(">", "&\n"); gsub("<", "\n&"); print; }' | sed -e "/$startline/,/$endline/!d" -e "/$startline/d" -e "/$endline/d"

Redirect your sample input to this script, for example like this:

sh script.sh < sample.txt

The awk in the middle just puts a newline after every > and before every <, because the sed script works only if the start and end tags are each on a line of their own. (To be honest, this is really not a great script.)
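To try the two stages without a separate script file, here is a sketch that feeds one shortened record through the same pipeline. The record is invented for illustration, and the variable names record and result are assumptions.

```sh
#!/bin/sh
startline="<mytag>"
endline="<\/mytag>"

# A shortened record, invented for this demonstration.
record='<InputRecord><tag1>blah</tag1><mytag>myinfo</mytag></InputRecord>'

# Stage 1 (awk): put every tag on a line of its own.
# Stage 2 (sed): keep the range from <mytag> to </mytag>, then drop the
# two tag lines themselves, leaving only the text between them.
result=$(printf '%s\n' "$record" |
    awk '{ gsub(">", "&\n"); gsub("<", "\n&"); print; }' |
    sed -e "/$startline/,/$endline/!d" -e "/$startline/d" -e "/$endline/d")
printf '%s\n' "$result"
```

If the text between the tags itself contained < or >, the awk stage would split it across several lines, so this approach is best kept to simple content like the sample.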
