Linux / Cygwin: Replace a pattern with the result of another pattern match (sed / find?)

Question

I have a large website of documents that look like this:
<title>DOCTITLE</title>
<h1>Some Title</h1>

I'm trying to use Cygwin to replace DOCTITLE with Some Title in every file.

To be more specific, I need to extract whatever text is between the <h1> tags in each file and replace the literal string "DOCTITLE" with the extracted text.

Here's one thought that doesn't work but illustrates the spirit of what I'm after:

find . -name "*html"  
       -exec sed -i 
                's/DOCTITLE/'$(grep "h1" | sed 's/<h1>\(.*\)<\/h1>/\1/')'/'
'{}' /;

Unsurprisingly, this fails because grep has no input, and it would destroy the <h1> anyway.

Any ideas?

Thanks for your time and expertise!

Answer 1

This might work for you (GNU sed):

find . -name "*html" -exec sed -i '$!N;s/DOCTITLE\([^\n]*\n<h1>\([^<]*\)<\/h1>\)/\2\1/;P;D' {} \;

This will need extensive testing first!
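
One way to do that testing without touching the real site is to run the command on a throwaway copy first. The sketch below assumes GNU sed and find; the scratch directory and the sample page.html are made up for illustration. Because the sed program only ever holds two adjacent lines, it sees the <h1> only when it sits on the line directly after the DOCTITLE line, as in the question's sample.

```shell
# Scratch directory so no real documents are touched (hypothetical sample data).
demo_dir=$(mktemp -d)
printf '%s\n' '<title>DOCTITLE</title>' '<h1>Some Title</h1>' '<p>body</p>' > "$demo_dir/page.html"

# Same sed program as in the answer, run on the scratch copy only.
find "$demo_dir" -name '*html' -exec sed -i '$!N;s/DOCTITLE\([^\n]*\n<h1>\([^<]*\)<\/h1>\)/\2\1/;P;D' {} \;

# Inspect the result by eye before trying it on anything that matters.
cat "$demo_dir/page.html"
```

If the first line of the output now carries the heading text and the remaining lines are unchanged, the same command can be tried on a copy of a few real pages.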

Answer 2

Your approach, using $( … ), won't work, since find's -exec argument doesn't handle that syntax: the calling shell expands $( … ) once, before find even starts, rather than once per file. What we can do instead, however, is call bash to do that bit of work for us:

find . -name '*.html' -exec /bin/bash -c 'sed -i "s/DOCTITLE/$(sed -n '\''\,<h1>.*</h1>,{s,<h1>\(.*\)</h1>,\1,p;q}'\'' '\''{}'\'')/" "{}"' \;

The outer sed does exactly what your sed command does. The inner $( … ) part is expanded by bash to produce only the text between the first pair of <h1> tags (it'd be much simpler if it didn't need to only get that first match).

Specifically, that inner sed prints nothing by default (the -n), then for lines that match the regex <h1>.*</h1>, it runs s,<h1>\(.*\)</h1>,\1,p;q, i.e. strip the HTML tags, print the result, then quit; that q ensures we only print out the first match.
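
To see what the inner sed produces on its own, it can be run against a small sample file. This is a minimal sketch, assuming GNU sed; the temporary file and its two headings are invented for the demonstration.

```shell
# Hypothetical sample with two headings; only the first should come back.
sample=$(mktemp)
printf '%s\n' '<title>DOCTITLE</title>' '<h1>First Heading</h1>' '<h1>Second Heading</h1>' > "$sample"

# -n: print nothing by default; on the first line matching <h1>.*</h1>,
# strip the tags, print what is left, then quit.
first_heading=$(sed -n '\,<h1>.*</h1>,{s,<h1>\(.*\)</h1>,\1,p;q}' "$sample")

printf '%s\n' "$first_heading"
```

Removing the q from the program should make both headings appear, which is the case the answer is guarding against.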

Note I've avoided needing to use grep by using sed -n; you could alternatively do the same thing with the command below, using grep's -m option to limit it to the first match.

find . -name '*.html' -exec /bin/bash -c 'sed -i "s/DOCTITLE/$(grep -m1 '\''<h1>.*</h1>'\'' '\''{}'\'' | sed '\''s,<h1>\(.*\)</h1>,\1,'\'')/" "{}"' \;
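
Why the -m1 matters can be checked on a sample file. This is a rough sketch with invented data: without the limit, every matching heading comes back on its own line, and a multi-line result pasted into the outer s/DOCTITLE/…/ would most likely be rejected by sed as an unterminated s command.

```shell
# Hypothetical sample containing two headings.
sample=$(mktemp)
printf '%s\n' '<h1>First Heading</h1>' '<p>text</p>' '<h1>Second Heading</h1>' > "$sample"

# Without -m1: one output line per matching heading.
all_matches=$(grep '<h1>.*</h1>' "$sample" | sed 's,<h1>\(.*\)</h1>,\1,')

# With -m1: grep stops reading after the first matching line.
first_match=$(grep -m1 '<h1>.*</h1>' "$sample" | sed 's,<h1>\(.*\)</h1>,\1,')

printf 'without -m1: %s line(s)\n' "$(printf '%s\n' "$all_matches" | wc -l)"
printf 'with -m1: %s\n' "$first_match"
```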

In both cases, there's some mildly horrific quoting going on: the '\'' sequences are to insert a single quote into a single-quoted string. We need to quote the sed statements to ensure any spaces in the titles don't cause problems, and we need to quote the file names to be able to handle spaces in file names.
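
Much of that quoting can be sidestepped by letting find pass the file name to bash as a positional parameter instead of splicing {} into the script text. The following is only a sketch of that variation, not the command from the answer: it assumes GNU sed and find, uses a made-up scratch file, and adds one step the original commands do not have, namely escaping \, / and & in the heading so the outer s/// treats it as literal text.

```shell
# Scratch copy with a space in the file name and "/" and "&" in the heading.
demo_dir=$(mktemp -d)
printf '%s\n' '<title>DOCTITLE</title>' '<h1>Q&A / FAQ</h1>' > "$demo_dir/my page.html"

# find hands each path to bash as $1 (the _ fills $0), so the script text
# itself never contains {} and needs no nested single quotes.
find "$demo_dir" -name '*.html' -exec /bin/bash -c '
  file=$1
  # Text of the first <h1>...</h1> line, tags stripped (same inner sed as above).
  title=$(sed -n "\,<h1>.*</h1>,{s,<h1>\(.*\)</h1>,\1,p;q}" "$file")
  # Leave the file alone when no heading was found.
  [ -n "$title" ] || exit 0
  # Backslash-escape \, / and & so the outer s/// takes the title literally.
  # Inside double quotes each \\ reaches sed as a single backslash.
  safe_title=$(printf "%s\n" "$title" | sed "s|[\\\\/&]|\\\\&|g")
  sed -i "s/DOCTITLE/$safe_title/" "$file"
' _ {} \;

cat "$demo_dir/my page.html"
```

The trade-off is a multi-line script in place of a one-liner. In exchange, file names with quotes or spaces and headings containing sed metacharacters should no longer interfere with the substitution.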

Answer 1
0 2012-07-11 07:38:47

Answer 2
0 Accepted 2012-07-11 13:05:53
