What's wrong with this shell/sed script?
I have about 150 HTML files in a given directory that I'd like to make some changes to. Some of the anchor tags have an href along the following lines: index.php?page=something. I'd like all of those to be changed to something.html. Simple regex, simple script. I can't seem to get it correct, though. Can somebody weigh in on what I'm doing wrong?
Sample HTML, before and after output:
<!-- Before -->
<ul>
<li><a href="#">Apple</a></li>
<li><a href="index.php?page=dandelion">Dandelion</a></li>
<li><a href="index.php?page=elephant">Elephant</a></li>
<li><a href="index.php?page=resonate">Resonate</a></li>
</ul>
<!-- After -->
<ul>
<li><a href="#">Apple</a></li>
<li><a href="dandelion.html">Dandelion</a></li>
<li><a href="elephant.html">Elephant</a></li>
<li><a href="resonate.html">Resonate</a></li>
</ul>
Script file:
#! /bin/bash
for f in *.html
do
sed s/\"index\.php?page=\([.]*\)\"/\1\.html/g < $f >! $f
done
It's your regex, and the fact that the shell is trying to interpret bits of your regex.
First, [.]* matches any number of literal dots. Change it to .* so that it matches any run of characters.
Secondly, enclose the entire regex in single quotes (') to prevent the bash shell from interpreting any of it.
sed 's/"index\.php?page=\(.*\)"/\1\.html/g'
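To see what the shell does to the unquoted expression, it can help to print the argument that sed would actually receive. This is only a sketch: it assumes bash with default globbing options and that no file in the current directory happens to match the pattern, and the variable names are made up for illustration.

```shell
# Unquoted: the shell removes the backslashes before sed ever sees them.
unquoted=$(printf '%s\n' s/\"index\.php?page=\([.]*\)\"/\1\.html/g)

# Single-quoted: the expression reaches sed exactly as written.
quoted=$(printf '%s\n' 's/"index\.php?page=\(.*\)"/\1\.html/g')

printf '%s\n' "$unquoted" "$quoted"
```

In the unquoted form the group markers and the back-reference lose their backslashes, so sed is handed literal parentheses and a literal 1 rather than a capture group and \1.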
Also, instead of < $f >! $f you can just pass the -i switch to sed to have it operate in-place:
sed -i 's/"index\.php?page=\(.*\)"/"\1\.html"/g' "$f"
(Also, as another point, I think in your replacement you want double quotes around the \1.html so that the new URL is quoted within the HTML. I also changed your $f to "$f", because if the file name contains spaces bash will complain.)
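Putting those points together, the loop from the question might look roughly like the sketch below. It runs against a scratch directory rather than the real files; the directory, the file name, and the sample content are invented for the example, and the bare -i form assumes GNU sed (BSD/macOS sed expects -i '' instead).

```shell
#!/bin/bash
# Scratch directory with one hypothetical sample file (note the space in its name).
workdir=$(mktemp -d)
cat > "$workdir/sample page.html" <<'EOF'
<li><a href="#">Apple</a></li>
<li><a href="index.php?page=dandelion">Dandelion</a></li>
EOF

for f in "$workdir"/*.html
do
    # Single quotes keep the shell away from the regex; "$f" survives spaces.
    sed -i 's/"index\.php?page=\(.*\)"/"\1\.html"/g' "$f"
done

cat "$workdir/sample page.html"
```

Trying the command on a copy first is a cheap safeguard, since -i overwrites each file without keeping a backup unless a suffix is given.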
EDIT: as @TimPote notes, the standard way to match something within quotes is either ".*?" (so that the .* is non-greedy) or "[^"]+". Sed doesn't support the former, so try:
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' "$f"
This is to prevent (for example) <a href="index.php?page=asdf">"asdf"</a> from being turned into <a href="asdf">"asdf.html"</a> (where the (.*) captured asdf">"asdf, being greedy).
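The difference is easy to check on that one line. The sketch below feeds it through both expressions; the variable names are illustrative, and [^"][^"]* is used as a portable spelling of [^"]\+, since \+ is a GNU extension to basic regular expressions.

```shell
line='<a href="index.php?page=asdf">"asdf"</a>'

# Greedy: .* runs on to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\(.*\)"/"\1\.html"/g')

# Bracket expression: the capture stops at the first closing quote.
fixed=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\([^"][^"]*\)"/"\1\.html"/g')

printf '%s\n' "$greedy" "$fixed"
```

The first result has .html attached to the link text instead of the href, while the second rewrites only the href value.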
Your .* was too greedy. Use [^"]\+ instead. Plus your quotes were all messed up. Surround the whole thing with single quotes instead, then you can use " without escaping them.
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g'
You can do this whole operation with a single statement using find:
find . -maxdepth 1 -type f -name '*.html' \
-exec sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' {} \+
The following works:
sed "s/\"index\.php?page=\(.*\)\"/\"\1.html\"/g" < 1.html
I think it was mostly the square brackets. Not sure why you had them. Oh, and the entire sed command needs to be in quotes.