Can I get only the part of a string that matches with grep?

Question

I have some HTML that I would like to pull a URL from using grep. Is there an elegant way to do this? So far, I'm using wget to dump the HTML into a tmp.html file. Then, this is what I'm doing:

awk '/<a href=/,/<\/a\>/' tmp.html | grep -v "sha1|md5" |grep -E "*.rpm?" | tail -1

Given a list of strings like the following, I'd like to pull out only the last .rpm URL in the list.

<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

Answer 1

Using GNU awk for the 3rd arg to match() and given this input file:

$ cat file
<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

This might be what you want:

$ cat tst.awk         
match($0,/<a href=.*>(.*\.rpm)<\/a\>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
something-0.0.1-20150227.161014-81-sles11_64.rpm

or this:

$ cat tst.awk
match($0,/<a href="([^"]+\.rpm)".*<\/a\>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm

but without more sample input and the expected output it's a guess.
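If gawk is not available, the same idea can be sketched with POSIX awk, which lacks the 3rd argument to match() but sets RSTART and RLENGTH. This is only a sketch under the same guess about the input: the function name last_rpm_url is made up for illustration, and it assumes the URL sits in a double-quoted href attribute on a single line.

```shell
# Hypothetical helper: print the last href value ending in .rpm,
# using only POSIX awk features (no 3rd argument to match()).
last_rpm_url() {
  awk '
    !/sha1|md5/ && match($0, /href="[^"]+\.rpm"/) {
      # RSTART and RLENGTH delimit the whole match, so skip the
      # 6 characters of href=" and drop the closing quote.
      url = substr($0, RSTART + 6, RLENGTH - 7)
    }
    END { print url }
  ' "$@"
}
```

It reads the files named as arguments, or standard input when none are given, so `last_rpm_url tmp.html` would be the call for the file from the question.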

Answer 2

The -o option causes grep to print only the matched parts instead of the whole matching line. If a line contains more than one match, each of them is printed on its own line.
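A small illustration of that behaviour, using made-up input rather than the HTML from the question. Note that -o is not required by POSIX, although both GNU and BSD grep support it.

```shell
# Made-up input: the first line holds two matches, the second holds none.
matches=$(printf '%s\n' 'files: foo.rpm and bar.rpm' 'nothing here' |
  grep -o '[a-z]*\.rpm')

# Each match is printed on its own line; the non-matching line is dropped.
printf '%s\n' "$matches"
```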

*.rpm? is not a meaningful regular expression: the leading * has nothing to repeat, the . matches any character, and m? makes the m optional. If you want to make the match meaningful, you'll need to be quite precise; possibly something like

grep -o '"[^"]*\.rpm"'

will give you more or less what you are looking for (but it will output the quotes as well, and will not deal with %-escapes in the URL).
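One way to get rid of the quotes and keep only the last hit is to anchor the match on href=, strip the wrapper with sed, and finish with tail. This is a rough sketch: the sample HTML and the example.invalid URLs are invented stand-ins for tmp.html, and the href= prefix is an addition to the pattern above.

```shell
# Hypothetical sample input standing in for tmp.html from the question.
html='<td><a href="http://example.invalid/repo/pkg-1.rpm">pkg-1.rpm</a></td>
<td><a href="http://example.invalid/repo/pkg-1.rpm.sha1">pkg-1.rpm.sha1</a></td>
<td><a href="http://example.invalid/repo/pkg-2.rpm">pkg-2.rpm</a></td>'

# Match only href values that end in .rpm, which skips the .sha1 link,
# then strip the href=" prefix and the closing quote and keep the last hit.
last_url=$(printf '%s\n' "$html" |
  grep -o 'href="[^"]*\.rpm"' |
  sed 's/^href="//; s/"$//' |
  tail -n 1)

printf '%s\n' "$last_url"
```

Because the pattern requires .rpm directly before the closing quote, the separate grep -v step for sha1 and md5 is not needed here. The %-escapes are still left untouched.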

You could probably do better with awk, since you are using that anyway.

Parsing HTML with regular expressions is never going to be as robust or as easy as using a real HTML parser, as has been observed frequently here.

Answer 1
2, accepted, 2015-02-27 19:22:20

Answer 2
1, 2015-02-27 18:12:02
