Can I get only the part of the string that matches with grep?

I have some HTML that I would like to pull a URL from using grep. Is there an elegant way to do this? So far, I'm using wget to dump the HTML into a tmp.html file. Then, this is what I'm doing:

awk '/<a href=/,/<\/a\>/' tmp.html | grep -v "sha1|md5" |grep -E "*.rpm?" | tail -1

Given a list of strings like the following, I'd like to pull out only the last .rpm URL on the list.

<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

Using GNU awk for the 3rd arg to match() and given this input file:

$ cat file
<td><a href="http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm">something-0.0.1-20150227.161014-81-sles11_64.rpm</a></td>

This might be what you want:

$ cat tst.awk         
match($0,/<a href=.*>(.*\.rpm)<\/a\>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
something-0.0.1-20150227.161014-81-sles11_64.rpm

or this:

$ cat tst.awk
match($0,/<a href="([^"]+\.rpm)".*<\/a\>/,a) && !/sha1|md5/ {url=a[1]} END{print url}

$ gawk -f tst.awk file
http://maven-whatever:8081/nexus/content/repositories/snapshots/com/whatever/whatever/adv-svcs/something/0.0.1-SNAPSHOT/something-0.0.1-20150227.161014-81-sles11_64.rpm

but without more sample input and the expected output it's a guess.
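If GNU awk is not available, the same capture can be approximated with POSIX sed. This is a sketch that swaps gawk's match() array for a sed backreference; the sample lines and the example.invalid host are made up for illustration, and the grep -v step mirrors the !/sha1|md5/ test above.

```shell
# Hypothetical sample data standing in for the real listing.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<td><a href="http://example.invalid/repo/something-0.0.1-80.rpm">something-0.0.1-80.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm">something-0.0.1-81.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm.md5">something-0.0.1-81.rpm.md5</a></td>
EOF

# Drop checksum lines, print only the captured href value, keep the last one.
last_href=$(grep -v -e sha1 -e md5 "$tmp" |
  sed -n 's/.*<a href="\([^"]*\.rpm\)".*/\1/p' |
  tail -n 1)

printf '%s\n' "$last_href"
rm -f "$tmp"
```

Like the awk versions, this assumes one link per line and double-quoted href attributes.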

The -o option causes grep to print only the matched parts of each line instead of the full matching line. If a line contains more than one match, all of them are printed, each on its own line.
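A small illustration of the difference, using a made-up input line (the file names a-1.rpm and b-2.rpm are hypothetical). Note that -o is supported by GNU and BSD grep but is not required by POSIX.

```shell
# Hypothetical sample line containing two .rpm names.
line='<a href="a-1.rpm">x</a> <a href="b-2.rpm">y</a>'

# Without -o, grep prints the whole matching line.
whole=$(printf '%s\n' "$line" | grep '[a-z0-9-]*\.rpm')

# With -o, grep prints each matched substring on its own line.
only=$(printf '%s\n' "$line" | grep -o '[a-z0-9-]*\.rpm')

printf '%s\n' "$only"
```

Here the second command prints the two file names on separate lines, while the first prints the input line unchanged.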

*.rpm? looks like a shell glob, not a regular expression: in a regex, . matches any single character and ? makes the preceding m optional. If you want to make the match meaningful, you'll need to be quite precise; possibly something like

grep -o '"[^"]*\.rpm"'

will give you more or less what you are looking for (but it will output the quotes as well, and will not deal with %-escapes in the URL).
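To get from there to just the last URL without the quotes, one option is to pipe through tr and tail. The sketch below uses a pattern in the same spirit with the dot escaped (an assumption about the intent), and hypothetical sample data in place of tmp.html.

```shell
# Hypothetical sample data standing in for tmp.html.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<td><a href="http://example.invalid/repo/something-0.0.1-80.rpm">something-0.0.1-80.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm">something-0.0.1-81.rpm</a></td>
<td><a href="http://example.invalid/repo/something-0.0.1-81.rpm.sha1">something-0.0.1-81.rpm.sha1</a></td>
EOF

# Extract quoted strings ending in .rpm, strip the quotes, keep the last one.
last_url=$(grep -o '"[^"]*\.rpm"' "$tmp" | tr -d '"' | tail -n 1)

printf '%s\n' "$last_url"
rm -f "$tmp"
```

Because the pattern requires the closing quote right after .rpm, the .rpm.sha1 and .rpm.md5 links never match, so no separate grep -v step is needed.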

You could probably do better with awk, since you are using that anyway.

Parsing HTML with regular expressions is never going to be as robust or as easy as using a real HTML parser, as has been observed frequently here.
