
grep and sed regular expressions meaning - extracting urls from a web page

grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

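After searching the web for an answer to my homework question, I finally arrived at the command above. But I don't fully understand the two regular expressions used with sed and grep. Could someone shed some light on them? Thanks in advance.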

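The grep command finds any line that contains a match for the pattern below (and, because of -o, prints only the matching part of the line):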

'<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'

This breaks down as:

<a     the characters <a
[^>]   any single character that is not a closing '>'
\+     the last thing one or more times; together, [^>]\+ means "one or more
       characters other than '>'", which keeps the match inside the tag and
       allows other attributes between <a and href
href   followed by the string 'href'
[ ]*   followed by zero or more spaces (you don't really need the [], just ' *' would be enough)
=      followed by the equals sign
[ \t]* meant as zero or more spaces or tabs ("white space"); note that inside []
       grep takes \t literally (a backslash or a 't'), so [[:blank:]]* is the
       reliable way to write this
"      followed by an opening quote (but only a double quote...)
\(     open bracket (grouping)
ht     characters 'ht'
\|     or
f      character 'f'
\)     close group (of the either-or)
tp     characters 'tp'
s\?    optionally followed by 's'
       Note - the last few lines combined mean 'http or https or ftp or ftps'
:      character ':'
[^"]\+ one or more characters that are not a double quote
       this is "everything until the next quote"
"      the closing double quote
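
To see what the grep stage emits on its own, here is a small sketch. The sample HTML and the function name `match_anchors` are made up for illustration, and it assumes GNU grep, since `\+`, `\?` and `\|` are GNU extensions to basic regular expressions.

```shell
# Hypothetical helper: run only the grep stage of the pipeline on stdin.
match_anchors() {
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'
}

# Made-up sample page: two absolute links and one relative link.
sample='<p><a class="x" href="https://example.com/a">A</a>
<a href = "ftp://example.org/file">B</a>
<a href="/relative/path">C</a></p>'

# -o prints each match on its own line; the relative link is skipped
# because its value does not start with http, https, ftp or ftps.
printf '%s\n' "$sample" | match_anchors
```

With this input the output should be the two fragments `<a class="x" href="https://example.com/a"` and `<a href = "ftp://example.org/file"`. Each still carries the tag prefix and the quotes, which is what the sed stage is there to remove.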

Does this get you started? You can do the same for the next bit...
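
For the sed part, the same kind of breakdown can be sketched as comments on a small helper. The function name `strip_to_url` and the input line are hypothetical; as with grep, `\+` assumes GNU sed.

```shell
# Hypothetical helper: run only the sed stage of the pipeline on stdin.
#   ^.*         greedy: everything from the start of the line up to the
#               last quoted string
#   "           an opening double quote
#   \([^"]\+\)  group 1: one or more characters that are not a double quote
#   "           the closing double quote
#   .*$         whatever is left, up to the end of the line
#   \1          replace the whole line with the contents of group 1
# The trailing g is harmless but redundant: the pattern is anchored with
# ^ and $, so it can match at most once per line.
strip_to_url() {
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
}

printf '%s\n' '<a class="x" href="https://example.com/a"' | strip_to_url
```

This should print `https://example.com/a`. Because `^.*` is greedy, it swallows the earlier `class="x"` attribute and the group captures the last quoted string on the line, which for grep's output is always the URL.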

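A note that may confuse you: the backslash is used to change the meaning of some special characters such as ( ) and +. Just to keep everyone on their toes, whether these are special with or without the backslash is not defined by regex syntax as such, but by the command you use it in (and its options). In the basic syntax that grep and sed use by default, \( and \) group while a bare ( is a literal parenthesis; with the -E flag it is the other way round. For example, sed changes the meaning of things depending on whether you use the -E flag.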
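
As a rough illustration of that point, here is the same pipeline written twice: once in the default basic syntax and once with `-E` (extended syntax), where the backslashes in front of `( ) | + ?` disappear. The function names and the sample line are made up for this sketch.

```shell
# Basic regular expressions (the default): \( \) \| \+ \? are the special
# forms (\| \+ \? are GNU extensions); bare ( ) | + ? are literal.
extract_bre() {
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' |
        sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
}

# Extended regular expressions (-E): ( ) | + ? are special as written.
extract_ere() {
    grep -E -i -o '<a[^>]+href[ ]*=[ \t]*"(ht|f)tps?:[^"]+"' |
        sed -E 's/^.*"([^"]+)".*$/\1/g'
}

# Made-up input with two links on one line.
page='<a href="http://example.com/">home</a> <a id="d" href="ftps://example.net/x">dl</a>'

printf '%s\n' "$page" | extract_bre
printf '%s\n' "$page" | extract_ere
```

Both functions should print the same two lines, `http://example.com/` and `ftps://example.net/x`. The extended form is usually easier to read and does not depend on GNU-specific escapes.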
