
Extract matches of a regex capturing group from a file

I want to perform the action named in the title on the Linux command line (a short bash script would also do). The command I tried is:

sed 's/href="([^"])"/$1/g' page.html > list.lst

but obviously it failed.

To be precise, here is my input:

<link rel="stylesheet" type="text/css" href="style/css/colors.css" />
<link rel="stylesheet" type="text/css" href="style/css/global.css" />
<link rel="stylesheet" type="text/css" href="style/css/icons.css" />

The output I want is a comma-separated or space-separated list of all matches in the input file:

style/css/colors.css,style/css/global.css,style/css/icons.css

I think I got the right expression: href="([^"]*)"

but I have no clue how to apply it. sed performs a search and replace, which is not exactly what I want: I only need to keep the matches and throw the rest away, not replace them.

grep href page.html | sed 's/^.*href="\([^"]*\)".*$/\1/' | xargs | sed 's/ /,/g'

This extracts every line that contains href and returns only one href per line. Because the leading .* in the sed expression is greedy, that is the last href on the line, not the first. Also, refer to this post about parsing HTML with regular expressions.
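If a line can hold more than one href, a variation on the pipeline above may help. The following is only a sketch under a few assumptions: the temporary file stands in for page.html, its fourth line (two links on one line) is a made-up addition to the sample input, and grep -o is assumed to be available (GNU and BSD grep both have it, but it is not required by POSIX). It compares the sed-only form, which keeps one href per line, with a grep -o form that keeps every match.

```shell
#!/usr/bin/env bash
# Hypothetical temp file standing in for page.html.
page=$(mktemp)
cat > "$page" <<'EOF'
<link rel="stylesheet" type="text/css" href="style/css/colors.css" />
<link rel="stylesheet" type="text/css" href="style/css/global.css" />
<link rel="stylesheet" type="text/css" href="style/css/icons.css" />
<a href="a.html">A</a> <a href="b.html">B</a>
EOF

# Variant 1: sed only. -n suppresses normal output and the trailing p
# prints only lines where the substitution succeeded, so no separate
# grep is needed. The greedy ^.* means one href per line survives
# (the last one on that line).
per_line=$(sed -n 's/^.*href="\([^"]*\)".*$/\1/p' "$page" | paste -sd, -)

# Variant 2: grep -o prints each match on its own line, so every href
# is kept; sed then strips the href="..." wrapper around the value.
all=$(grep -o 'href="[^"]*"' "$page" \
  | sed 's/^href="\([^"]*\)"$/\1/' \
  | paste -sd, -)

echo "$per_line"
# style/css/colors.css,style/css/global.css,style/css/icons.css,b.html
echo "$all"
# style/css/colors.css,style/css/global.css,style/css/icons.css,a.html,b.html

rm -f "$page"
```

paste -sd, - joins the lines with commas in one step; replacing the comma with a space gives the space-separated form. The usual caveat about parsing HTML with regular expressions still applies to both variants.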
