
sed/grep - get text between two strings (html)

I am trying to extract "pagename" from the following:

<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>

I tried to get it to work using sed, but it only reports "invalid command code".

What line of code would you suggest to get the pagename? By the way, the link is not alone on its line; there is more content on the same line. That should not make a difference, since only what sits between the delimiters matters, right?

Thanks in advance for helping me out!

As you commented, if you want to match the text that starts with <a class="timetable work" href="test.com/ and ends with ?tag=meta376">Test</a>, you can use the following regex:

<a class="timetable.*?<\/a>

Working demo

If you want to grab the content, just surround the regex with a capturing group:

(<a class="timetable.*?<\/a>)

The match is:

MATCH 1
1.  [9-80]  `<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>`
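A note on using this from the command line: as far as I know, the lazy quantifier `.*?` comes from Perl-compatible regex syntax and is not defined for the POSIX patterns that sed and plain grep use. The sketch below assumes GNU grep built with PCRE support (the `-P` flag); the function name `extract_pagename_pcre` and the sample line are made up for illustration.

```shell
#!/bin/sh
# Assumption: GNU grep with PCRE support (-P). This is not POSIX and may be
# missing on BSD/macOS or BusyBox systems.
# extract_pagename_pcre is a hypothetical helper name.
extract_pagename_pcre() {
    # \K drops everything matched so far from the reported match;
    # (?=\?) requires a literal "?" right after the name without consuming it.
    grep -oP '<a class="timetable work" href="https?://[^/"]+/\K[^"?]+(?=\?)'
}

# Assumed sample input: the link from the question with extra content around it.
sample='<p>intro</p><a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a><p>more</p>'
printf '%s\n' "$sample" | extract_pagename_pcre   # prints: pagename
```

Because `-o` prints only the matched part and `\K` discards the prefix, no capturing group or substitution is needed here.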

I think this is what you want:

sed 's_^.*<a [^<>]* href="https*://[^/]*/\([^"?]*\).*$_\1_'
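To see the command above in action, here is a small sketch that feeds it one line with extra content around the link. The wrapper name `extract_pagename` and the sample line are assumptions for the demo, not part of the question.

```shell
#!/bin/sh
# extract_pagename is a hypothetical wrapper around the sed command above.
extract_pagename() {
    # "_" is the s-command delimiter here, so "/" needs no escaping.
    sed 's_^.*<a [^<>]* href="https*://[^/]*/\([^"?]*\).*$_\1_'
}

# Assumed sample input: extra cells before and after the link.
line='<td>09:00</td><a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a><td>end</td>'
printf '%s\n' "$line" | extract_pagename   # prints: pagename
```

Note that lines without a matching link pass through unchanged, because the command has no -n option and no p flag. Add both if you only want the extracted names.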

This gives you exactly what you asked for, using exactly the delimiters you told us to use:

$ sed -n 's|.*<a class="timetable work" href="http://www\.test\.com/\(.*\)?tag=meta376">Test</a>|\1|p' file
pagename
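If a line can hold several such links (an assumption; the question only shows one), a single s command reports at most one name per line. A possible workaround is to split the tags out first with `grep -o` and then trim each one. `grep -o` is a common extension in GNU and BSD grep rather than POSIX, and `list_pagenames` plus the sample URLs are hypothetical.

```shell
#!/bin/sh
# list_pagenames is a hypothetical helper name.
# grep -o is a widespread extension (GNU and BSD grep), not part of POSIX.
list_pagenames() {
    # Step 1: print each matching opening tag on its own line.
    # Step 2: reduce each tag to the path between the host and the "?".
    grep -o '<a class="timetable work" href="[^"]*"' |
        sed -n 's|.*href="https*://[^/]*/\([^"?]*\).*|\1|p'
}

# Assumed sample input: two links on one line.
line='<a class="timetable work" href="http://www.test.com/first?tag=a">A</a> <a class="timetable work" href="http://www.test.com/second?tag=b">B</a>'
printf '%s\n' "$line" | list_pagenames
# prints:
# first
# second
```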

I know it may be tempting to handle this using a regular expression, but here's an alternative.

You are trying to parse some HTML, so use an HTML parser. Here's an example in Perl:

use strict;
use warnings;
use feature qw(say);

use HTML::TokeParser::Simple;
use URI::URL;

my $filename = 'file.html'; 
my $parser = HTML::TokeParser::Simple->new($filename);

while (my $anchor = $parser->get_tag('a')) {
    next unless defined(my $class = $anchor->get_attr('class'));
    next unless $class =~ /\btimetable\b/ and $class =~ /\bwork\b/;    
    my $url = url $anchor->get_attr('href');    
    say substr($url->path, 1);
}

Parse the HTML using HTML::TokeParser::Simple. Loop through the <a> tags, skipping any that don't have the correct classes defined. For the ones that do, use URI::URL to parse the URL and extract the "path" component (which, in your case, would be "/pagename"). As you didn't want the leading slash, I used substr to remove the first character.

Output:

pagename

I know it's much longer than a single regex, but it's also a lot more robust and will continue to work even when the format of your HTML changes slightly in the future. HTML parsers exist for a reason :)

I would use awk for this:

awk -F"[/?]" '/timetable work/ {print $4}' file
pagename

It searches for a line containing timetable work, then prints the fourth field, using / or ? as the separator.
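One caveat: the field number counts every / and ? on the line, so content before the link (which the question says exists) shifts it. A rough sketch that locates the attribute itself instead, using only awk's standard match(), substr() and sub(), could look like this. `pagename_awk` and the sample line are hypothetical, and it only reports the first matching link on each line.

```shell
#!/bin/sh
# pagename_awk is a hypothetical helper name.
pagename_awk() {
    awk 'match($0, /class="timetable work" href="[^"]*"/) {
        url = substr($0, RSTART, RLENGTH)   # the matched attributes only
        sub(/^.*href="/, "", url)           # keep just the href value
        sub(/"$/, "", url)                  # drop the closing quote
        sub("^[a-z]+://[^/]*/", "", url)    # drop scheme and host
        sub(/[?].*$/, "", url)              # drop the query string
        print url
    }'
}

# Assumed sample input: a slash appears before the link, which would shift $4.
line='<td>a/b</td><a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>'
printf '%s\n' "$line" | pagename_awk   # prints: pagename
```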
