How can I use a sed regex to extract HTML and modify a timestamp?

Using curl + grep, I get the following output:

<h3>Serial ID: L322607B2411012</h3>
<span>felipea</span>
<span>2015-10-05 20:06:43 UTC</span>

I'm new to sed, and I want to use it to get just the following output:

L322607B2411012
felipea
20:06:43

I wrote the following regex to get that result:

/<|>|h3|/|span| UTC|.......... /g

I tested it on http://www.regexr.com/ with the following text:

<h3>Serial ID: L322607B2411012</h3>
<span>felipea</span>
<span>2015-10-05 20:06:43 UTC</span>
<h3>Serial ID: L322607B2411135</h3>
<span>tressino</span>
<span>2015-10-05 19:57:10 UTC</span>

It highlighted the matches as needed (image: http://snag.gy/0ge60.jpg), but it doesn't work in the real test with the following command:

curl internalURL | egrep -i '(utc|Serial ID:|tressino|felipea)' | sed 's/<|>|h3|/|span| UTC|.......... /g'

The command above returns the normal output, the same as without the sed regex.

When I escape the slash, it returns the following error:

sed 's/<|>|h3|\/|span| UTC|.......... /g'
sed: -e expression #1, char 35: unterminated `s' command

Can someone point out what I'm doing wrong?

Thanks in advance.

Change the regex as follows:

sed -r 's/<|>|h3|\/|span| UTC//g'

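The substitution command is s/.../.../ , where the first ellipsis ( ... ) is the pattern and the second one is the replacement. Here the replacement is empty, so every match is deleted. Note that | only acts as alternation in extended regular expressions, so sed has to be run with -r (or -E) for this pattern to match.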
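To make the role of the regex dialect concrete, here is a minimal sketch that runs the same substitution once with basic and once with extended regular expressions. The variable names (`line`, `basic`, `extended`) are only for illustration, and the sample line is taken from the question.

```shell
# Sample line taken from the question.
line='<h3>Serial ID: L322607B2411012</h3>'

# Basic regex (the default): "|" is a literal pipe character, so nothing matches
# and the line comes back unchanged.
basic=$(printf '%s\n' "$line" | sed 's/<|>//g')

# Extended regex (-E): "|" separates alternatives, so every "<" and ">" is removed.
extended=$(printf '%s\n' "$line" | sed -E 's/<|>//g')

printf 'basic:    %s\n' "$basic"
printf 'extended: %s\n' "$extended"
```

This likely explains why the original command printed its input untouched: without -r or -E, the pattern only matches text that contains literal pipe characters.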

Edit: As you are actually asking what's going wrong, here's an explanation: in the regex substitution s/<|>|h3|/|span| UTC|.......... /g , the unescaped slash after h3| ends the pattern, so the pattern is <|>|h3| , i.e. < , > , h3 or nothing. The replacement is |span| UTC|.......... , which is what gets inserted at every match when you use the -r option.
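Putting the explanation above to work, the following is one possible sed-only sketch that produces the output the question asks for. It is not the asker's pattern: it swaps the `.......... ` wildcard run for anchored expressions, because sed picks the longest match at a given position and that run can therefore swallow `<h3>Serial ` in one go. The `html` and `result` variables are illustrative, and the input is the sample from the question rather than live curl output.

```shell
# Sample input copied from the question; in practice it would come from curl | egrep.
html='<h3>Serial ID: L322607B2411012</h3>
<span>felipea</span>
<span>2015-10-05 20:06:43 UTC</span>'

# Three substitutions, applied in order to every line:
#   1. delete every tag (anything between "<" and the next ">")
#   2. delete the "Serial ID: " label at the start of a line
#   3. on a "date time UTC" line, keep only the time (capture group 1)
result=$(printf '%s\n' "$html" | sed -E \
  -e 's/<[^>]*>//g' \
  -e 's/^Serial ID: //' \
  -e 's/^[0-9-]+ ([0-9:]+) UTC$/\1/')

printf '%s\n' "$result"
```

Anchoring the last two expressions with ^ and $ keeps them from touching the user-name line, which has neither a label nor a timestamp.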

You will be better off using this simple awk command to get the text between the h3 and span tags:

awk -F '</?(span|h3)>' '{print $2}' file
Serial ID: L322607B2411012
felipea
2015-10-05 20:06:43 UTC
Serial ID: L322607B2411135
tressino
2015-10-05 19:57:10 UTC

PS: Pipe it to another awk to get your desired output:

awk -F '</?(span|h3)>' '{print $2}' file | awk '/ID:/{print $3;next} / UTC/{print $2;next} 1'
L322607B2411012
felipea
20:06:43
L322607B2411135
tressino
19:57:10
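If a single pass is preferred, the two awk stages above could be folded into one program. This is only a sketch under the same assumptions as the pipeline (one tag pair per line, a `Serial ID: ` label, timestamps ending in ` UTC`); the names `html`, `result`, `text` and `parts` are illustrative.

```shell
# Sample input copied from the question.
html='<h3>Serial ID: L322607B2411012</h3>
<span>felipea</span>
<span>2015-10-05 20:06:43 UTC</span>
<h3>Serial ID: L322607B2411135</h3>
<span>tressino</span>
<span>2015-10-05 19:57:10 UTC</span>'

result=$(printf '%s\n' "$html" | awk -F '</?(span|h3)>' '
  {
    text = $2                               # text between the opening and closing tag
    if (sub(/^Serial ID: /, "", text)) {    # serial line: strip the label
      print text
    } else if (text ~ / UTC$/) {            # timestamp line: keep only the time field
      split(text, parts, " ")
      print parts[2]
    } else {                                # anything else, such as the user name
      print text
    }
  }')

printf '%s\n' "$result"
```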

Keep in mind, though, that awk, sed, grep and similar tools are not the best tools for parsing HTML.

TL;DR

Don't parse HTML with regular expressions. Use a tool that supports XPath, such as XmlStarlet.

Example Using XmlStarlet

Given well-formed input such as:

<html>
  <body>
    <h3>Serial ID: L322607B2411012</h3>
    <span>felipea</span>
    <span>2015-10-05 20:06:43 UTC</span>
    <h3>Serial ID: L322607B2411135</h3>
    <span>tressino</span>
    <span>2015-10-05 19:57:10 UTC</span>
  </body>
</html>

you can use XPath to extract the text nodes you want. For example:

$ xmlstarlet sel -t -v '//h3/text() | //span/text()' -n /tmp/foo.html
Serial ID: L322607B2411012
felipea
2015-10-05 20:06:43 UTC
Serial ID: L322607B2411135
tressino
2015-10-05 19:57:10 UTC

You can then munge your timestamps and break your output into records with sed or awk. As an example, consider this one-liner:

$ xmlstarlet sel -t -v '//h3/text() | //span/text()' -n /tmp/foo.html |
    awk '/UTC$/ {print $2 "\n"; next}; {print}'
Serial ID: L322607B2411012
felipea
20:06:43

Serial ID: L322607B2411135
tressino
19:57:10
