How to use a sed regex to extract HTML and modify timestamps?

Question

Using curl + grep I get the following output:

<h3>Serial ID: L322607B2411012</h3>
<span>felipea</span>
<span>2015-10-05 20:06:43 UTC</span>

I'm new to sed and I want to use it to get just the following output:

L322607B2411012
felipea
20:06:43

I wrote the following regex to get that result:

/<|>|h3|/|span| UTC|.......... /g

I tested it on http://www.regexr.com/ with this text:

<h3>Serial ID: L322607B2411012</h3>
<span>felipea</span>
<span>2015-10-05 20:06:43 UTC</span>
<h3>Serial ID: L322607B2411135</h3>
<span>tressino</span>
<span>2015-10-05 19:57:10 UTC</span>

It highlighted the matches as needed (image: http://snag.gy/0ge60.jpg), but it doesn't work in the real test, which uses the following command:

curl internalURL | egrep -i '(utc|Serial ID:|tressino|felipea)' | sed 's/<|>|h3|/|span| UTC|.......... /g'

The command above returns the unmodified output, the same as without the sed step.

Escaping the slash returns the following error:

sed 's/<|>|h3|\/|span| UTC|.......... /g'
sed: -e expression #1, char 35: unterminated `s' command

Can someone point out what I'm doing wrong?

Thanks in advance.

Answer 1

Change the regex as follows:

sed -r 's/<|>|h3|\/|span| UTC//g'

The substitution command is s/.../.../, where the first ellipsis (...) is the pattern and the second one is the replacement.

Edit: Since you are actually asking what is going wrong, here is an explanation. In the substitution s/<|>|h3|/|span| UTC|.......... /g, the pattern is <|>|h3|, i.e. <, >, h3 or nothing. The replacement is |span| UTC|.......... , which is what you get all over the output with the -r option. Without -r, sed uses basic regular expressions, in which | is a literal character, so the pattern never matches and the output comes back unchanged.
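To make the difference concrete, here is a minimal sketch, assuming a sed that accepts -E (the portable spelling of GNU's -r). The `sample` and `strip_fields` functions are hypothetical helpers, not part of the question; `sample` stands in for the curl | egrep output. The second command goes further than the one-line fix above and also drops the "Serial ID: " label and the date, which is one possible way to reach the output the question asks for.

```shell
# Hypothetical stand-in for the curl | egrep output from the question.
sample() {
  printf '%s\n' \
    '<h3>Serial ID: L322607B2411012</h3>' \
    '<span>felipea</span>' \
    '<span>2015-10-05 20:06:43 UTC</span>'
}

# Basic regex (no -E): "|" is a literal character, so nothing matches
# and the input passes through unchanged.
sample | sed 's/<|>|h3|\/|span| UTC//g'

# Extended regex (-E): remove the tags, the "Serial ID: " label,
# the leading date, and the trailing " UTC".
strip_fields() {
  sed -E 's/<[^>]*>//g; s/^Serial ID: //; s/^[0-9]{4}-[0-9]{2}-[0-9]{2} //; s/ UTC$//'
}
sample | strip_fields
```

Matching whole tags with `<[^>]*>` rather than listing `h3`, `span` and `/` separately avoids deleting those letters when they appear inside the text itself.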

Answer 2

You will be better off using this simple awk command to get the text between the h3 and span tags:

awk -F '</?(span|h3)>' '{print $2}' file
Serial ID: L322607B2411012
felipea
2015-10-05 20:06:43 UTC
Serial ID: L322607B2411135
tressino
2015-10-05 19:57:10 UTC

PS: Pipe to another awk to get your desired output:

awk -F '</?(span|h3)>' '{print $2}' file | awk '/ID:/{print $3;next} / UTC/{print $2;next} 1'
L322607B2411012
felipea
20:06:43
L322607B2411135
tressino
19:57:10

Though keep in mind that awk, sed, grep, etc. are not the best tools for parsing HTML text.
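If the second process is unwanted, the two awk stages can probably be folded into one. The following is only a sketch under the same assumptions as the commands above (each line holds exactly one h3 or span element); `extract_fields` is a hypothetical name chosen for illustration.

```shell
# Single awk pass: split on the tags, then trim the label or the date.
extract_fields() {
  awk -F '</?(span|h3)>' '
    { text = $2 }                                   # text between the tags
    text ~ /^Serial ID: / { sub(/^Serial ID: /, "", text) }
    text ~ / UTC$/ { split(text, parts, " "); text = parts[2] }  # keep the time only
    { print text }
  '
}

printf '%s\n' \
  '<h3>Serial ID: L322607B2411135</h3>' \
  '<span>tressino</span>' \
  '<span>2015-10-05 19:57:10 UTC</span>' | extract_fields
```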

Answer 3

TL;DR

Don't parse HTML with regular expressions. Use a tool that supports XPath, such as XmlStarlet.

Example Using XmlStarlet

Given well-formed input such as:

<html>
  <body>
    <h3>Serial ID: L322607B2411012</h3>
    <span>felipea</span>
    <span>2015-10-05 20:06:43 UTC</span>
    <h3>Serial ID: L322607B2411135</h3>
    <span>tressino</span>
    <span>2015-10-05 19:57:10 UTC</span>
  </body>
</html>

you can use XPath to extract the text nodes you want. For example:

$ xmlstarlet sel -t -v '//h3/text() | //span/text()' -n /tmp/foo.html
Serial ID: L322607B2411012
felipea
2015-10-05 20:06:43 UTC
Serial ID: L322607B2411135
tressino
2015-10-05 19:57:10 UTC

You can then munge your timestamps and break your output into records with sed or awk. As an example, consider this one-liner:

$ xmlstarlet sel -t -v '//h3/text() | //span/text()' -n /tmp/foo.html |
    awk '/UTC$/ {print $2 "\n"; next}; {print}'
Serial ID: L322607B2411012
felipea
20:06:43

Serial ID: L322607B2411135
tressino
19:57:10
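For readers who prefer sed for this last step, a rough equivalent of the awk stage might look like the sketch below. It assumes a sed that accepts -E, and `munge_records` is a hypothetical helper that reads the text lines printed by xmlstarlet on standard input; it does not call xmlstarlet itself.

```shell
# On lines ending in " UTC", keep only the time and append an empty line
# (G appends a newline plus the empty hold space) to separate records.
munge_records() {
  sed -E '/ UTC$/{s/^[^ ]+ ([^ ]+) UTC$/\1/;G;}'
}

printf '%s\n' \
  'Serial ID: L322607B2411012' \
  'felipea' \
  '2015-10-05 20:06:43 UTC' \
  'Serial ID: L322607B2411135' \
  'tressino' \
  '2015-10-05 19:57:10 UTC' | munge_records
```

Using `G` instead of `\n` in the replacement keeps the command usable on sed implementations that do not interpret `\n` there.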

Answer 1: score 1, accepted, 2015-10-05 21:45:43
Answer 2: score 0, 2015-10-05 21:46:11
Answer 3: score 0, 2015-10-05 22:30:42
