XPath: Extract all tags from an HTML page

Question

I'm new to XPath and have run into a problem. I want to extract all of the HTML tags, and only the tags, from a web page.

Example:

<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

My output should be:

["<html>", "<body>", "<h1>", "</h1>", "<p>", "</p>", "</body>", "</html>"]

Answer 1

Try using a regular expression with the re.findall function:

>>> import re
>>> s = '''<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>'''
>>> re.findall('<.*?>', s)
['<html>', '<body>', '<h1>', '</h1>', '<p>', '</p>', '</body>', '</html>']
>>>
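
The regex above works for this sample, but '<.*?>' also matches comments and doctype declarations, keeps any attributes inside the match, and misses tags that span several lines unless re.DOTALL is set. As a hedged alternative, here is a minimal sketch that uses the standard library's html.parser instead of a regex. The names TagCollector and extract_tags are made up for illustration, and the sketch assumes you want bare tag names with attributes dropped:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening and closing tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored so only the bare tag name is recorded.
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def extract_tags(html):
    """Return every start and end tag found in the given HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


sample = """<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

print(extract_tags(sample))
```

Note that html.parser lowercases tag names, and a self-closing tag such as <br/> is reported as both a start and an end tag by default, so adjust the handlers if that is not what you want.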

Solution 1
1 2021-01-04 12:21:00