Extracting HTML tags from a text file by iterating, appending them to a list, and ignoring all other characters in Python

Question

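I want to read an HTML file and extract only the tags from it.

Read the file one character at a time, ignoring everything until a "<" is reached (the "<" itself is also discarded).

Then read one character at a time, appending each character to a string, until a ">" or whitespace is reached (the ">" is also discarded).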
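The two steps above can be sketched as a small state machine. This is only a minimal sketch: `extract_tags` is a hypothetical helper name, and `io.StringIO` stands in for the opened file so the example runs on its own.

```python
import io


def extract_tags(stream):
    """Collect tag names from a text stream, reading one character at a time."""
    tags = []
    name = None  # None means we are not currently collecting a tag name
    while True:
        ch = stream.read(1)  # read exactly one character
        if not ch:  # an empty string signals end of input
            break
        if ch == "<":
            name = ""  # start a new tag; the "<" itself is dropped
        elif name is not None:
            if ch == ">" or ch.isspace():
                if name:
                    tags.append(name)
                name = None  # ignore everything up to the next "<"
            else:
                name += ch
    return tags


# Hypothetical sample input standing in for the file contents
sample = "<html>\n<body>\n<h1>This is test</h1>\n</body>\n</html>"
print(extract_tags(io.StringIO(sample)))
# ['html', 'body', 'h1', '/h1', '/body', '/html']
```

With a real file, the same function can be called as `extract_tags(f)` inside `with open('doc.txt', 'r') as f:`. Because collection stops at whitespace, a tag such as `<a href="x">` contributes only `a`.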

Sample file:

    <html>
    <body>
    <h1>This is test</h1>
    <h2> This is test 2<h2>
    </body>
    <html>

Current code:

    with open('doc.txt', 'r') as f:
        all_lines = []
        # loop through all lines using f.readlines() method
        for line in f.readlines():
            new_line = []
            # this is how you would loop through each alphabet
            for chars in line:
                new_line.append(chars)
            all_lines.append(new_line)
    print(all_lines)

I can loop through the text file and get a list like this:

[['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]

But the expected output should be: [html, body, h1, /h1, /h2, /body, /html]

Answer 1

In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']
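The session above presumably relies on `re` having been imported and on `html` holding the file's text. A self-contained sketch of the same call follows; the sample string here is an assumption (a well-formed variant of the input), not the original file.

```python
import re

# Assumed sample text; in practice this could come from open('doc.txt').read()
html = """<html>
<body>
<h1>This is test</h1>
<h2>This is test 2</h2>
</body>
</html>"""

# Non-greedy group: capture everything between "<" and the nearest ">"
tags = re.findall(r"<(.*?)>", html)
print(tags)
# ['html', 'body', 'h1', '/h1', 'h2', '/h2', '/body', '/html']
```

Note that this pattern captures everything inside the angle brackets, so a tag with attributes would come back with its attributes included.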

Just use a regular expression or HTMLParser.
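For the HTMLParser route, a minimal sketch using the standard library's `html.parser` could look like the following. `TagCollector` is a hypothetical class name, and the input string is an assumed sample rather than the original file.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start and end tag name in the order it is seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; attributes are ignored here
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Called for each closing tag; prefix with "/" to match the desired output
        self.tags.append("/" + tag)


collector = TagCollector()
collector.feed("<html><body><h1>This is test</h1></body></html>")
collector.close()
print(collector.tags)
# ['html', 'body', 'h1', '/h1', '/body', '/html']
```

Unlike the regular expression, the parser reports only the tag name, so attributes never end up in the list.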

Solution 1
0 2018-09-08 21:57:04
