Extract HTML tags from a text file by iterating, append them to a list, and ignore all other characters in Python

Question

I want to be able to read an HTML file and extract only the tags from it.

Read one character at a time from the file, ignoring everything until "<" is reached (ignore the "<" as well).

Read one character at a time, appending each to a string until ">" or whitespace is reached (ignore the ">" as well).

doc.txt:

    <html>
    <body>
    <h1>This is test</h1>
    <h2> This is test 2<h2>
    </body>
    </html>

Code:

    with open('doc.txt', 'r') as f:
        all_lines = []
        # loop through all lines using f.readlines() method
        for line in f.readlines():
            new_line = []
            # this is how you would loop through each character
            for chars in line:
                new_line.append(chars)
            all_lines.append(new_line)
    print(all_lines)

I can iterate through the text file and get a list like the one below:

    [['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]

But the expected output should be: [html, body, h1, /h1, /h2, /body, /html]
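The two reading steps described above can be sketched as a small character-by-character scanner. This is only a minimal sketch, not a full HTML parser: the function name extract_tags is hypothetical, and it assumes that every "<" starts a tag and that anything after the first whitespace inside a tag (such as attributes) can be dropped.

```python
def extract_tags(text):
    """Collect tag names by scanning the text one character at a time."""
    tags = []
    name = []        # characters of the tag name currently being collected
    in_name = False  # True between "<" and the first ">" or whitespace

    for ch in text:
        if ch == "<":
            # Start of a tag: drop the "<" itself and begin a new name.
            in_name = True
            name = []
        elif in_name and (ch == ">" or ch.isspace()):
            # End of the tag name: drop the ">" or whitespace as well.
            if name:
                tags.append("".join(name))
            in_name = False
        elif in_name:
            name.append(ch)
        # Any character outside a tag name is ignored.

    return tags


if __name__ == "__main__":
    sample = "<html>\n<body>\n<h1>This is test</h1>\n</body>\n</html>"
    print(extract_tags(sample))
```

To run it on a file, read the contents first, for example with open('doc.txt') as f: tags = extract_tags(f.read()).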

Answer 1

In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']

Simply use a regex, as above (html holds the contents of the file), or an HTMLParser.
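For the HTMLParser route, a minimal sketch using the standard library's html.parser module might look like the following. The class name TagCollector and the helper collect_tags are hypothetical names chosen for illustration. Note that HTMLParser lowercases tag names and reports closing tags without the slash, so the slash is added back here to match the format asked for in the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every opening and closing tag name it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; only the tag name is kept.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Prefix closing tags with "/" to mirror the question's format.
        self.tags.append("/" + tag)


def collect_tags(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    print(collect_tags("<html><body><h1>This is test</h1></body></html>"))
```

Compared with the regex, this approach returns only the tag name even when a tag carries attributes, whereas '<(.*?)>' captures everything between the angle brackets.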

Solution 1
0 2018-09-08 21:57:04
