Extract HTML tags from a text file by iteration, append them to a list, and ignore all other characters in Python
I want to read an HTML file and extract only the tags from it.
The idea is to read one character at a time, appending each to a string until a ">" or whitespace is reached (the ">" itself is dropped as well).
<html>
<body>
<h1>This is test</h1>
<h2> This is test 2<h2>
</body>
<html>

with open('doc.txt', 'r') as f:
    all_lines = []
    # loop through all lines using f.readlines() method
    for line in f.readlines():
        new_line = []
        # this is how you would loop through each alphabet
        for chars in line:
            new_line.append(chars)
        all_lines.append(new_line)
print(all_lines)
I can iterate through the text file and get a list like the one below:
[['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]
But the expected output should be: [html,body,h1,/h1,/h2,/body,/html]
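For illustration, the character-by-character idea described above could be sketched roughly as follows. This is only one possible approach, not a definitive answer: the function name extract_tags and the inline sample string are hypothetical, and the sample assumes the closing tags are written as </h2> and </html>. Note that with this sketch the opening h2 tag also ends up in the list.

```python
def extract_tags(text):
    """Collect tag names by scanning the text one character at a time."""
    tags = []
    current = None  # None means we are currently outside a tag
    for ch in text:
        if ch == "<":
            current = ""  # start collecting a new tag name
        elif current is not None:
            if ch == ">" or ch.isspace():
                # end of the tag name: store it, dropping ">" and whitespace
                if current:
                    tags.append(current)
                current = None  # ignore attributes and text until the next "<"
            else:
                current += ch
    return tags


# Hypothetical sample, assuming properly closed tags
sample = """<html>
<body>
<h1>This is test</h1>
<h2> This is test 2</h2>
</body>
</html>"""

print(extract_tags(sample))
# ['html', 'body', 'h1', '/h1', 'h2', '/h2', '/body', '/html']
```

To run it against a file instead of the inline string, the contents could be read with open('doc.txt') and f.read() and passed to the same function. For real-world HTML (comments, attributes containing ">", scripts), the standard library's html.parser module is likely a more robust choice than a hand-written scan.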