Extract HTML tags from a text file through iteration, append them to a list, and ignore all other characters in Python
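I want to read an HTML file and extract only the tags from it.
The idea is to read one character at a time and append it to a string until a ">" or whitespace is reached (the ">" itself should be dropped as well).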
    <html>
    <body>
    <h1>This is test</h1>
    <h2> This is test 2<h2>
    </body>
    </html>

    with open('doc.txt', 'r') as f:
        all_lines = []
        # loop through all lines using f.readlines() method
        for line in f.readlines():
            new_line = []
            # this is how you would loop through each character
            for chars in line:
                new_line.append(chars)
            all_lines.append(new_line)
    print(all_lines)
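I can iterate over the text file this way and get a list like the following:
[['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]
But the expected output should be: [html, body, h1, /h1, /h2, /body, /html]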
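The character-by-character approach described in the question can be sketched roughly as follows. This is only a minimal sketch, not the asker's code: the function name `extract_tags` and the inline `sample` string (standing in for the contents of doc.txt) are assumptions made for illustration.

```python
def extract_tags(text):
    """Collect tag names by scanning the text one character at a time."""
    tags = []
    current = None  # None means we are currently outside a tag
    for ch in text:
        if ch == "<":
            current = ""  # a tag starts: begin a fresh name
        elif current is not None:
            if ch == ">" or ch.isspace():
                # end of the tag name: keep it if it is non-empty
                if current:
                    tags.append(current)
                current = None  # skip everything up to the next '<'
            else:
                current += ch
    return tags


# Hypothetical stand-in for the contents of doc.txt
sample = (
    "<html>\n<body>\n<h1>This is test</h1>\n"
    "<h2> This is test 2<h2>\n</body>\n</html>\n"
)
print(extract_tags(sample))
```

To run it against the file instead, something like `extract_tags(f.read())` inside the existing `with open('doc.txt', 'r') as f:` block should work. Because the name is cut off at the first whitespace, attributes such as `href="..."` are dropped.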
In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']
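The session above presumes that `re` has already been imported and that `html` holds the file contents as a string. A self-contained sketch under those assumptions might look like this; the inline `html` string and the second pattern, which keeps only the tag name, are illustrative additions rather than part of the original answer.

```python
import re

# Assumed stand-in for the contents of doc.txt
html = """<html>
<body>
<h1>This is test</h1>
<h2> This is test 2<h2>
</body>
</html>
"""

# Non-greedy match of everything between '<' and the next '>'
tags = re.findall(r"<(.*?)>", html)
print(tags)

# Variation: capture only the tag name (with an optional leading '/'),
# so attributes such as href="..." are not included in the result
names = re.findall(r"<\s*(/?\w+)", html)
print(names)
```

For this sample both patterns give the same list; they differ only when tags carry attributes.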
Just use a regular expression or HTMLParser.
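For the HTMLParser route, a minimal sketch using the standard library's `html.parser.HTMLParser` could look like the following. The class name `TagCollector` and the inline `sample` string are assumptions for illustration; in practice the string would come from reading doc.txt.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start and end tag in the order it is encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; attributes are ignored here
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Called for each closing tag; prefix with '/' to tell them apart
        self.tags.append("/" + tag)


# Hypothetical stand-in for the contents of doc.txt
sample = (
    "<html>\n<body>\n<h1>This is test</h1>\n"
    "<h2> This is test 2<h2>\n</body>\n</html>\n"
)

parser = TagCollector()
parser.feed(sample)
parser.close()
print(parser.tags)
```

Compared with a regular expression, the parser copes with attributes, comments, and tags that span several lines, and it reports tag names in lowercase.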