
Extract HTML tags from a text file through iteration, append them to a list, and ignore all other characters in Python

I want to read an HTML file and extract only the tags from it.

  1. Read one character at a time from the file, ignoring everything until a "<" is reached (discard the "<" as well)
  2. Read one character at a time, appending each one to a string until a ">" or whitespace is reached (discard the ">" as well)
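The two steps above can be sketched as a small loop that reads a single character per iteration. This is only a minimal sketch under stated assumptions: the function name `extract_tags` and the `sample` string are hypothetical, `sample` uses a well-formed closing `</h2>` rather than the file contents from the question, and anything after whitespace inside a tag (such as attributes) is simply skipped until the next "<".

```python
import io


def extract_tags(stream):
    """Collect tag names from a text stream, one character at a time."""
    tags = []
    name = None  # None while outside a tag; a list of characters while inside one
    while True:
        ch = stream.read(1)
        if ch == "":  # empty string means end of input
            break
        if name is None:
            # Step 1: ignore everything until "<", and drop the "<" itself
            if ch == "<":
                name = []
        elif ch == ">" or ch.isspace():
            # Step 2: ">" or whitespace ends the tag name; drop the delimiter
            if name:
                tags.append("".join(name))
            name = None
        else:
            name.append(ch)
    return tags


# Hypothetical sample input; io.StringIO stands in for an open file
sample = "<html>\n<body>\n<h1>This is test</h1>\n<h2> This is test 2</h2>\n</body>\n</html>\n"
print(extract_tags(io.StringIO(sample)))
```

The same function should work on a real file object, for example `with open('doc.txt') as f: extract_tags(f)`, since it relies only on `read(1)`.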

      <html>
      <body>
      <h1>This is test</h1>
      <h2> This is test 2<h2>
      </body>
      </html>

      with open('doc.txt', 'r') as f:
          all_lines = []
          # loop through all lines using the f.readlines() method
          for line in f.readlines():
              new_line = []
              # loop through each character in the line
              for chars in line:
                  new_line.append(chars)
              all_lines.append(new_line)

      print(all_lines)

I can iterate through the text file and get a list like the one below:

[['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]

but the expected output should be: [html,body,h1,/h1,/h2,/body,/html]

In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']

Simply use a regex or an HTMLParser.
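For the HTMLParser route, a minimal sketch using the standard library's `html.parser.HTMLParser` might look like the following. The class name `TagCollector` and the inline sample string are assumptions made for illustration, not part of the original answer. Unlike the `<(.*?)>` pattern, the parser reports only the tag name, so attributes do not end up in the result.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records every start and end tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored here; only the tag name is kept
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # prefix closing tags with "/" to match the expected output format
        self.tags.append("/" + tag)


collector = TagCollector()
# Assumed sample input with a well-formed closing </h2>
collector.feed("<html> <body> <h1>This is test</h1> <h2> This is test 2</h2> </body> </html>")
collector.close()
print(collector.tags)
```

To run it on a file instead, the string passed to `feed` could be replaced with the result of `f.read()` on the opened file.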
