Extract text from HTML in the same format as it appears in the HTML, using BeautifulSoup

Code:

from bs4 import BeautifulSoup
body_text = BeautifulSoup(open(html), 'html.parser').text  # html is the path to the HTML file

In the HTML page, a line such as "1. ETA basis expected metocean conditions" gets split across several lines during extraction. I need to resolve this.

I tried string replacements such as:

body_html = str(BeautifulSoup(open(html_file), 'html.parser'))
body_html = body_html.replace('\n', ' ')  # remove all newlines from the source
body_html = body_html.replace('/>', '/>\n')  # add a newline after each '/>' so text from different tags is not extracted onto the same line

Sample HTML pages: https://easyupload.io/sh02xi

Is there a better way to extract text in the same format as it is displayed in the HTML page?

html2text works better for text extraction when the text needs to be extracted in the same format as in the HTML:

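import html2text

h = html2text.HTML2Text()
h.body_width = 0  # do not wrap long lines
h.ignore_links = True  # drop link targets
h.ignore_images = True  # drop images
print(h.handle(html_string))  # html_string holds the HTML source as a str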
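
If adding a dependency is not an option, a similar effect can be roughly sketched with the standard library's html.parser. This is not what html2text does internally. It is only a minimal illustration of the idea: collapse whitespace (including newlines in the source) inside text, and break lines only at block-level tags. The names BLOCK_TAGS, BlockTextExtractor and html_to_lines are made up for this example, and the set of block-level tags is an assumption that would need extending for real pages.

```python
from html.parser import HTMLParser

# Tags treated as block-level in this sketch (an assumption; extend as needed).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collect text, starting a new line only at block-level tags."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.current = []  # text fragments of the line being built

    def flush(self):
        # Join the fragments, then collapse every run of whitespace
        # (including newlines from the HTML source) into a single space.
        line = " ".join("".join(self.current).split())
        if line:
            self.lines.append(line)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Note: text inside <script> and <style> also arrives here and is not filtered.
        self.current.append(data)


def html_to_lines(html_string):
    parser = BlockTextExtractor()
    parser.feed(html_string)
    parser.close()   # process any buffered trailing text
    parser.flush()   # emit the last line if no closing block tag followed it
    return "\n".join(parser.lines)


sample = "<p>1. ETA basis\nexpected <b>metocean</b> conditions</p>\n<p>2. Second item</p>"
print(html_to_lines(sample))
```

With this sample, the newline inside the first paragraph is collapsed into a space, so each paragraph ends up on a single line. Tables, nested lists and preformatted text are not handled, which is where a dedicated library such as html2text is likely the better choice.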
