Extract text from HTML with BeautifulSoup in the same format as it appears in the HTML

Question

Code:

body_text = BeautifulSoup(open(html)).text

In the HTML page, a line such as "1. ETA basis expected metocean conditions" gets split across several lines during extraction. I need to resolve this.

I tried string replacements such as:

body_html = str(BeautifulSoup(open(html_file)))
body_html = body_html.replace('\n', ' ')  # remove all newlines
body_html = body_html.replace('/>', '/>\n')  # add newlines so that text from two different tags is not extracted onto the same line

Sample HTML pages: https://easyupload.io/sh02xi

Is there a better way to extract text in the same format as it appears in the rendered HTML?

Answer 1

html2text works better when the text needs to be extracted in the same format as it appears in the HTML:

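import html2text

h = html2text.HTML2Text()
h.body_width = 0        # do not wrap long lines
h.ignore_links = True   # do not emit link markup
h.ignore_images = True  # do not emit image markup
print(h.handle(html_string))  # html_string: the page's HTML source as a str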
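
If installing html2text is not an option, the same idea can be sketched with the standard library alone: collapse whitespace (including source newlines) inside a block, and start a new line only at block-level tags. This is not the html2text or BeautifulSoup implementation; `BlockTextExtractor`, `html_to_lines`, and the `BLOCK_TAGS` set are hypothetical names chosen for illustration, and the tag list is an assumption that may need adjusting for the real pages.

```python
from html.parser import HTMLParser

# Tags treated as block-level here; this set is an illustrative assumption.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collect text, breaking lines only at block-level tags."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self._parts = []   # text fragments of the line being built

    def flush(self):
        # Join the fragments and collapse every whitespace run,
        # including newlines that come from the HTML source.
        line = " ".join("".join(self._parts).split())
        if line:
            self.lines.append(line)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Note: text inside <script>/<style> is not filtered in this sketch.
        self._parts.append(data)


def html_to_lines(html_string):
    parser = BlockTextExtractor()
    parser.feed(html_string)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return "\n".join(parser.lines)


sample = "<p>1. ETA basis expected\nmetocean <b>conditions</b></p><p>2. Second item</p>"
print(html_to_lines(sample))
# 1. ETA basis expected metocean conditions
# 2. Second item
```

Inline tags such as `<b>` do not trigger a line break, so a sentence that is wrapped in the source or split by inline markup stays on one line.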

Answer 1
0 2020-08-27 11:26:53
