
How to get a clear output of html2text in Python?

I have the following Python program:

import urllib.request as urllib2
import html2text

html = urllib2.urlopen("http://www.stern.de/")
page_source = html.read()

h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True

text = h.handle(str(page_source))

print (text)

The output is:

\n \n\n

    * \n Anmelden
\n\n

    * \n 

Sie haben noch keinen Account?

\n Kostenlos neu registrieren

\n \n

\n

How can I filter out the "\n"?

For example, I tried it this way, but it doesn't work:

wordList = text.split()

for word in wordList:
    if word != "\n":
        print (word)

This is the output after splitting:

\n\n
*
\n
Anmelden
\n\n
*
\n
Sie
haben
noch
keinen
Account?
\n
Kostenlos
neu
registrieren
\n
\n
\n

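So my check did not work. How can I check for the \n newline symbol?

OK, I solved it this way. I debugged it and saw that the \n shows up as \\n in the debugger, which means the text contains a literal backslash followed by the letter n rather than a newline character.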

text = text.replace('\\n', '')
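
The literal backslashes most likely come from `str(page_source)`: `read()` returns `bytes`, and calling `str()` on a bytes object yields its repr (`b'...'`) with each newline escaped as two characters. The following is a minimal sketch of decoding the bytes instead. The sample `page_source` is a hypothetical stand-in for the downloaded page, and UTF-8 is an assumption about its encoding.

```python
# Hypothetical stand-in for the bytes returned by urlopen(...).read()
page_source = b"<ul>\n<li>Anmelden</li>\n</ul>"

# str() on bytes gives the repr: a leading b' and literal backslash + n pairs
as_repr = str(page_source)

# decode() produces real text containing real newline characters
# (UTF-8 is an assumption; use the charset the server actually reports)
as_text = page_source.decode("utf-8")

print(as_repr)
print(as_text)
```

Passing the decoded string to `h.handle(...)` instead of `str(page_source)` should avoid the stray `\n` sequences in the first place, so no `replace` would be needed afterwards.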

Have you tried using replace?

# str.replace returns a new string, so assign the result back
text = text.replace('\n', '')
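
Note that `'\n'` and `'\\n'` are different search strings: the first is a single newline character, the second is a backslash followed by `n`. This also explains why the `word != "\n"` check in the question never matched. A small sketch with a hypothetical sample string shows the difference:

```python
# Hypothetical sample: a literal backslash followed by "n",
# as produced by calling str() on a bytes object
text = "Anmelden\\nKostenlos"

# Searches for a real newline character: nothing matches
unchanged = text.replace("\n", "")

# Searches for backslash + n: matches and is replaced by a space
cleaned = text.replace("\\n", " ")

print(unchanged)
print(cleaned)
```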

