
How to avoid comments while extracting text from web pages in Python

I am trying to extract only the visible text from a web page, but the output also includes text that is not shown on the page and exists only as comments in the HTML source, such as "include footer" and "sidebar.php end". Other unwanted content is coming through as well. These are the links I am using as test cases:

1) http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html

2) http://www.tutorialspoint.com/cplusplus/index.htm

3) http://www.cplusplus.com/doc/tutorial/program_structure/

(so that I can make sure my code extracts text from any page)

Here is the code I am having trouble with:

import urllib
from bs4 import BeautifulSoup
url = "http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/" 
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

for script in soup(["script", "style","a","p","li","<!-->","small","<div id=\"footer\">","<div id=\"footer\">","<div id=\"bottom\">"]):
    script.extract()    

text = soup.findAll(text=True)
for p in text:
    print unicode(p)
fo = open('file.txt', 'w')
fo.seek(0, 2)
fo.writelines( unicode(p) )
fo.close()

In this code I used link 1. When I ran "inspect element" on that page, I found many comments in the source, and this code extracts them as well. Please help.

One way would be to use a regex to strip or skip comments: whenever the regex matches part of the input as a comment, remove or ignore that part.
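As a rough illustration of the regex idea, here is a minimal sketch written for Python 3 (the question's code is Python 2). The names `COMMENT_RE`, `strip_comments`, and `sample` are made up for this example. It applies the pattern to the whole document rather than line by line, on the assumption that comments may span several lines. Regexes are fragile on real-world HTML, so treat this as a quick filter rather than a robust solution.

```python
import re

# Matches an HTML comment; DOTALL lets "." cross line breaks,
# and the non-greedy ".*?" stops at the first closing "-->".
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(html):
    """Return html with every <!-- ... --> comment removed."""
    return COMMENT_RE.sub("", html)

# Hypothetical input resembling the comments mentioned in the question.
sample = "<p>Hello</p><!-- include footer --><p>World</p>"
print(strip_comments(sample))
```

The cleaned string could then be passed to whatever text extraction step follows.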

Alternatively, you might be able to use an HTML parser. Python has one built into its standard library:

https://docs.python.org/2/library/htmlparser.html
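A minimal sketch of the parser approach, assuming Python 3, where the module linked above is available as `html.parser`. The class `TextExtractor`, the helper `extract_text`, and the sample input are hypothetical names for illustration. Comments are dropped simply because `handle_comment` is not overridden and the base class ignores them; skipping `script` and `style` contents is an extra assumption about what counts as unwanted text.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes; comments are ignored by the base class."""

    SKIP = {"script", "style"}  # tags whose contents are not page text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def extract_text(html):
    """Return the text of html, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Hypothetical input resembling the comments mentioned in the question.
sample = ("<div><p>Hello</p><!-- sidebar.php end -->"
          "<script>var x = 1;</script><p>World</p></div>")
print(extract_text(sample))
```

To drop further elements such as a footer, the same depth-counting idea could be extended, though matching on attributes like `id` would need extra bookkeeping in `handle_starttag`.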
