How to avoid comments when extracting text from a web page in Python

Question

I am trying to extract only the text from a web page, but the output also includes text that is not visible on the page. It comes from comments in the HTML source, such as "include footer" and "sidebar.php end". Other unwanted content is being extracted as well. Here are the links I am using as test cases:

1) http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html

2) http://www.tutorialspoint.com/cplusplus/index.htm

3) http://www.cplusplus.com/doc/tutorial/program_structure/

(I am using several pages so that I can make sure my code extracts text from any page.)

Here is the code I am having trouble with:

import urllib
from bs4 import BeautifulSoup
url = "http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/" 
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

for script in soup(["script", "style","a","p","li","<!-->","small","<div id=\"footer\">","<div id=\"footer\">","<div id=\"bottom\">"]):
    script.extract()    

text = soup.findAll(text=True)
# Write every extracted string to the file, not only the last one.
fo = open('file.txt', 'w')
for p in text:
    print unicode(p)
    fo.write(unicode(p).encode('utf-8'))
fo.close()

This code uses link 1. When I ran "inspect element" on that page, I found many comments in its source, and this code extracts them as well. Please help.

Answer 1

One way would be to use a regex to strip or skip comments whenever your code encounters text that the regex matches as a comment.
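A minimal sketch of the regex idea, written for Python 3 (the question uses Python 2). The names COMMENT_RE and strip_comments are made up for this illustration, and the pattern assumes comments are well formed and never appear inside script or style content:

```python
import re

# Matches an HTML comment, including one that spans several lines.
# Assumption: comments are well formed (<!-- ... -->) and not nested.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> block removed."""
    return COMMENT_RE.sub("", html)


sample = "<p>Hello</p><!-- include footer --><p>World</p>"
print(strip_comments(sample))
```

The cleaned string can then be passed to whatever extracts the text. A regex does not understand HTML structure, so this is only reasonable as a quick pre-processing step.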

Alternatively, you might be able to use an HTML parser. Python has one built into its standard library.

https://docs.python.org/2/library/htmlparser.html
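A rough sketch of the parser approach, assuming Python 3, where the module linked above is named html.parser. The class VisibleTextParser and the function extract_text are hypothetical names for this example. It relies on the fact that HTMLParser reports comments through handle_comment, which does nothing unless overridden, so comments never reach handle_data:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring comments and script/style content."""

    # Assumption: these are the only tags whose content should be dropped.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments arrive via handle_comment (not overridden here),
        # so only real text nodes are seen by this method.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text nodes of html, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><body><!-- include footer --><p>Hello</p>"
    "<script>var x = 1;</script>"
    "<div>World<!-- sidebar.php end --></div></body></html>"
)
print(extract_text(sample))
```

To drop navigation, footers, or other unwanted sections, the same skip-depth idea could be extended to other tags, though that depends on how each page is structured.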

Answer 1
0 2015-03-21 16:04:57
