
Python HTMLParser printing out blank lines

I'm playing around with Python's HTMLParser and having an issue with it printing out blank lines.

from HTMLParser import HTMLParser
import urllib2

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
url = 'http://www.ngccoin.com/price-guide/us/flying-eagle-cents-pscid-16-desig-ms'
req = urllib2.Request(url, headers={'User-Agent' :"Magic Browser"})
response = urllib2.urlopen(req)
html = response.read()

parser = MyHTMLParser()
parser.feed( html )

My issue is that when it hits a data section, it prints bare newlines as well as the actual data. My output looks a lot like:

Encountered some data  :

Encountered some data  : Official Grading Service of
Encountered some data  :

Encountered some data  :

Encountered some data  :

How should I go about getting it to ignore the data that consists of just a newline?

Simply have it ignore data that is just a newline:

def handle_data(self, data):
    if data == '\n':
        return
    print "Encountered some data  :", data

Or, have it ignore any data consisting of only whitespace:

def handle_data(self, data):
    if not data.strip():
        return
    print "Encountered some data  :", data
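To try the whitespace check without fetching the page, here is a minimal sketch that feeds the parser a small inline HTML string and collects the non-blank text in a list instead of printing it. The class name NonBlankDataParser, the chunks attribute, and the sample HTML are made up for illustration. The import fallback is there because the HTMLParser module used in the question is Python 2; in Python 3 the same class lives in html.parser.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question


class NonBlankDataParser(HTMLParser):
    """Hypothetical parser that keeps only text containing non-whitespace."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []  # assumed name: collected text pieces

    def handle_data(self, data):
        # strip() returns '' for '\n', ' ', '\n  ' and similar, so those are skipped
        if not data.strip():
            return
        self.chunks.append(data)


# Made-up sample markup with newlines and indentation between the tags
sample_html = "<div>\n  <p>Official Grading Service of</p>\n\n  <p> </p>\n</div>\n"

parser = NonBlankDataParser()
parser.feed(sample_html)
parser.close()
print(parser.chunks)  # ['Official Grading Service of']
```

The `data == '\n'` check would let the indented `'\n  '` pieces through here, which is why the strip-based test is usually the safer of the two.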

Because the text inside an element can arrive in several pieces (handle_data may be called more than once before the closing tag), the data needs to be aggregated like this:

def handle_data(self, data):
  self.cell += data

Then, later, in the end-tag handler:

def handle_endtag(self, tag):
  self.somevariable = self.cell.strip()
  self.cell = ''

Stripping only at the end, once the whole text has been assembled, removes the surrounding newlines while preserving the formatting inside the data.
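The two fragments above leave out where self.cell is initialised and which tags it applies to. A minimal sketch of how they might fit together, assuming the text of interest sits in td elements (the answer does not say which tag it targets); CellTextParser, in_cell, cells, and the sample table are hypothetical names and data for illustration only.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question


class CellTextParser(HTMLParser):
    """Hypothetical parser that accumulates the text of each <td> cell."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cell = ''        # text gathered for the current cell
        self.in_cell = False  # assumed flag: True while inside a <td>
        self.cells = []       # assumed name: finished, stripped cell texts

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True
            self.cell = ''

    def handle_data(self, data):
        # Text may arrive in pieces (for example around nested tags), so append
        if self.in_cell:
            self.cell += data

    def handle_endtag(self, tag):
        if tag == 'td':
            text = self.cell.strip()  # strip once, after the text is complete
            if text:                  # skip cells that held only whitespace
                self.cells.append(text)
            self.cell = ''
            self.in_cell = False


# Made-up sample: one cell split by a nested tag, one blank cell, one plain cell
sample_html = ("<table><tr>\n<td>\n  Grade <b>A</b>\n</td>\n"
               "<td>\n</td>\n<td>42</td>\n</tr></table>")

cell_parser = CellTextParser()
cell_parser.feed(sample_html)
cell_parser.close()
print(cell_parser.cells)  # ['Grade A', '42']
```

Here the first cell reaches handle_data in three pieces ('\n  Grade ', 'A', '\n'), and stripping after concatenation keeps the inner space while dropping the surrounding newlines.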


 