I want to retrieve the text inside an HTML tag on a specific line

I am using html.parser and urllib.request. I am not going to use any non-native modules, but I am open to using other native ones if they are necessary. Currently (a portion of) my code looks like this:

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # getpos() returns (line number, column offset)
        if self.getpos()[0] == 167:
            print(data)

The issue I am having is that HTMLParser.getpos always returns a tuple of (1, x), where x is a number that increases each time (though seemingly at random), like this:

(1, 21)
(1, 41)
(1, 51)
(1, 77)
(1, 134)
(1, 206)
(1, 406)
(1, 509)
(1, 553)
(1, 627)
(1, 680)
(1, 784)
(1, 1143)
(1, 1368)

I feel like the whole html.parser module is written in a very stupid way and could have been thought out much better. Obviously it works, but it's counter-intuitive.
Full code:

from urllib.request import *
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
      print(self.getpos())
      if self.getpos()[0] == 167:
        print(data)
parser = MyHTMLParser()
html = urlopen("https://www.azlyrics.com/lyrics/aha/takeonme.html").read()
parser.feed(str(html))

Regarding how to parse data from a div - you should track when you enter the div and exit the div, and accumulate data in between these points. This is easy to do with the library, and lies a lot closer to the actual parsing, although I'm not going to get into a debate about what's stupid and what isn't.
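
As a minimal sketch of that enter/exit tracking (not the code for the page in question), the parser below keeps a depth counter so that a nested div does not end the capture early. The id "song" and the sample markup are made up for illustration; the real page may not offer such an attribute.

```python
from html.parser import HTMLParser


class DivTextParser(HTMLParser):
    """Collect the text found inside any <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id  # hypothetical id, for illustration only
        self.depth = 0              # 0 means "not inside the target div"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1         # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1          # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1         # leaving a nested div or the target itself

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Made-up markup: the target div contains a nested div.
sample = ('<div>menu</div><div id="song">Line one<br>\n'
          '<div>note</div>Line two</div><p>footer</p>')
parser = DivTextParser("song")
parser.feed(sample)
parser.close()
text = "".join(parser.chunks)
print(text)
```

With a plain boolean flag instead of a counter, the closing tag of the nested div would switch capturing off before "Line two" is reached.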

Your problem with line numbers arises because you're using str to convert a bytes object. In the interpreter, you can see why this is a problem:

>>> str(b"ab\nc")
"b'ab\\nc'"

It doesn't actually convert the bytes to the equivalent string, but to a string representation of the bytes object. This means each newline in the bytes object becomes the two literal characters \ and n (shown as \\n in the repr above), so the parser never sees a real line break and always reports line 1. To decode a bytes object, you should use .decode. The following code should work:

from html.parser import HTMLParser
from urllib.request import urlopen

class LyricParser(HTMLParser):
    def get_lyrics(self, html):
        # Reset the state, parse the whole page, and return the collected text.
        self.read_lyrics = False
        self.lyrics = []
        self.feed(html)
        return "".join(self.lyrics)

    def handle_starttag(self, tag, attrs):
        # Start collecting at the div that opens on line 167.
        if tag == "div" and self.getpos()[0] == 167:
            self.read_lyrics = True

    def handle_data(self, data):
        if self.read_lyrics:
            self.lyrics.append(data)

    def handle_endtag(self, tag):
        # Stop collecting at the next closing div.
        if tag == "div":
            self.read_lyrics = False

parser = LyricParser()
page = urlopen("https://www.azlyrics.com/lyrics/aha/takeonme.html")
# Decode the bytes so that real newlines reach the parser.
lyrics = parser.get_lyrics(page.read().decode('utf-8'))
print(lyrics)

For me this correctly outputs something like:

Talking away
I don't know what I'm to say
I'll say it anyway
Today's another day to find you
...
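
To make the str versus .decode difference visible without fetching anything, here is a small sketch that records the line number getpos reports at each start tag. The bytes literal is a made-up stand-in for a downloaded page, and LineRecorder and start_tag_lines are names invented for this example.

```python
from html.parser import HTMLParser


class LineRecorder(HTMLParser):
    """Record the line number reported by getpos() at every start tag."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append(self.getpos()[0])


def start_tag_lines(text):
    recorder = LineRecorder()
    recorder.feed(text)
    recorder.close()
    return recorder.lines


# Made-up page content, standing in for the result of urlopen(...).read().
raw = b"<html>\n<body>\n<div>hi</div>\n</body>\n</html>"

lines_from_str = start_tag_lines(str(raw))               # repr of the bytes
lines_from_decode = start_tag_lines(raw.decode("utf-8"))  # real text

print(lines_from_str)     # [1, 1, 1]
print(lines_from_decode)  # [1, 2, 3]
```

The str version contains no newline characters at all, so every tag is reported on line 1, which matches the (1, x) tuples in the question.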

Having looked at the page, I must conclude you're right: it's bizarrely structured, and the only way to identify the lyrics div is by line number, or maybe by the number of preceding divs. If the line number ever fails, you could try keeping a count of the divs met in handle_starttag.
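
A rough sketch of that div-counting fallback could look like the following. The index 3 and the sample markup are invented for illustration; the right index for the real page would have to be found by inspecting it, and it may change whenever the page layout does.

```python
from html.parser import HTMLParser


class NthDivParser(HTMLParser):
    """Collect the text of the n-th <div> start tag encountered (1-based)."""

    def __init__(self, target_index):
        super().__init__()
        self.target_index = target_index  # assumed index, found by inspection
        self.div_count = 0                # number of <div> start tags seen
        self.depth = 0                    # > 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.div_count += 1
        if self.depth > 0:
            self.depth += 1               # nested div inside the target
        elif self.div_count == self.target_index:
            self.depth = 1                # this is the div we want

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Made-up markup in which the third div holds the wanted text.
sample = ("<div>nav</div>\n<div>banner</div>\n"
          "<div>\nfirst line<br>\nsecond line\n</div>\n<div>footer</div>")
parser = NthDivParser(3)
parser.feed(sample)
parser.close()
nth_text = "".join(parser.chunks).strip()
print(nth_text)
```

Counting start tags is independent of how the server breaks lines, but it is just as sensitive to divs being added or removed earlier in the page.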
