I am using html.parser and urllib.request. I am not going to use any non-native modules, but I am open to using other native ones if they are necessary. Currently (a portion of) my code looks like this:
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        if self.getpos()[0] == 167:
            print(data)
The issue I am having is that HTMLParser.getpos always returns a tuple of (1, x), where x is a number that increases each time (but seemingly randomly), like this:
(1, 21) (1, 41) (1, 51) (1, 77) (1, 134) (1, 206) (1, 406) (1, 509) (1, 553) (1, 627) (1, 680) (1, 784) (1, 1143) (1, 1368)
I feel like the whole html.parser module is written in a very stupid way and could have been thought out much better. Obviously it works, but it's counter-intuitive.
Full code:
from urllib.request import *
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print(self.getpos())
        if self.getpos()[0] == 167:
            print(data)

parser = MyHTMLParser()
html = urlopen("https://www.azlyrics.com/lyrics/aha/takeonme.html").read()
parser.feed(str(html))
Regarding how to parse data from a div: track when you enter the div and when you exit it, and accumulate the data in between those two points. This is easy to do with the library and maps closely onto how the parser actually works, although I'm not going to get into a debate about what's stupid and what isn't.
Your problem with line numbers is because you're using str to read a bytes object. In the interpreter, you can see why this is a problem:
>>> str(b"ab\nc")
"b'ab\\nc'"
Calling str on a bytes object doesn't decode it into the equivalent string; it returns the string representation of the object. This means each newline in the bytes object becomes the two literal characters \ and n, so the parser sees the whole page as a single line and getpos always reports line 1. To decode a bytes object, you should use .decode. The following code should work:
from html.parser import HTMLParser
from urllib.request import urlopen

class LyricParser(HTMLParser):
    def get_lyrics(self, html):
        # Reset state, parse the page, and return the collected text.
        self.read_lyrics = False
        self.lyrics = []
        self.feed(html)
        return "".join(self.lyrics)

    def handle_starttag(self, tag, attrs):
        # The lyrics div is identified by the line it starts on.
        if tag == "div" and self.getpos()[0] == 167:
            self.read_lyrics = True

    def handle_data(self, data):
        # Only keep text while inside the lyrics div.
        if self.read_lyrics:
            self.lyrics.append(data)

    def handle_endtag(self, tag):
        # Any closing div ends the capture.
        if tag == "div":
            self.read_lyrics = False

parser = LyricParser()
page = urlopen("https://www.azlyrics.com/lyrics/aha/takeonme.html")
# Decode the bytes so that real newlines reach the parser.
lyrics = parser.get_lyrics(page.read().decode('utf-8'))
print(lyrics)
For me this correctly outputs something like:
Talking away
I don't know what I'm to say
I'll say it anyway
Today's another day to find you
...
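If you want to see the str versus .decode difference without fetching the page, here is a minimal sketch. The LineRecorder class, the data_lines helper and the three-line raw sample are made up for illustration; only HTMLParser, feed, close and getpos come from the standard library.

```python
from html.parser import HTMLParser


class LineRecorder(HTMLParser):
    """Hypothetical helper: records getpos()'s line number for each text node."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_data(self, data):
        # Skip whitespace-only text so the output stays readable.
        if data.strip():
            self.lines.append(self.getpos()[0])


def data_lines(text):
    """Parse text and return the line number of every non-blank text node."""
    parser = LineRecorder()
    parser.feed(text)
    parser.close()
    return parser.lines


# Made-up stand-in for the bytes returned by urlopen(...).read().
raw = b"<p>one</p>\n<p>two</p>\n<p>three</p>\n"

# str() yields the repr, which contains no real newlines: every entry is 1.
print(data_lines(str(raw)))

# decode() restores the real newlines, so the line numbers advance.
print(data_lines(raw.decode("utf-8")))  # [1, 2, 3]
```

The first call also reports text nodes for the stray b' prefix and the literal \n sequences, which is another sign that the parser was fed a repr rather than the page itself.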
Having looked at the page, I must conclude you're right: it's bizarrely structured, and the only way to identify the lyrics div is by line number, or maybe by the number of preceding divs. If the line number ever fails, you could try keeping a count of the divs met in handle_starttag.
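A rough sketch of that div-counting fallback, in case it helps. NthDivParser, nth_div_text, target_index and the sample markup are all hypothetical names and data; I have not checked which div index holds the lyrics on the real page, so you would have to work that out yourself. It also tracks nesting depth, so a div nested inside the target does not end the capture early.

```python
from html.parser import HTMLParser


class NthDivParser(HTMLParser):
    """Hypothetical helper: collects the text inside the n-th <div> seen (1-based)."""

    def __init__(self, target_index):
        super().__init__()
        self.target_index = target_index
        self.div_count = 0  # <div> start tags seen so far
        self.depth = 0      # nesting depth inside the target div; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.div_count += 1
        if self.depth:
            self.depth += 1   # a div nested inside the target
        elif self.div_count == self.target_index:
            self.depth = 1    # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1   # reaches 0 when the target div closes

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def nth_div_text(html, target_index):
    """Return the text of the target_index-th <div> in html."""
    parser = NthDivParser(target_index)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Made-up markup standing in for the real page.
sample = (
    "<div>menu</div><div>ads</div>"
    "<div>Talking away<br>\nI don't know</div>"
    "<div>footer</div>"
)
print(nth_div_text(sample, 3))
```

Counting divs is just as fragile as the line number if the site changes its layout, so treat either one as a heuristic rather than a stable selector.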