I'm trying to scrape the following URL:
link='https://www.opensubtitles.org/en/subtitleserve/sub/6646133'
When I do
html = requests.get(link)
the value of
html.content
is gibberish (starting with b'PK\\x03\\x04\\x14\\x00\\x00\\x00\\x08\\x00z\\x8c8Q\\xd5H\\xc5\\xd7\\xaf7\\x00\\x00\\xdf\\x95\\x00\\x00^\\x00\\x00\\x00
...)
Why am I not getting clear text?
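The response is a ZIP archive rather than plain text: the leading PK\x03\x04 bytes are the ZIP file signature. You can use the zipfile module
to unzip it in memory and then check the filenames. If you are interested in extracting the .srt files, the following will get their content: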
import io
import zipfile

import requests

# Request the download, sending the subtitle page as the referer.
r = requests.get(
    "https://www.opensubtitles.org/en/subtitleserve/sub/6646133",
    headers={
        "referer": "https://www.opensubtitles.org/en/subtitles/6646133/america-s-got-talent-audition-1-en"
    },
)

# Open the downloaded bytes as an in-memory ZIP archive.
z = zipfile.ZipFile(io.BytesIO(r.content))
filenames = z.namelist()
print(filenames)

# Read each .srt member; z.read() returns bytes, not str.
srt_files = [t for t in filenames if t.endswith(".srt")]
for t in srt_files:
    content = z.read(t)
    print(content)
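Since z.read() returns bytes, the printed subtitles will still show up as a b'...' literal. A minimal sketch of one way to get readable text is below. It assumes the payload is the ZIP body from r.content and that the subtitles are UTF-8 encoded; the helper name extract_srt_texts and the default encoding are illustrative choices, not part of the original answer.

```python
import io
import zipfile


def extract_srt_texts(payload: bytes, encoding: str = "utf-8") -> dict:
    """Return {filename: decoded text} for every .srt member of a ZIP payload.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    buffer = io.BytesIO(payload)
    # is_zipfile() accepts a file-like object and reports whether it is a ZIP.
    if not zipfile.is_zipfile(buffer):
        raise ValueError("payload is not a ZIP archive")

    texts = {}
    with zipfile.ZipFile(buffer) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".srt"):
                # errors="replace" avoids a crash if the assumed encoding is wrong.
                texts[name] = archive.read(name).decode(encoding, errors="replace")
    return texts
```

With the request above, extract_srt_texts(r.content) would give a dictionary keyed by filename. If the text looks garbled, the file probably uses a different encoding, which can be passed as the second argument.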