Why can't I get track titles from url?
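I am trying to write a Python script that uses BeautifulSoup to scrape the track titles from this Internet Archive page. I would like to be able to output:
391106 - Bruce-Partington Plans
400311 - The Retired Colourman
...
But I can't find the tags. Here is my script: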
#!/usr/bin/env python
import getopt, sys
# screen scraping stuff
import urllib2
import re
from bs4 import BeautifulSoup

def usage ( msg ):
    print """
usage: get_titles_sherlockholmes_basil.py
%s
""" % ( msg )
#end usage

def output_html ( url ):
    soup = BeautifulSoup(urllib2.urlopen( url ).read())
    #title = soup.find_all("div", class_="ttl")
    #titles = soup.find_all(class_="ttl")
    #titles = soup.find_all('<div class="ttl">')
    #titles = soup.select("div.ttl")
    #titles = soup.find_all("div", attrs={"class": "ttl"})
    #titles = soup.find_all("div", class_="jwrow")
    #titles = soup.find_all("div", id="jw6_list")
    titles = soup.find_all(id="jw6_list")
    for title in titles:
        print "%s <br>\n" % title
# end output_html

url = 'http://archive.org/details/HQSherlockRathboneTCS'
output_html ( url )
print "<br>-------------------<br>"
sys.exit()
I can't figure out what I'm doing wrong. Any help is appreciated.
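The problem is that the playlist is built in the browser by JavaScript, so the elements you are searching for are not in the HTML that urllib2 downloads. The actual track list sits in a JavaScript array inside a script tag: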
<script type="text/javascript">
Play('jw6',
[{"title":"1. 391106 - Bruce-Partington Plans","image":"/download/HQSherlockRathboneTCS/391106.png","duration":1764,"sources":[{"file":"/download/HQSherlockRathboneTCS/391106.mp3","type":"mp3","height":"0","width":"0"}],"tracks":[{"file":"https://archive.org/stream/HQSherlockRathboneTCS/391106.png&vtt=vtt.vtt","kind":"thumbnails"}]},
{"title":"2. 400311 - The Retired Colourman","image":"/download/HQSherlockRathboneTCS/400311.png","duration":1755,"sources":[{"file":"/download/HQSherlockRathboneTCS/400311.mp3","type":"mp3","height":"0","width":"0"}],"tracks":[{"file":"https://archive.org/stream/HQSherlockRathboneTCS/400311.png&vtt=vtt.vtt","kind":"thumbnails"}]},
...
{"title":"32. 460204 - The Cross of Damascus","image":"/download/HQSherlockRathboneTCS/460204.png","duration":"1720.07","sources":[{"file":"/download/HQSherlockRathboneTCS/460204.mp3","type":"mp3","height":"0","width":"0"}],"tracks":[{"file":"https://archive.org/stream/HQSherlockRathboneTCS/460204.png&vtt=vtt.vtt","kind":"thumbnails"}]}],
{"start":0,"embed":null,"so":false,"autoplay":false,"width":0,"height":0,"audio":true,"responsive":true,"expand4wideVideos":false,"flash":false,"startPlaylistIdx":0,"identifier":"HQSherlockRathboneTCS","collection":"oldtimeradio","waveformer":"jw-holder","hide_list":false});
</script>
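One way to confirm this diagnosis is to inspect the raw HTML without running any JavaScript. The sketch below is only an illustration: it assumes Python 3, uses the standard-library HTMLParser instead of BeautifulSoup, and runs against a hypothetical, heavily shortened stand-in for the page (SAMPLE_HTML and PageProbe are made-up names, not part of the original answer).

```python
from html.parser import HTMLParser

# Hypothetical, heavily shortened stand-in for the downloaded page source.
SAMPLE_HTML = """
<html><body>
<div id="jw6_list"></div>
<script type="text/javascript">
Play('jw6', [{"title":"1. 391106 - Bruce-Partington Plans"}], {"start":0});
</script>
</body></html>
"""


class PageProbe(HTMLParser):
    """Counts div elements with class "ttl" and collects script tag contents."""

    def __init__(self):
        super().__init__()
        self.ttl_divs = 0
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "ttl" in classes:
            self.ttl_divs += 1
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_script:
            self.scripts[-1] += data


probe = PageProbe()
probe.feed(SAMPLE_HTML)
probe.close()

has_playlist = any("Play('jw6'" in s for s in probe.scripts)
print("div.ttl elements in static HTML:", probe.ttl_divs)
print("playlist found inside a script tag:", has_playlist)
```

If the real page behaves like this sample, no selector for div.ttl can match, and the only place to look is the script text.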
The idea is to use BeautifulSoup to locate the script tag, extract the list from the script with a regular expression, and then load it into a Python list with ast.literal_eval():
from ast import literal_eval
import re
import urllib2
from bs4 import BeautifulSoup

url = 'http://archive.org/details/HQSherlockRathboneTCS'
soup = BeautifulSoup(urllib2.urlopen(url))

# find the script tag whose text mentions the 'jw6' player
script = soup.find('script', text=lambda x: x and 'jw6' in x)
text = script.text.replace('\n', '')

# capture the first argument after 'jw6': the array of track dictionaries
pattern = re.compile(r"Play\('jw6', (.*?),\s+\{\"start")
playlist = literal_eval(pattern.search(text).group(1).strip())

for track in playlist:
    print track['title']
Prints:
1. 391106 - Bruce-Partington Plans
2. 400311 - The Retired Colourman
3. 440515 - Adventure Of The Missing Bloodstain
4. 450326 - The Book of Tobit
5. 450402 - The Amateur Mendicant Society
...
30. 460121 - Telltale Pigeon Feathers
31. 460128 - Sweeney Todd, Demon Barber
32. 460204 - The Cross of Damascus
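The printed titles still carry the playlist position ("1. ", "2. ", ...), while the question asked for titles starting with the six-digit code. The following is a minimal sketch of one way to drop that prefix. It is not the answer's method: it assumes Python 3, and it swaps the regular expression and ast.literal_eval() for json.JSONDecoder.raw_decode(), which also accepts JSON literals such as null and false. SAMPLE_SCRIPT is an abbreviated stand-in for the script text, and extract_playlist and strip_track_number are hypothetical helper names.

```python
import json

# Abbreviated stand-in for the script text; the real entries have more fields.
SAMPLE_SCRIPT = """
    Play('jw6',
      [{"title":"1. 391106 - Bruce-Partington Plans","duration":1764},
       {"title":"2. 400311 - The Retired Colourman","duration":1755}],
      {"start":0,"embed":null,"autoplay":false});
"""


def extract_playlist(script_text, player_id="jw6"):
    """Parse the first JSON array that follows Play('<player_id>', ...)."""
    call_start = script_text.index("Play('%s'" % player_id)
    array_start = script_text.index("[", call_start)
    # raw_decode parses one JSON value starting at array_start and
    # ignores whatever comes after it (the options object, here).
    playlist, _end = json.JSONDecoder().raw_decode(script_text, array_start)
    return playlist


def strip_track_number(title):
    """Turn '1. 391106 - ...' into '391106 - ...'; leave other titles alone."""
    head, sep, rest = title.partition(". ")
    return rest if sep and head.isdigit() else title


titles = [strip_track_number(t["title"]) for t in extract_playlist(SAMPLE_SCRIPT)]
for title in titles:
    print(title)
```

With the real page, script_text would be the text of the script tag found by BeautifulSoup, as in the answer above.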