Soup works on one IMDb page but not on another. How to solve?
import requests
from bs4 import BeautifulSoup

# Assumption: the question does not show `headers`; any dict of request headers works here
headers = {"Accept-Language": "en-US,en;q=0.5"}

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")
# using the unique tag for each movie in the respective link
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')
print(movie_div1)  # empty list
print(movie_div)  # gives the expected list
Why is movie_div1 an empty list? I can't see any difference in the URL structures that would indicate the code should be different. All leads appreciated.
Unfortunately, the div you want is generated by JavaScript, so you can't get it by scraping the raw HTML response.
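One way to confirm this is to check whether the class name appears at all in the HTML the server returns. The following is a minimal sketch using only the standard library; `ClassCounter` and `count_class` are hypothetical names, and the two sample strings are made-up stand-ins for the real responses.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many tags in `html` carry `class_name` (hypothetical helper)."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Stand-in for a server-rendered page: the div is present in the raw HTML.
static_html = '<div class="lister-item mode-advanced"><h3>Title</h3></div>'
# Stand-in for a JavaScript-rendered page: only a script tag, no div yet.
dynamic_html = '<script>IMDbReactInitialState.push({"titles": {}});</script>'

print(count_class(static_html, "lister-item"))
print(count_class(dynamic_html, "lister-item-content"))
```

With the real pages you would pass `results.text` and `results1.text` instead of the sample strings. A count of zero for the watchlist would suggest the div is not in the raw HTML and is added later in the browser.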
You can get the movies from the JSON your browser receives instead. That way you don't need to parse the page with BeautifulSoup at all, which makes your script much faster.
The second option is to use Selenium.
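A rough sketch of the Selenium route, assuming Selenium 4 and a local Chrome install. `fetch_rendered_html` is a hypothetical helper name, and the class name you wait for comes from the question, so it may not match what the page serves today.

```python
def fetch_rendered_html(url, css_class, timeout=10):
    """Load `url` in a real browser and return the HTML after JS has run.

    Hypothetical helper: assumes Selenium 4 and a local Chrome install.
    """
    # Imported here so the module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one element with the class exists, or time out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, css_class))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The returned string can be passed to `BeautifulSoup(html, "html.parser")` and searched with `find_all` exactly as in the question. Expect this to be noticeably slower than a plain request, since a full browser has to start and render the page.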
Good luck.
As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON is embedded within the HTML itself and is later converted to HTML by the browser's JavaScript (this is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display the movie/show names from the watchlist:
import json

import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# The span that holds the embedded watchlist data
details = str(soup.find('span', class_='ab_widget'))

# The JSON is passed as the argument of IMDbReactInitialState.push(...)
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"

# Drop everything before the JSON, then everything from the closing ");" onward
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])

# Print the primary title of each entry
imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])
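The slicing above depends on the JSON being followed by exactly `);` and a newline. A slightly more defensive variant, sketched here under the same assumption about the `IMDbReactInitialState.push(` marker, lets `json.JSONDecoder.raw_decode` find the end of the JSON value itself. `extract_initial_state` is a hypothetical helper name, and `sample` is made-up data standing in for `response.text`.

```python
import json

MARKER = "IMDbReactInitialState.push("


def extract_initial_state(html, marker=MARKER):
    """Return the JSON value that follows `marker` in `html`, or None.

    Hypothetical helper; raw_decode stops at the end of the first complete
    JSON value, so no closing delimiter has to be guessed.
    """
    start = html.find(marker)
    if start == -1:
        return None  # marker absent: the page differs from what we assume
    start += len(marker)
    # raw_decode does not skip leading whitespace, so do it here.
    while start < len(html) and html[start].isspace():
        start += 1
    data, _end = json.JSONDecoder().raw_decode(html, start)
    return data


# Made-up sample standing in for response.text
sample = (
    '<span class="ab_widget"><script>'
    'IMDbReactInitialState.push({"titles": {"tt001": '
    '{"primary": {"title": "Example Movie"}}}});\n'
    "</script></span>"
)

state = extract_initial_state(sample)
for item in state["titles"].values():
    print(item["primary"]["title"])
```

Returning `None` when the marker is missing makes it easier to tell a changed page layout apart from a JSON parsing error.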