
Soup works on one IMDb page but not on another. How to solve?

import requests
from bs4 import BeautifulSoup

# Assumed placeholder: the original post does not show how headers was defined
headers = {"Accept-Language": "en-US, en;q=0.5"}

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"

results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")

# Use the tag that is unique to each movie on the respective page
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')

print(movie_div1)
# prints an empty list
print(movie_div)
# prints the expected list

Why is movie_div1 giving an empty list? I can't identify any difference in the URL structures that would indicate the code should be different. All leads appreciated.

Unfortunately, the div you want is generated by JavaScript, so you can't get it by scraping the raw HTML response.
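One way to confirm this before reaching for other tools is to count how often the target class appears in the HTML that requests actually receives. The sketch below is only an illustration: the ClassCounter class, the count_in_static_html helper, and both sample strings are made up for this example and use only the standard library.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one type whose class attribute contains a given class name."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or empty, so fall back to ""
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_in_static_html(html, tag, class_name):
    """Return how many matching elements exist in the raw (unrendered) HTML."""
    parser = ClassCounter(tag, class_name)
    parser.feed(html)
    return parser.count


# Hypothetical samples: server-rendered markup vs. a placeholder filled in by JavaScript
static_page = '<div class="lister-item mode-advanced"><h3>Movie</h3></div>'
js_page = '<div id="root"></div><script>render()</script>'

print(count_in_static_html(static_page, "div", "lister-item"))
print(count_in_static_html(js_page, "div", "lister-item-content"))
```

If the count is zero for results1.text while the element is visible in the browser's inspector, the element is most likely being built client-side.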

You can get the movies you want from the JSON that your browser requests. That way you don't need to parse the page with BeautifulSoup at all, which makes your script much faster.

The second option is to use Selenium.

Good luck.

As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON is embedded in the HTML itself, and the browser's JavaScript later converts it to HTML (this is what you see as <div class="lister-item-content">...</div>).

For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:

import requests
from bs4 import BeautifulSoup
import json

url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# The embedded JSON lives inside the span with class "ab_widget"
details = str(soup.find('span', class_='ab_widget'))

# Text that immediately precedes and follows the JSON payload
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"

# Cut everything before the payload, then locate where it ends
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)

json_data = json.loads(details[:json_end])

# "titles" maps each title ID to its details
imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])
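If searching for the trailing ");\n" feels fragile, a possible variation is to let the standard library's json.JSONDecoder.raw_decode find the end of the payload itself. This is only a sketch: the extract_initial_state helper and the sample HTML (including the title ID and name) are invented for illustration, and it assumes the page still contains the "IMDbReactInitialState.push(" marker.

```python
import json

MARKER = "IMDbReactInitialState.push("


def extract_initial_state(html, marker=MARKER):
    """Return the first JSON value that follows `marker` in `html`, or None."""
    start = html.find(marker)
    if start == -1:
        return None  # marker missing: the page layout changed or the request was blocked
    start += len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand
    while start < len(html) and html[start].isspace():
        start += 1
    try:
        # Parses exactly one JSON value and ignores whatever comes after it
        data, _end = json.JSONDecoder().raw_decode(html, start)
    except json.JSONDecodeError:
        return None
    return data


# Made-up sample that mimics the structure used above; not real IMDb data
sample_html = (
    '<span class="ab_widget"><script>IMDbReactInitialState.push('
    '{"titles": {"tt0000001": {"primary": {"title": "Example Movie"}}}}'
    ');\n</script></span>'
)

state = extract_initial_state(sample_html)
for item in state["titles"].values():
    print(item["primary"]["title"])
```

With the real page you would pass response.text instead of sample_html and check for None before indexing into the result.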


