Missing data after scraping
I am trying to scrape data from Google for the IMDB top 250 movies.
movie_list = top_250_imdb["Title"]
base_url = 'https://www.google.com/search?q='
streaming = []
title = []
price = []
for movie in movie_list:
    query_url = f'{base_url}{movie}'
    browser.visit(query_url)
    time.sleep(5)
    soup = bs(browser.html, 'lxml')
    results1 = soup.find_all('div', class_='ellip bclEt')
    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())
    results2 = soup.find_all('div', class_='ellip rsj3fb')
    for result in results2:
        price.append(result.text)
After scraping, I get len(streaming) and len(title) = 1297, but len(price) = 1296.
I cannot create a DataFrame because the lists have different lengths.
What went wrong, and how can I fix it?
I think one of the values in price is NaN. I have an idea for how to fix it that might help:
Try creating a DataFrame from price alone, fill the NaN value with the fillna function, and then join that price DataFrame with your main DataFrame.
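A minimal sketch of that idea, assuming pandas is available and using made-up sample lists in place of the scraped ones. Wrapping each list in a pd.Series lets pandas pad the shorter column with NaN, which fillna can then replace. Note that this only pads at the end of the column: if the missing price belongs to a movie in the middle, every later price ends up one row too high.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped lists
streaming = ["Netflix", "Hulu", "Prime Video"]
title = ["Movie a", "Movie b", "Movie c"]
price = ["$3.99", "$2.99"]  # one entry short, as in the question

# Building the frame from Series aligns the columns on their index,
# so the shorter column is padded with NaN instead of raising an error
df = pd.DataFrame({
    "title": pd.Series(title),
    "streaming": pd.Series(streaming),
    "price": pd.Series(price),
})

# Replace the NaN left by the short list with an explicit placeholder
df["price"] = df["price"].fillna("unknown")
```

The placeholder "unknown" is arbitrary; 0 or None would work just as well, depending on how the column is used later.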
It is a bit long-winded, but it might work.
Only a small change is needed here to keep track of missing values:
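results2 = soup.find_all('div', class_='ellip rsj3fb')
for result in results2:
    # Keep the text when the element exists, otherwise store None (0 would also work)
    p = result.text if result is not None else None
    price.append(p)
# find_all() skips missing elements instead of yielding None, so pad with None
# when this movie has fewer prices than streaming results
price.extend([None] * (len(results1) - len(results2)))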
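Now you can check the length of the price list, and the missing entries will be counted as well.
You can add the same handling when appending to the streaming and title lists, so that a missing value is replaced with the placeholder instead of being skipped.
Just replace your code with the code below, and watch the indentation.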
movie_list = top_250_imdb["Title"]
base_url = 'https://www.google.com/search?q='
streaming = []
title = []
price = []
for movie in movie_list:
    query_url = f'{base_url}{movie}'
    browser.visit(query_url)
    time.sleep(5)
    soup = bs(browser.html, 'lxml')
    results1 = soup.find_all('div', class_='ellip bclEt')
    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())
    results2 = soup.find_all('div', class_='ellip rsj3fb')
    for result in results2:
        # Keep the text when the element exists, otherwise store None (0 would also work)
        p = result.text if result is not None else None
        price.append(p)
    # find_all() skips missing elements instead of yielding None, so pad with None
    # when this movie has fewer prices than streaming results
    price.extend([None] * (len(results1) - len(results2)))
Hope this helps.
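As an alternative sketch, the three parallel lists can be replaced by one record per row, so the columns cannot drift apart. The helper name pair_results and the sample data are hypothetical, and pairing providers with prices by position is an assumption about how the page lays them out.

```python
from itertools import zip_longest


def pair_results(movie, providers, prices):
    """Return one row per provider/price pair, padding the shorter side with None."""
    rows = []
    for provider, cost in zip_longest(providers, prices, fillvalue=None):
        rows.append({
            "title": movie.capitalize(),
            "streaming": provider,
            "price": cost,
        })
    return rows


# Hypothetical per-movie results standing in for the text of results1 and results2
scraped = {
    "the godfather": (["Netflix", "Hulu"], ["$3.99"]),
    "heat": (["Prime Video"], ["$2.99"]),
}

rows = []
for movie, (providers, prices) in scraped.items():
    rows.extend(pair_results(movie, providers, prices))
```

Because every row already carries its own title, streaming, and price values, passing rows to pd.DataFrame should give a frame with no length mismatch, and the None entries show exactly which movie is missing a price.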