Data lost after scraping

Question

I am trying to scrape data from Google search results for the IMDB top 250 rated movies.

movie_list = top_250_imdb["Title"]

base_url = 'https://www.google.com/search?q='

streaming = []
title = []
price = []

for movie in movie_list:
    query_url = (f'{base_url}{movie}')

    browser.visit(query_url)

    time.sleep(5)

    soup = bs(browser.html, 'lxml')


    results1 = soup.find_all('div', class_ = 'ellip bclEt')

    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        price.append(result.text)

After scraping, I get len(streaming) and len(title) = 1297, but len(price) = 1296.

I can't create a DataFrame because the lists have different lengths.

What went wrong, and how can I fix it?

Answer 1

I think one of the values in price is NaN. Here is an approach that may help.

Try creating a DataFrame with only the price column, fill the NaN value using the fillna function, and then join that price DataFrame with your main DataFrame.

It's a bit long-winded, but it may work.
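A minimal sketch of that idea, assuming pandas and using made-up sample values (the three short lists and the "unknown" placeholder are not from the question). Joining on the index pads the shorter price column with NaN, which fillna then replaces:

```python
import pandas as pd

# Made-up sample data: price is one entry shorter than the other lists.
title = ["Movie a", "Movie b", "Movie c"]
streaming = ["Netflix", "YouTube", "Vudu"]
price = ["$3.99", "$2.99"]

main_df = pd.DataFrame({"title": title, "streaming": streaming})
price_df = pd.DataFrame({"price": price})

# join() aligns on the index; rows with no matching price get NaN.
combined = main_df.join(price_df)

# Replace the NaN with a placeholder of your choice.
combined["price"] = combined["price"].fillna("unknown")
```

Note that this pads the missing value at the end of the column. If the price that is actually missing belongs to a row in the middle, every price after it will be shifted up by one row, so this only makes the lengths match; it does not guarantee the prices line up with the right titles.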

Answer 2

Only a small change is needed here to keep track of missing values.

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        if result is None:
            p = None  # you could also use 0; None is used here for convenience
        else:
            p = result.text
        price.append(p)

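Now you can check the len of the list named price; missing entries will be counted as well.

You can add the same check when appending to the streaming and title lists, so that a missing value is replaced with the placeholder instead of being skipped.

Just replace your code with the code below, and watch the indentation.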

movie_list = top_250_imdb["Title"]

base_url = 'https://www.google.com/search?q='

streaming = []
title = []
price = []

for movie in movie_list:
    query_url = (f'{base_url}{movie}')

    browser.visit(query_url)

    time.sleep(5)

    soup = bs(browser.html, 'lxml')


    results1 = soup.find_all('div', class_ = 'ellip bclEt')

    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        if result is None:
            p = None  # you could also use 0; None is used here for convenience
        else:
            p = result.text
        price.append(p)
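One caveat: find_all only returns elements that are actually present in the page, so a price div that is missing never shows up in results2 as None. A more robust variant, sketched below, loops over each result row and looks up both fields inside it with find, which does return None when nothing matches. The "offer-row" container class and the sample HTML are hypothetical; Google's real markup will differ and has to be inspected in the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each offer sits in its own "offer-row" container,
# and the second row has no price div.
html = """
<div class="offer-row"><div class="ellip bclEt">Netflix</div><div class="ellip rsj3fb">Subscription</div></div>
<div class="offer-row"><div class="ellip bclEt">YouTube</div></div>
<div class="offer-row"><div class="ellip bclEt">Vudu</div><div class="ellip rsj3fb">$3.99</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

streaming = []
price = []

for row in soup.find_all("div", class_="offer-row"):
    provider = row.find("div", class_="ellip bclEt")
    cost = row.find("div", class_="ellip rsj3fb")
    # find() returns None when the element is absent, so both lists
    # always grow by exactly one entry per row.
    streaming.append(provider.text if provider is not None else None)
    price.append(cost.text if cost is not None else None)
```

Because every row contributes one entry to each list, the lists stay the same length and each price stays next to the provider it belongs to.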

Hope this helps.

2 answers

Answer 1
0 2020-08-22 04:19:29

Answer 2
0 2020-08-22 09:48:47
