Missing data after scraping

I am trying to scrape Google search results for the movies in the IMDb top 250 ratings list.

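import time
from bs4 import BeautifulSoup as bs

# Assumes `browser` is an already-open browser driver exposing .visit() and .html,
# and `top_250_imdb` is a DataFrame with a "Title" column.
movie_list = top_250_imdb["Title"]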

base_url = 'https://www.google.com/search?q='

streaming = []
title = []
price = []

for movie in movie_list:
    query_url = (f'{base_url}{movie}')

    browser.visit(query_url)

    time.sleep(5)

    soup = bs(browser.html, 'lxml')


    results1 = soup.find_all('div', class_ = 'ellip bclEt')

    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        price.append(result.text)

After scraping, len(streaming) and len(title) are both 1297, but len(price) is 1296.

I couldn't create a DataFrame because the lists are not the same length.

What went wrong and how do I fix it?

I think one of the values in price is NaN. I don't know exactly how to solve it, but this might help:

Try creating a DataFrame with price only, fill the NaN value using the fillna function, and then join that price DataFrame with your main DataFrame.

It's a bit long, but it might work.
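
The suggestion above could be sketched roughly as follows. This is a minimal illustration using made-up sample values rather than the scraped data, and the column names and the "unknown" placeholder are assumptions. It relies on pandas aligning Series of different lengths by index and filling the gap with NaN.

```python
import pandas as pd

# Made-up sample values standing in for the scraped lists (assumption).
title = ["Movie a", "Movie a", "Movie b"]
streaming = ["Service X", "Service Y", "Service X"]
price = ["$3.99", "$2.99"]  # one entry short, like the 1297 vs 1296 case

# Wrapping each list in a Series lets pandas align the values by position.
main_df = pd.DataFrame({"title": pd.Series(title), "streaming": pd.Series(streaming)})
price_df = pd.DataFrame({"price": pd.Series(price)})

# join() aligns on the index, so the row with no price gets NaN,
# which fillna() then replaces with a placeholder.
combined = main_df.join(price_df)
combined["price"] = combined["price"].fillna("unknown")

print(combined)
```

This makes the DataFrame constructible, but the padding always lands on the last row. If the missing price actually belongs to an earlier entry, every price after it ends up shifted up by one row.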

Just a small change is needed here to keep track of the missing values:

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        if result is None:
            p = None  # placeholder for a missing price; 0 would also work
        else:
            p = result.text
        price.append(p)

Now you can check the length of the list named "price"; the missing entries will be counted as well.

You can add the same check when appending values to the "streaming" and "title" lists, so that a missing value is replaced with the placeholder instead of being skipped.

Just replace the code above with the code below, watch the indentation, and it works fine.

movie_list = top_250_imdb["Title"]

base_url = 'https://www.google.com/search?q='

streaming = []
title = []
price = []

for movie in movie_list:
    query_url = (f'{base_url}{movie}')

    browser.visit(query_url)

    time.sleep(5)

    soup = bs(browser.html, 'lxml')


    results1 = soup.find_all('div', class_ = 'ellip bclEt')

    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        if result is None:
            p = None  # placeholder for a missing price; 0 would also work
        else:
            p = result.text
        price.append(p)
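
One caveat about the check above: find_all only returns tags that were actually found, so result is never None inside that loop, and a missing price simply makes results2 one element shorter. A possible way to keep the lists aligned is to look up the price relative to each streaming entry instead. The sketch below uses synthetic HTML, not Google's real markup. The 'row' wrapper class and the idea that both divs share a parent element are assumptions that would need checking against the actual page.

```python
from bs4 import BeautifulSoup

# Synthetic HTML standing in for one results page (assumption). The class
# names 'ellip bclEt' and 'ellip rsj3fb' come from the question; the
# wrapping 'row' class is hypothetical.
html = """
<div class="row"><div class="ellip bclEt">Service X</div><div class="ellip rsj3fb">$3.99</div></div>
<div class="row"><div class="ellip bclEt">Service Y</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

streaming, title, price = [], [], []
movie = "some movie"  # placeholder title for the sketch

for name_tag in soup.find_all("div", class_="ellip bclEt"):
    # Search for the price inside the same parent element as the service
    # name. find() returns None when nothing matches, unlike find_all().
    price_tag = name_tag.parent.find("div", class_="ellip rsj3fb")

    # Append to all three lists in the same iteration so they stay aligned.
    streaming.append(name_tag.text)
    title.append(movie.capitalize())
    price.append(price_tag.text if price_tag is not None else None)

print(list(zip(title, streaming, price)))
```

Because every iteration appends exactly one value to each list, the three lists always have the same length, and the None lands on the row whose price was missing.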

Hope this helps.
