
Missing data after scraping

I am trying to scrape Google search results for each of the IMDb top 250 movies.

movie_list = top_250_imdb["Title"]

base_url = 'https://www.google.com/search?q='

streaming = []
title = []
price = []

for movie in movie_list:
    query_url = (f'{base_url}{movie}')

    browser.visit(query_url)

    time.sleep(5)

    soup = bs(browser.html, 'lxml')


    results1 = soup.find_all('div', class_ = 'ellip bclEt')

    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        price.append(result.text)

After scraping, len(streaming) and len(title) are both 1297, but len(price) is 1296.

I can't create a DataFrame because the lists are not the same length.

What went wrong and how do I fix it?

I think one of the values in price is missing (it would show up as NaN in a DataFrame). I don't know exactly how to solve it, but this might help.

Try creating a DataFrame with price only. Then fill the NaN value using the fillna function, and join that price DataFrame with your main DataFrame.

It's a bit roundabout, but it might work.
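Here is a minimal sketch of that idea, assuming pandas is available. The sample lists and the "unknown" fill value are made up for illustration; they stand in for the scraped data.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped lists.
streaming = ["Netflix", "Hulu", "Prime Video"]
title = ["Movie a", "Movie a", "Movie b"]
price = ["$3.99", "$2.99"]  # one entry short, like in the question

main_df = pd.DataFrame({"title": title, "streaming": streaming})
price_df = pd.DataFrame({"price": price})

# join() aligns on the row index, so the row without a price gets NaN.
combined = main_df.join(price_df)

# Replace the NaN with a placeholder of your choice.
combined["price"] = combined["price"].fillna("unknown")
```

Keep in mind that this only pads the end of the price column. If the missing price belongs to a movie in the middle of the list, every price after it ends up one row too high, so it is worth finding out which movie is missing a price before relying on the joined result.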

Only a small change is needed to keep track of the missing values:

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        if result is not None:
            p = result.text
        else:
            p = None  # you could use 0 instead; None is used here for convenience
        price.append(p)

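Now you can check the length of the price list. Note that find_all only returns tags that were actually found, so if a price div is absent from the page entirely, nothing is appended for it and the lists can still differ in length.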
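To see where the counts diverge, one option is to record the movie title next to every price as well and then compare the per-title counts. This is only a sketch: find_mismatches and the price_title list are hypothetical names, not part of the original code.

```python
from collections import Counter

def find_mismatches(streaming_titles, price_titles):
    """Return {title: (n_streaming, n_price)} for titles whose counts differ."""
    s_counts = Counter(streaming_titles)
    p_counts = Counter(price_titles)
    all_titles = s_counts.keys() | p_counts.keys()
    return {
        t: (s_counts[t], p_counts[t])
        for t in all_titles
        if s_counts[t] != p_counts[t]
    }

# Hypothetical sample: "title" is filled in the streaming loop,
# "price_title" would be filled the same way in the price loop.
title = ["Movie a", "Movie a", "Movie b"]
price_title = ["Movie a", "Movie a"]

mismatches = find_mismatches(title, price_title)
```

In this sample, mismatches contains only "Movie b", with one streaming entry and zero prices.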

You can add the same None check when appending to the "streaming" and "title" lists, so that a missing value is replaced with the fallback value instead of being skipped.

Just replace your original code with the code below, and keep an eye on the indentation.

movie_list = top_250_imdb["Title"]

base_url = 'https://www.google.com/search?q='

streaming = []
title = []
price = []

for movie in movie_list:
    query_url = (f'{base_url}{movie}')

    browser.visit(query_url)

    time.sleep(5)

    soup = bs(browser.html, 'lxml')


    results1 = soup.find_all('div', class_ = 'ellip bclEt')

    for result in results1:
        streaming.append(result.text)
        title.append(movie.capitalize())

    results2 = soup.find_all('div', class_ = 'ellip rsj3fb')

    for result in results2:
        if result is not None:
            p = result.text
        else:
            p = None  # you could use 0 instead; None is used here for convenience
        price.append(p)
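If a price div is missing from the page altogether, the loop above never sees it, so it may be safer to pad the prices for each movie up to the number of streaming entries found for that movie. A rough sketch, with pad_prices as a hypothetical helper and made-up sample data in place of the text taken from results1 and results2:

```python
def pad_prices(streaming_texts, price_texts, fill=None):
    """Return price_texts extended with `fill` to the length of streaming_texts."""
    missing = len(streaming_texts) - len(price_texts)
    return list(price_texts) + [fill] * max(missing, 0)

streaming, title, price = [], [], []

# Hypothetical per-movie results: (streaming texts, price texts).
scraped = {
    "movie a": (["Netflix", "Hulu"], ["$3.99", "$2.99"]),
    "movie b": (["Prime Video"], []),  # no price div on this page
}

for movie, (streaming_texts, price_texts) in scraped.items():
    streaming.extend(streaming_texts)
    title.extend([movie.capitalize()] * len(streaming_texts))
    # Pad so each movie contributes the same number of rows to every list.
    price.extend(pad_prices(streaming_texts, price_texts))
```

This keeps the three lists the same length, but the padding always goes at the end of a movie's prices, so it cannot tell which streaming service lacks a price. Pairing each price with its service would need the container element that wraps both, which depends on the page markup.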

Hope this helps..
