
Convert a dictionary containing lists of differing lengths to a pandas DataFrame

I'm retrieving JSON about a movie by calling the OMDB API. I'm trying to add that JSON to another dictionary, which holds information scraped from here.

The dict that holds the scraped information has the structure:

{
         'movie_title': [],
         'review_text': [],
         'review_url': [],
         'reviewed_by': [],
         'score': []
}

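I'm dynamically adding keys to the dictionary, with empty lists as values, by looping over the response from the OMDB API, like so:

import json
import requests

api_key = ''  # your OMDB API key
# Concatenate the key into the URL instead of embedding the literal text 'api_key'.
omdb_data = requests.get('http://www.omdbapi.com/?apikey=' + api_key + '&t=Basmati+Blues&plot=full')
omdb_json = json.loads(omdb_data.content)
# Start every OMDB field with an empty list.
for curr_key in omdb_json.keys():
    movie_review_dict[curr_key] = []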
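The query string in the request above can also be assembled from a dict rather than by string concatenation. This is a minimal sketch using the standard library's urllib.parse.urlencode; the key value 'YOUR_KEY' is a placeholder assumption, not a real key, and no request is actually sent.

```python
from urllib.parse import urlencode

# Placeholder value (assumption); substitute a real OMDB API key.
api_key = 'YOUR_KEY'

# urlencode escapes each value and joins the pairs with '&';
# a space in a value is encoded as '+'.
query = urlencode({'apikey': api_key, 't': 'Basmati Blues', 'plot': 'full'})
omdb_url = 'http://www.omdbapi.com/?' + query
```

Building the URL this way avoids the mistake of leaving the variable name inside the string literal.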

The dict now has the structure:

{
     u'Actors': [],
     u'Awards': [],
     u'BoxOffice': [],
     u'Country': [],
     u'DVD': [],
     u'Director': [],
     u'Genre': [],
     u'Language': [],
     u'Metascore': [],
     u'Plot': [],
     u'Poster': [],
     u'Production': [],
     u'Rated': [],
     u'Ratings': [],
     u'Released': [],
     u'Response': [],
     u'Runtime': [],
     u'Title': [],
     u'Type': [],
     u'Website': [],
     u'Writer': [],
     u'Year': [],
     u'imdbID': [],
     u'imdbRating': [],
     u'imdbVotes': [],
     'movie_title': [],
     'review_text': [],
     'review_url': [],
     'reviewed_by': [],
     'score': []
}

I have a function that reads this URL, parses it with BeautifulSoup, and adds elements to the dict. It also adds data from the OMDB response at the same time.

def read_html_page(home_page='http://www.rogerebert.com/reviews'):
    movie_details = movie_review_dict
    result = requests.get(url=home_page)
    soup_obj = BeautifulSoup(result.content, 'html5lib')
    wrapper_class = soup_obj.find('div', id='review-list')
    for curr_movie_dom in wrapper_class.find_all('figure'):
        movie_title = curr_movie_dom.find('h5', class_='title').a.get_text()
        movie_critic = curr_movie_dom.find('p', class_='byline').get_text().strip()  
        omdb_dict = get_omdb_data(movie_title=movie_title)
        for curr_key in omdb_dict.keys():
            if curr_key in movie_details:
                movie_details[curr_key].append(omdb_dict[curr_key])
            else:
                movie_details[curr_key] = []
                movie_details[curr_key].append(omdb_dict[curr_key])
    return movie_details

I'm trying to store the dict into a pandas DataFrame, but I'm getting the error

ValueError('arrays must all be same length')

That's because some attributes from the OMDB response, like 'Language' and 'Website', exist for some movies and not for others, so the lists under those keys end up shorter than the rest.
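The cause can be reproduced on a tiny dict. The sketch below uses made-up titles and websites (assumptions, not real OMDB data). It shows the constructor rejecting lists of unequal length, and then one way around it: wrapping each list in a pd.Series so that shorter columns are padded with NaN.

```python
import pandas as pd

# Hypothetical miniature of the dict: 'Website' was missing for one movie,
# so its list is one element shorter than 'Title'.
sample = {'Title': ['A', 'B', 'C'], 'Website': ['a.example', 'c.example']}

try:
    pd.DataFrame(sample)
    raised = False
except ValueError:
    # pandas refuses to build columns of different lengths from plain lists.
    raised = True

# Each Series gets its own 0..n-1 index; the DataFrame takes the union of
# those indexes and fills the gaps with NaN.
padded = pd.DataFrame({key: pd.Series(values) for key, values in sample.items()})
```

Note that the padding happens at the end of each short column, so 'c.example' lands on the row for movie 'B' here. The values stay aligned with their movies only if a placeholder such as None is appended whenever a key is missing.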

I've tried

movie_df = pd.DataFrame(movie_review_dict)
movie_df = pd.DataFrame.from_dict(movie_details)

Both raise the same error.

You can try appending to an empty DataFrame using pandas.DataFrame.append. Appending a dict requires ignore_index=True:

df = pd.DataFrame()
df = df.append(movie_review_dict, ignore_index=True)
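DataFrame.append was removed in pandas 2.0, so on newer versions a different route is needed. One alternative, sketched here with made-up responses standing in for the output of get_omdb_data, is to collect one dict per movie and pass the list of dicts to the DataFrame constructor, which fills keys missing from a record with NaN.

```python
import pandas as pd

# Hypothetical OMDB-style responses (assumption); the second has no 'Website' key.
responses = [
    {'Title': 'Movie A', 'Language': 'English', 'Website': 'http://a.example'},
    {'Title': 'Movie B', 'Language': 'Hindi'},
]

# One record (dict) per movie instead of one list per attribute.
records = []
for omdb_dict in responses:
    records.append(dict(omdb_dict))

# Columns are the union of all keys; absent keys become NaN in that row.
movie_df = pd.DataFrame(records)
```

Because every row comes from a single movie's dict, each value stays on the row it belongs to, and the lists never need to be the same length.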

