
Transforming/reshaping data for analysis - Python/pandas

I'm on a mission where I have to extract information from the web (IMDB-Oscar-winning movies) and then analyze the data.

I use Python libraries (Requests, pandas) in Jupyter.

So far, I have pulled the data from the site, and it is stored as a list within a list: one inner list per movie.

My question is: how do I reshape the data so that it is easier to analyze? I would be happy to get the data into a tabular structure, but the lists do not all have the same length.

You can transform each inner list into a dictionary and then use a document database such as MongoDB, or just store the result as JSON for further analysis.

# Each inner list holds the (key, value) pairs that describe one movie.
myList = [[('Name', 'Moonlight'), ('Genres', ['Drama']),
           ('Writers', ['Barry Jenkins', 'Tarell Alvin McCraney']),
           ('Actors', ['Mahershala Ali', 'Shariff Earp', 'Duan Sanderson']),
           ('Directors', ['Barry Jenkins']), ('Duration', '1h 51min')]]
# Build one dictionary per movie; flattening every pair into a single
# dictionary would let later movies overwrite the keys of earlier ones.
movies = [dict(item) for item in myList]
print(movies[0])
# Output (Python 3.7+ keeps insertion order):
# {'Name': 'Moonlight', 'Genres': ['Drama'], 'Writers': ['Barry Jenkins', 'Tarell Alvin McCraney'], 'Actors': ['Mahershala Ali', 'Shariff Earp', 'Duan Sanderson'], 'Directors': ['Barry Jenkins'], 'Duration': '1h 51min'}
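
The "store as JSON" option could look roughly like the sketch below. It is only an illustration: the variable names and the second, made-up record (included to show that movies with missing keys are not a problem for this format) are assumptions, not part of the scraped data.

```python
import json

# Hypothetical sample in the same shape as the scraped data:
# one list of (key, value) pairs per movie.
scraped = [
    [('Name', 'Moonlight'), ('Genres', ['Drama']),
     ('Directors', ['Barry Jenkins']), ('Duration', '1h 51min')],
    # Made-up placeholder record with fewer keys than the first one.
    [('Name', 'Example Movie'), ('Genres', ['Drama', 'Romance'])],
]

# One dictionary per movie; records may have different sets of keys.
records = [dict(pairs) for pairs in scraped]

# Serialize to a JSON string (json.dump would write to an open file instead).
text = json.dumps(records, indent=2)

# Reading it back gives the same list of dictionaries.
restored = json.loads(text)
print(restored[0]['Name'])
```

Because each record is a self-describing dictionary, lists of unequal length need no padding here, which is the main reason a document-style format suits this data.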

If you want your data to be tabular, think of it as a set of two-dimensional tables, as in an RDBMS with primary/foreign key relationships, because storing lists inside DataFrame cells does not work well in pandas: filtering, grouping, and counting generally expect one value per cell.

movie (mov_id*, name, duration)
directors (mov_id*, director_name)
writers (mov_id*, writer_name)
actors (mov_id*, actor_name)

You'll have four DataFrames from this schema (some table optimization might yield fewer tables), on which you can perform relational operations such as joins, filters, and aggregations with pandas to get the work done.
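
As a rough sketch of how the four tables above might be built, assuming the data has already been converted to one dictionary per movie: the helper `child_table` and the choice of a running integer as `mov_id` are illustrative assumptions, not something the scraped data provides.

```python
import pandas as pd

# Assumed input: one dictionary per movie, as produced from the scraped pairs.
records = [
    {'Name': 'Moonlight', 'Duration': '1h 51min',
     'Writers': ['Barry Jenkins', 'Tarell Alvin McCraney'],
     'Actors': ['Mahershala Ali', 'Shariff Earp', 'Duan Sanderson'],
     'Directors': ['Barry Jenkins']},
]

def child_table(records, key, column):
    """Hypothetical helper: one row per (movie, value) pair for a list-valued field."""
    rows = [{'mov_id': mov_id, column: value}
            for mov_id, rec in enumerate(records, start=1)
            for value in rec.get(key, [])]  # a missing key contributes no rows
    return pd.DataFrame(rows, columns=['mov_id', column])

# Parent table: one row per movie, with a generated integer key.
movie = pd.DataFrame(
    [{'mov_id': mov_id, 'name': rec.get('Name'), 'duration': rec.get('Duration')}
     for mov_id, rec in enumerate(records, start=1)]
)
directors = child_table(records, 'Directors', 'director_name')
writers = child_table(records, 'Writers', 'writer_name')
actors = child_table(records, 'Actors', 'actor_name')

# Example relational operation: join actors back to the movie names.
actors_with_names = actors.merge(movie[['mov_id', 'name']], on='mov_id')
print(actors_with_names)
```

Since every child table has exactly two columns, lists of different lengths simply become different numbers of rows, and questions such as "how many actors per movie" reduce to a `groupby('mov_id')` on the relevant table.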


 