
Extracting value for one dictionary key in Pandas based on another in the same dictionary

This is from an R guy.

I have this mess in a Pandas column, data['crew']:

array(["[{'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production', 'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor', 'profile_path': None}, {'credit_id': '56407fa89251417055000b58', 'department': 'Sound', 'gender': 0, 'id': 6745, 'job': 'Music Editor', 'name': 'Richard Henderson', 'profile_path': None}, {'credit_id': '5789212392514135d60025fd', 'department': 'Production', 'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production', 'name': 'Jeffrey Stott', 'profile_path': None}, {'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up', 'gender': 0, 'id': 23783, 'job': 'Makeup Artist', 'name': 'Heather Plott', 'profile_path': None}

It goes on for quite some time. Each new dict starts with a credit_id field. One cell can hold several dicts in an array.

Assume I want the names of all Casting directors, as shown in the first entry. I need to check the job entry in every dict and, if it's Casting, grab what's in the name field and store it in my data frame in data['crew'].

I tried several strategies, then backed off and went for something simple. Running the following shut me down, so I can't even access a simple field. How can I get this done in Pandas?

for row in data.head().iterrows():
    if row['crew'].job == 'Casting':
        print(row['crew'])

EDIT: Error Message

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-138-aa6183fdf7ac> in <module>()
      1 for row in data.head().iterrows():
----> 2     if row['crew'].job == 'Casting':
      3         print(row['crew'])

TypeError: tuple indices must be integers or slices, not str

EDIT: Code used to get the array of dict (strings?) in the first place.

import ast

def convert_JSON(data_as_string):
    try:
        dict_representation = ast.literal_eval(data_as_string)
        return dict_representation
    except ValueError:
        return []

data["crew"] = data["crew"].map(lambda x: sorted([d['name'] if d['job'] == 'Casting' else '' for d in convert_JSON(x)])).map(lambda x: ','.join(map(str, x)))

To create a DataFrame from your sample data, write:

import pandas as pd

df = pd.DataFrame(data=[
  { 'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production',
    'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor',
    'profile_path': None},
  { 'credit_id': '56407fa89251417055000b58', 'department': 'Sound',
    'gender': 0, 'id': 6745, 'job': 'Music Editor',
    'name': 'Richard Henderson', 'profile_path': None},
  { 'credit_id': '5789212392514135d60025fd', 'department': 'Production',
    'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production',
    'name': 'Jeffrey Stott', 'profile_path': None},
  { 'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up',
    'gender': 0, 'id': 23783, 'job': 'Makeup Artist',
    'name': 'Heather Plott', 'profile_path': None}])

Then you can get your data with a single instruction:

df[df.job == 'Casting'].name

The result is:

0    Terri Taylor
Name: name, dtype: object

The result above is a Pandas Series containing the names found. Here 0 is the index value of the matching record and Terri Taylor is the name of the only Casting director in your sample data.

Edit

If you want just a list (not a Series), write:

df[df.job == 'Casting'].name.tolist()

The result is ['Terri Taylor'] - just a list.

I think both of my solutions should be quicker than an "ordinary" loop based on iterrows().

If execution time matters, you can also try yet another solution:

df.query("job == 'Casting'").name.tolist()
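The DataFrame above represents the contents of a single cell. In the original data['crew'] column each cell appears to be a string holding a list of dicts, so the same filter has to be applied cell by cell. The following is only a minimal sketch, assuming every usable cell parses with ast.literal_eval; the helper name casting_names, the new column name casting and the two-row sample frame are made up for illustration.

```python
import ast

import pandas as pd


def casting_names(cell, job="Casting"):
    """Return the names of crew members with the given job in one cell.

    `cell` is assumed to be a string such as "[{'job': 'Casting', ...}, ...]".
    Cells that cannot be parsed, or that do not hold a list, yield [].
    """
    try:
        crew = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(crew, list):
        return []
    return [d["name"] for d in crew if d.get("job") == job]


# Hypothetical sample: the second row is deliberately not parseable.
data = pd.DataFrame({
    "crew": [
        "[{'id': 494, 'job': 'Casting', 'name': 'Terri Taylor'}, "
        "{'id': 6745, 'job': 'Music Editor', 'name': 'Richard Henderson'}]",
        "not a list",
    ]
})

# One list of names per row, written to a new column so the raw data survives.
data["casting"] = data["crew"].map(casting_names)
print(data["casting"].tolist())  # [['Terri Taylor'], []]
```

If a comma-separated string per row is preferred over a list, ', '.join(...) can be mapped over the new column afterwards.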

==========

And as far as your code is concerned:

iterrows() yields, for each row, a pair containing:

  • the index value (key) of the current row,
  • a Series holding the content of this row.

So your loop should look something like:

for row in df.iterrows():
    if row[1].job == 'Casting':
        print(row[1]['name'])

You cannot write row[1].name because that refers to the index value of the row: every Series has a name attribute, and for a row produced by iterrows() it holds the index label, which shadows the column called name.
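A small sketch of that collision, using a two-row frame made up for illustration. Unpacking the pair that iterrows() yields into an index and a row also tends to read more clearly than indexing with [1].

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as above.
df = pd.DataFrame([
    {"job": "Casting", "name": "Terri Taylor"},
    {"job": "Music Editor", "name": "Richard Henderson"},
])

found = []
for idx, row in df.iterrows():
    # row.name is the index label of the row, not the 'name' column ...
    assert row.name == idx
    # ... so the column has to be read with bracket syntax.
    if row["job"] == "Casting":
        found.append(row["name"])

print(found)  # ['Terri Taylor']
```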
