
Python: Check if dataframe cell value exists in Dictionary. If exists replace dataframe value with dictionary key

I have a CSV file (or dataframe) like the one below:

Text    Location        State
A       Florida, USA    Florida
B       NY              New York
C
D       abc

And a dictionary of key-value pairs:

stat_map = {
        'FL': 'Florida',
        'NY': 'NewYork',
        'AR': 'Arkansas',
}

How can I delete the 3rd and 4th rows (the rows with Text C and D), so that my dataframe contains only the rows whose value appears in the dictionary? The final output should look like this:

Text    Location        State
A       Florida, USA    Florida
B       NY              New York

Please help.

What you're looking for is pandas.Series.map(), which replaces each value with the one provided by a mapper, here states_map.
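As a minimal sketch of how map behaves with a dictionary: keys found in the dict are replaced by their values, and anything else becomes NaN. The names abbrev_to_name and codes are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-entry mapper, used only for this illustration.
abbrev_to_name = {'CA': 'California', 'NY': 'New York'}

codes = pd.Series(['CA', 'NY', 'USA'])

# Values that are keys of the dict are translated;
# values with no matching key ('USA' here) become NaN.
mapped = codes.map(abbrev_to_name)
print(mapped)
```

This NaN-on-miss behaviour is what makes it possible to tell matched rows from unmatched ones afterwards.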

I will reuse the data from your previous question for illustration:

import pandas as pd

states_map = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'CA': 'California',  # Enrich the dict for the current example
        'NY': 'New York'     # Same as above
}

>>> df
Out[]:
               State
0    California, USA
1  Beverly Hills, CA
2         California
3                 CA
4            NY, USA
5                USA
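The frame shown above is not constructed anywhere in the answer. One way to rebuild it, assuming a single string column named State, would be:

```python
import pandas as pd

# Rebuild the example frame from the printed output above.
# Assumption: one column named 'State' holding plain strings.
df = pd.DataFrame({
    'State': [
        'California, USA',
        'Beverly Hills, CA',
        'California',
        'CA',
        'NY, USA',
        'USA',
    ]
})
print(df)
```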

Using map as discussed gives:

states = df['State'].str.split(', ').str[0]

>>> states
Out[]:
0       California
1    Beverly Hills
2       California
3               CA
4               NY
5              USA
Name: State, dtype: object

>>> states.map(states_map)
Out[]:
0           NaN
1           NaN
2           NaN
3    California
4      New York
5           NaN
Name: State, dtype: object

But this is not optimal, as you lose information from row 1 with the split and from rows 0 and 2 with the map.

I think it can be done better like this:

Get all terms from split using expand=True

df_parts = df.State.str.split(', ', expand=True)

>>> df_parts
Out[]:
               0     1
0     California   USA
1  Beverly Hills    CA
2     California  None
3             CA  None
4             NY   USA
5            USA  None

Find the cells that already hold a full state name

mask = df_parts.isin(states_map.values())

>>> df_parts[mask]
Out[]:
            0    1
0  California  NaN
1         NaN  NaN
2  California  NaN
3         NaN  NaN
4         NaN  NaN
5         NaN  NaN

Using ~ (bitwise NOT) gives us the inverse of the mask.

df_unknown = df_parts[~mask]

>>> df_unknown
Out[]:
               0     1
0            NaN   USA
1  Beverly Hills    CA
2            NaN  None
3             CA  None
4             NY   USA
5            USA  None
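The two masking steps above can be tried on a smaller frame. This is only a sketch: parts and known_names are hypothetical stand-ins for df_parts and the dictionary values.

```python
import pandas as pd

# Hypothetical miniature version of df_parts.
parts = pd.DataFrame({0: ['California', 'CA', 'USA'],
                      1: ['USA', None, None]})

# Hypothetical list of full names, standing in for states_map.values().
known_names = ['California', 'New York']

mask = parts.isin(known_names)  # True where a cell is already a full name
kept = parts[mask]              # cells outside the mask become NaN
rest = parts[~mask]             # complement: cells inside the mask become NaN
print(mask)
```

Indexing a DataFrame with a boolean DataFrame keeps the shape and blanks out the non-selected cells, which is why the two halves can be recombined later.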

Use map where state is not known

>>> df_unknown.apply(lambda col: col.map(states_map))
Out[]:
            0           1
0         NaN         NaN
1         NaN  California
2         NaN         NaN
3  California         NaN
4    New York         NaN
5         NaN         NaN

And set these values in masked df_parts

df_parts[~mask] = df_unknown.apply(lambda col: col.map(states_map))

>>> df_parts
Out[]:
            0           1
0  California         NaN
1         NaN  California
2  California         NaN
3  California         NaN
4    New York         NaN
5         NaN         NaN

Reunify values

>>> df_parts[0].fillna(df_parts[1])  # Fill blanks in column 0 with values from column 1
Out[]:
0    California
1    California
2    California
3    California
4      New York
5           NaN
Name: 0, dtype: object

Replace curated values in original dataframe

df['State_new'] = df_parts[0].fillna(df_parts[1])

>>> df
Out[]:
               State   State_new
0    California, USA  California
1  Beverly Hills, CA  California
2         California  California
3                 CA  California
4            NY, USA    New York
5                USA         NaN
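To get back to the original question (keeping only the rows that match the dictionary), the steps above could be wrapped in a helper and followed by dropna. This is a sketch under assumptions, not a definitive implementation: normalize_state is a hypothetical function name, and it uses Series.where instead of the masked assignment shown earlier.

```python
import pandas as pd

def normalize_state(series, mapper):
    """Return the full state name found in each string, or NaN if none.

    Hypothetical helper, not part of pandas.
    """
    # Split "City, ST" style strings into one column per part.
    parts = series.str.split(', ', expand=True)
    full_names = list(mapper.values())
    # Keep parts that are already full names; translate the others
    # through the dict (unknown parts become NaN).
    resolved = parts.apply(
        lambda col: col.where(col.isin(full_names), col.map(mapper))
    )
    # Take the first non-missing candidate in each row.
    result = resolved.iloc[:, 0]
    for name in resolved.columns[1:]:
        result = result.fillna(resolved[name])
    return result

states_map = {'AR': 'Arkansas', 'CA': 'California', 'NY': 'New York'}
df = pd.DataFrame({'State': ['California, USA', 'Beverly Hills, CA',
                             'California', 'CA', 'NY, USA', 'USA']})

df['State_new'] = normalize_state(df['State'], states_map)

# Drop the rows for which no state could be resolved.
matched = df.dropna(subset=['State_new'])
print(matched)
```

Rows whose State_new is NaN (here only the 'USA' row) are the ones with no match in the dictionary, so dropna on that column removes them.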

It may not be a perfect approach, but I hope it helps.
