Python: Check if dataframe cell value exists in Dictionary. If exists replace dataframe value with dictionary key
I have a CSV file (or dataframe) that looks like this:
Text Location State
A Florida, USA Florida
B NY New York
C
D abc
and a dictionary of key-value pairs:
stat_map = {
    'FL': 'Florida',
    'NY': 'NewYork',
    'AR': 'Arkansas',
}
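How can I delete rows 3 and 4, i.e. the rows with Text C and D, so that my dataframe only contains the rows whose value I have in the dictionary? The final output should look like this: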
Text Location State
A Florida, USA Florida
B NY New York
Please help.
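What you are looking for is pandas.Series.map(). It replaces each value with the one provided by a mapper, here states_map.
I will reuse the data from your previous question to illustrate: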
import pandas as pd
states_map = {
'AK': 'Alaska',
'AL': 'Alabama',
'AR': 'Arkansas',
'CA': 'California', # Enrich the dict for the current example
'NY': 'New York' # Same as above
}
>>> df
Out[]:
State
0 California, USA
1 Beverly Hills, CA
2 California
3 CA
4 NY, USA
5 USA
Using the split approach discussed there together with map gives:
states = df['State'].str.split(', ').str[0]
>>> states
Out[]:
0 California
1 Beverly Hills
2 California
3 CA
4 NY
5 USA
Name: State, dtype: object
>>> states.map(states_map)
Out[]:
0 NaN
1 NaN
2 NaN
3 California
4 New York
5 NaN
Name: State, dtype: object
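But this is not optimal, because you lose information: the split drops the state in row 1, and map turns the full names in rows 0 and 2 into NaN.
I think it can be done better like this: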
Get all the terms from split by passing expand=True:
df_parts = df.State.str.split(', ', expand=True)
>>> df_parts
Out[]:
0 1
0 California USA
1 Beverly Hills CA
2 California None
3 CA None
4 NY USA
5 USA None
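Build a mask of the cells that already hold a full state name, i.e. one of the values of states_map:
mask = df_parts.isin(states_map.values())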
>>> df_parts[mask]
Out[]:
0 1
0 California NaN
1 NaN NaN
2 California NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
Use ~ (bitwise NOT) to get the inverse of the mask:
df_unknown = df_parts[~mask]
>>> df_unknown
Out[]:
0 1
0 NaN USA
1 Beverly Hills CA
2 NaN None
3 CA None
4 NY USA
5 USA None
Apply map to each column to translate the remaining abbreviations:
>>> df_unknown.apply(lambda col: col.map(states_map))
Out[]:
0 1
0 NaN NaN
1 NaN California
2 NaN NaN
3 California NaN
4 New York NaN
5 NaN NaN
and set these values in df_parts at the positions selected by the inverted mask:
df_parts[~mask] = df_unknown.apply(lambda col: col.map(states_map))
>>> df_parts
Out[]:
0 1
0 California NaN
1 NaN California
2 California NaN
3 California NaN
4 New York NaN
5 NaN NaN
>>> df_parts[0].fillna(df_parts[1]) # Fill blanks in column 0 with the values from column 1
Out[]:
0 California
1 California
2 California
3 California
4 New York
5 NaN
Name: 0, dtype: object
Store the selected values in the original dataframe as a new column:
df['State_new'] = df_parts[0].fillna(df_parts[1])
>>> df
Out[]:
State State_new
0 California, USA California
1 Beverly Hills, CA California
2 California California
3 CA California
4 NY, USA New York
5 USA NaN
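The question also asked to keep only the rows that have a match in the dictionary. A minimal sketch of that last step, assuming a row without a match is one where State_new ended up as NaN: it repeats the steps above in one place (using DataFrame.where instead of the masked assignment, which should give the same result here) and then drops the unmatched rows. The name df_known is only illustrative.

```python
import pandas as pd

states_map = {
    'AK': 'Alaska',
    'AL': 'Alabama',
    'AR': 'Arkansas',
    'CA': 'California',
    'NY': 'New York',
}

df = pd.DataFrame({'State': [
    'California, USA', 'Beverly Hills, CA', 'California',
    'CA', 'NY, USA', 'USA',
]})

# Split every entry into its comma-separated parts
df_parts = df['State'].str.split(', ', expand=True)

# True where a cell already holds a full state name
mask = df_parts.isin(list(states_map.values()))

# Translate abbreviations; anything unknown becomes NaN
mapped = df_parts.apply(lambda col: col.map(states_map))

# Keep full names where the mask is True, otherwise take the mapped value
df_parts = df_parts.where(mask, mapped)

df['State_new'] = df_parts[0].fillna(df_parts[1])

# Assumption: "no match in the dictionary" means State_new is NaN
df_known = df.dropna(subset=['State_new']).reset_index(drop=True)
print(df_known)
```

With the sample data, only the 'USA' row has no recognised state, so it is the only row removed.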
This may not be a perfect approach, but hopefully it helps.
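As a possible alternative, here is a more compact sketch of the same idea. It is not part of the approach above: the lookup dictionary and the normalize_state helper are hypothetical names introduced for illustration. The lookup accepts both abbreviations and full names as keys, and the helper returns the state for the first comma-separated part it recognises.

```python
import pandas as pd

states_map = {
    'AK': 'Alaska',
    'AL': 'Alabama',
    'AR': 'Arkansas',
    'CA': 'California',
    'NY': 'New York',
}

# Hypothetical lookup: abbreviations and full names both point to the full name
lookup = {**states_map, **{name: name for name in states_map.values()}}


def normalize_state(text):
    """Return the full state name of the first recognised part, else None."""
    if not isinstance(text, str):
        return None
    for part in text.split(', '):
        if part in lookup:
            return lookup[part]
    return None


df = pd.DataFrame({'State': [
    'California, USA', 'Beverly Hills, CA', 'California',
    'CA', 'NY, USA', 'USA',
]})

# Series.map also accepts a function and calls it on every value
df['State_new'] = df['State'].map(normalize_state)
print(df)
```

This takes the first recognised part rather than preferring the first column, which makes no difference for the sample data but could if an entry named two different states.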