Best way to fix pandas DataFrame with some cell values in wrong columns
I have a pandas DataFrame in which some cell values are in the wrong columns. It looks like this:
City | River | Price | Car |
---|---|---|---|
New York | Hudson | low | Volkswagen |
Los Angeles | Yukon | high | Tesla |
Las Vegas | low | Hudson | Volkswagen |
low | New York | Tesla | Yukon |
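I have lists of all possible values for the City, River, Price, and Car columns.
What is the best way to repair this dataset in pandas so that every value ends up in the correct column?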
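To make the problem concrete, here is a minimal sketch that reproduces the table above and flags the cells whose value is not in the list for their column. The `allowed` lists are an assumption inferred from the table, since the question does not show them, and the names `df`, `allowed`, and `misplaced` are chosen only for illustration.

```python
import pandas as pd

# Messy data as shown in the table: the last two rows have values in the wrong columns.
df = pd.DataFrame({
    'City':  ['New York', 'Los Angeles', 'Las Vegas', 'low'],
    'River': ['Hudson', 'Yukon', 'low', 'New York'],
    'Price': ['low', 'high', 'Hudson', 'Tesla'],
    'Car':   ['Volkswagen', 'Tesla', 'Volkswagen', 'Yukon'],
})

# Assumed lists of allowed values per column (not given in the question).
allowed = {
    'City': ['New York', 'Los Angeles', 'Las Vegas'],
    'River': ['Hudson', 'Yukon'],
    'Price': ['low', 'high'],
    'Car': ['Volkswagen', 'Tesla'],
}

# True where a cell holds a value that is not allowed in its own column.
misplaced = pd.DataFrame(
    {col: ~df[col].isin(allowed[col]) for col in df.columns}
)
print(misplaced)
```

`isin` compares whole values, so a cell is flagged only when it matches none of the entries listed for its column.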
Update:
Here is the code I wrote, which works for me:
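import numpy as np
import pandas as pd

# Sample data: some values sit in the wrong column (e.g. 'high' under name).
x = pd.DataFrame({
    'name': ['andrew', 'karl', 'jhon', 'jack', 'bob', 'high'],
    'education': ['high', 'middle', 'lviv', 'high', 'elementary', 'kyiv'],
    'city': ['lviv', 'kyiv', 'elementary', 'kharkiv', 'kyiv', 'mike'],
})
ydata = x.copy()

# Allowed values for each column.
fdict = {}
fdict['name'] = ['andrew', 'karl', 'jhon', 'jack', 'bob', 'mike']
fdict['education'] = ['high', 'middle', 'elementary']
fdict['city'] = ['lviv', 'kyiv', 'kharkiv']

# masks[col][cat]: rows where column `col` holds a value from category `cat`.
# Note: str.contains matches substrings and treats the joined values as a regex.
masks = {}
for key in fdict:
    masks[key] = {}
for key1 in fdict:
    regexx = '|'.join(fdict[key1])
    for key2 in fdict:
        masks[key2][key1] = ydata[key2].str.contains(regexx, regex=True, na=False)

# indata[col][cat]: the values found in column `col` that belong in column `cat`.
indata = {}
for key in fdict:
    indata[key] = {}
for key1 in fdict:
    for key2 in fdict:
        if key1 != key2:
            indata[key1][key2] = ydata.loc[masks[key1][key2], key1]

# Blank out every cell whose value does not belong to its column.
for key in fdict:
    fn = ~masks[key][key]
    ydata.loc[fn, key] = np.nan

# Write the saved values into the columns they belong to (aligned on the row index).
for key1 in fdict:
    for key2 in fdict:
        if key1 != key2:
            ydata.loc[masks[key1][key2], key2] = indata[key1][key2]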
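As a possible alternative to the nested loops, the same repair can be sketched with a reverse lookup (value to column) plus `melt` and `pivot`. This is only a sketch under two assumptions: no value appears in more than one list, and no row holds two values of the same category (otherwise `pivot` raises a `ValueError`). The names `value_to_col`, `long`, and `fixed` are introduced here for illustration.

```python
import pandas as pd

# Sample data and allowed values, copied from the code above.
x = pd.DataFrame({
    'name': ['andrew', 'karl', 'jhon', 'jack', 'bob', 'high'],
    'education': ['high', 'middle', 'lviv', 'high', 'elementary', 'kyiv'],
    'city': ['lviv', 'kyiv', 'elementary', 'kharkiv', 'kyiv', 'mike'],
})
fdict = {
    'name': ['andrew', 'karl', 'jhon', 'jack', 'bob', 'mike'],
    'education': ['high', 'middle', 'elementary'],
    'city': ['lviv', 'kyiv', 'kharkiv'],
}

# Reverse lookup: each allowed value -> the column it belongs to.
value_to_col = {value: col for col, values in fdict.items() for value in values}

# Long format: one line per cell, remembering which row it came from.
long = x.rename_axis('row').reset_index().melt(id_vars='row', value_name='val')

# Target column for every cell; values found in no list become NaN.
long['col'] = long['val'].map(value_to_col)

# Drop unknown values and rebuild the wide table with the original layout.
fixed = (
    long.dropna(subset=['col'])
        .pivot(index='row', columns='col', values='val')
        .reindex(index=x.index, columns=x.columns)
)
print(fixed)
```

Unlike `str.contains`, the dictionary lookup matches whole values only, so a value such as `'highway'` would not be mistaken for `'high'`.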
Another solution:
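# Assumes ydata is the DataFrame to repair and fdict the allowed-values dict from the code above.
# Replace each cell with {target column: value}; values found in no list become {}.
for col in ydata.columns:
    ydata[col] = ydata[col].apply(lambda x: {key: x for key in fdict if x in fdict[key]})
ydata_list = ydata.to_dict('split')['data']

# Dataset reconstruction: merge the per-cell dicts of each row into one dict.
update_ydata_list = []
for y in ydata_list:
    temp_dict = {}
    for d in y:
        temp_dict.update(d)
    update_ydata_list.append(temp_dict)

# Final result
pd.DataFrame(update_ydata_list)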
You can test it on the full dataset.
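Before running either approach on the full dataset, one possible sanity check is to confirm that no value is listed under more than one column, because such a value cannot be assigned to a single column unambiguously. The helper `ambiguous_values` below is hypothetical (it is not part of pandas) and is only a sketch of that check.

```python
from collections import Counter


def ambiguous_values(allowed):
    """Return the values that are listed under more than one column."""
    # set() keeps a value repeated within one list from being counted twice.
    counts = Counter(v for values in allowed.values() for v in set(values))
    return sorted(v for v, n in counts.items() if n > 1)


# Allowed values from the code above: every value belongs to exactly one column.
fdict = {
    'name': ['andrew', 'karl', 'jhon', 'jack', 'bob', 'mike'],
    'education': ['high', 'middle', 'elementary'],
    'city': ['lviv', 'kyiv', 'kharkiv'],
}
print(ambiguous_values(fdict))  # []

# Made-up example with an overlap: 'york' is listed as both a name and a city.
overlapping = {'name': ['york', 'bob'], 'city': ['york', 'kyiv']}
print(ambiguous_values(overlapping))  # ['york']
```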