Pandas - lambda - values in list and corresponding value from another column where values in list
Consider the following DataFrame:
Name identifierOne identifierTwo
0 Name1 ['12032', '444', '555'] ['aaa', 'bbb', 'ccc']
1 Name2 ['666', '51206', '777'] ['ddd', 'eee', 'fff']
2 Name3 ['111', '222', '333'] ['ggg', 'hhh', 'iii']
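For anyone who wants to reproduce the problem, here is a minimal sketch of building this frame. It assumes the cells hold real Python lists of strings; if they are instead strings that merely look like lists, they would need to be parsed first.

```python
import pandas as pd

# Assumption: each cell is an actual list of strings,
# not the single string "['12032', '444', '555']".
df = pd.DataFrame({
    'Name': ['Name1', 'Name2', 'Name3'],
    'identifierOne': [['12032', '444', '555'],
                      ['666', '51206', '777'],
                      ['111', '222', '333']],
    'identifierTwo': [['aaa', 'bbb', 'ccc'],
                      ['ddd', 'eee', 'fff'],
                      ['ggg', 'hhh', 'iii']],
})
print(df)
```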
I can get the rows where an entry in 'identifierOne' contains '120':
print(df[df['identifierOne'].apply(lambda x: '120' in str(x))][['Name', 'identifierOne', 'identifierTwo']])
This returns:
Name identifierOne identifierTwo
0 Name1 ['12032', '444', '555'] ['aaa', 'bbb', 'ccc']
1 Name2 ['666', '51206', '777'] ['ddd', 'eee', 'fff']
How can I get a) the item in the list that contains '120' and b) its corresponding value from 'identifierTwo'? Expected output:
Name identifierOne identifierTwo
0 Name1 ['12032'] ['aaa']
1 Name2 ['51206'] ['eee']
Or just as strings:
Name identifierOne identifierTwo
0 Name1 '12032' 'aaa'
1 Name2 '51206' 'eee'
>>> df
Name identifierOne identifierTwo
0 Name1 [12032, 444, 555] [aaa, bbb, ccc]
1 Name2 [666, 51206, 777] [ddd, eee, fff]
2 Name3 [111, 222, 333] [ggg, hhh, iii]
>>> s1 = df['identifierOne'].explode()
>>> s2 = df['identifierTwo'].explode()
>>> cond = s1.str.contains('120')
>>> df.assign(identifierOne=s1[cond], identifierTwo=s2[cond]).dropna()
Name identifierOne identifierTwo
0 Name1 12032 aaa
1 Name2 51206 eee
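If a single row can contain more than one match, the filtered Series carries repeated index labels and `assign` may no longer be able to align it with the original frame. One possible alternative, assuming pandas 1.3 or later (where `DataFrame.explode` accepts a list of columns) and lists of equal length in both columns, is to explode both columns together and filter the long frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Name1', 'Name2', 'Name3'],
    'identifierOne': [['12032', '444', '555'],
                      ['666', '51206', '777'],
                      ['111', '222', '333']],
    'identifierTwo': [['aaa', 'bbb', 'ccc'],
                      ['ddd', 'eee', 'fff'],
                      ['ggg', 'hhh', 'iii']],
})

# Explode both list columns in lockstep, so items stay paired by position.
# Assumption: pandas >= 1.3 and equal-length lists within each row.
long = df.explode(['identifierOne', 'identifierTwo'])

# regex=False treats '120' as a plain substring rather than a pattern.
out = long[long['identifierOne'].str.contains('120', regex=False)]
print(out)
```

Every matching item keeps its own row here, so several matches in one original row simply produce several rows with the same index label.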
Note: if the identifier columns are originally string representations of lists, use ast.literal_eval:
>>> from ast import literal_eval
>>> df[['identifierOne', 'identifierTwo']] = (
...     df.filter(like='identifier').applymap(literal_eval)
... )
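`DataFrame.applymap` is deprecated in recent pandas releases (2.1 and later) in favor of `DataFrame.map`. A sketch that sidesteps the rename by parsing each column with `Series.apply` is shown below; the string-valued input frame is a hypothetical stand-in for data that arrived as text, for example after a CSV round trip.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical input: the list columns are strings that look like lists.
df = pd.DataFrame({
    'Name': ['Name1', 'Name2'],
    'identifierOne': ["['12032', '444', '555']", "['666', '51206', '777']"],
    'identifierTwo': ["['aaa', 'bbb', 'ccc']", "['ddd', 'eee', 'fff']"],
})

for col in ['identifierOne', 'identifierTwo']:
    # literal_eval only parses Python literals; it does not run arbitrary code.
    df[col] = df[col].apply(literal_eval)

print(type(df.loc[0, 'identifierOne']))
```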
You can try converting to lists and then using explode, concat and df.query. We can do this as follows:
First, convert the string representations of lists into actual lists (skip this step if the input already contains lists):
import ast
df[['identifierOne', 'identifierTwo']] = (df[['identifierOne', 'identifierTwo']]
                                          .applymap(ast.literal_eval))
Explode the columns and concatenate them, then use df.query to filter the required rows, and finally join the 'Name' column.
cols = ['identifierOne','identifierTwo']
out = (pd.concat([df[col].explode() for col in cols], axis=1, keys=cols)
       .query("identifierOne.str.contains('120')", engine='python')
       .join(df[['Name']]))
Or method 2, using a callable:
cols = ['identifierOne','identifierTwo']
out = (pd.concat([df[col].explode() for col in cols], axis=1, keys=cols)
       .join(df[['Name']])
       .loc[lambda x: x['identifierOne'].str.contains('120')])
print(out)
identifierOne identifierTwo Name
0 12032 aaa Name1
1 51206 eee Name2
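The question also asks for a list-valued form of the result. One way to get there from the exploded, filtered frame is to regroup and collect the matches back into lists. This is only a sketch, and it assumes that `Name` is unique per original row.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Name1', 'Name2', 'Name3'],
    'identifierOne': [['12032', '444', '555'],
                      ['666', '51206', '777'],
                      ['111', '222', '333']],
    'identifierTwo': [['aaa', 'bbb', 'ccc'],
                      ['ddd', 'eee', 'fff'],
                      ['ggg', 'hhh', 'iii']],
})

cols = ['identifierOne', 'identifierTwo']

# Same explode-and-concat step as in the answer, followed by a plain boolean filter.
long = (pd.concat([df[col].explode() for col in cols], axis=1, keys=cols)
        .join(df[['Name']]))
hits = long[long['identifierOne'].str.contains('120', regex=False)]

# Collect the surviving items back into one list per Name.
# Assumption: Name identifies an original row uniquely.
as_lists = hits.groupby('Name', sort=False)[cols].agg(list).reset_index()
print(as_lists)
```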
Here is my entire thought process:
In [314]: df = pd.DataFrame(dict(Name='Name1 Name2 Name3'.split(), id1=[['12032', '444', '555'], ['666', '51206', '777'], ['111', '222', '333']], id2=[['aaa', 'bbb', 'ccc'], ['ddd', 'eee', 'fff'], ['ggg', 'hhh', 'iii']]))
In [315]: df['id1e'] = df.id1.apply(lambda L:list(enumerate(L)))
In [316]: df['id2e'] = df.id2.apply(lambda L:list(enumerate(L)))
In [317]: df
Out[317]:
Name id1 id2 id1e id2e
0 Name1 [12032, 444, 555] [aaa, bbb, ccc] [(0, 12032), (1, 444), (2, 555)] [(0, aaa), (1, bbb), (2, ccc)]
1 Name2 [666, 51206, 777] [ddd, eee, fff] [(0, 666), (1, 51206), (2, 777)] [(0, ddd), (1, eee), (2, fff)]
2 Name3 [111, 222, 333] [ggg, hhh, iii] [(0, 111), (1, 222), (2, 333)] [(0, ggg), (1, hhh), (2, iii)]
In [318]: df.drop('id1 id2'.split(), axis=1, inplace=True)
In [319]: df
Out[319]:
Name id1e id2e
0 Name1 [(0, 12032), (1, 444), (2, 555)] [(0, aaa), (1, bbb), (2, ccc)]
1 Name2 [(0, 666), (1, 51206), (2, 777)] [(0, ddd), (1, eee), (2, fff)]
2 Name3 [(0, 111), (1, 222), (2, 333)] [(0, ggg), (1, hhh), (2, iii)]
In [320]: df.explode('id1e')
Out[320]:
Name id1e id2e
0 Name1 (0, 12032) [(0, aaa), (1, bbb), (2, ccc)]
0 Name1 (1, 444) [(0, aaa), (1, bbb), (2, ccc)]
0 Name1 (2, 555) [(0, aaa), (1, bbb), (2, ccc)]
1 Name2 (0, 666) [(0, ddd), (1, eee), (2, fff)]
1 Name2 (1, 51206) [(0, ddd), (1, eee), (2, fff)]
1 Name2 (2, 777) [(0, ddd), (1, eee), (2, fff)]
2 Name3 (0, 111) [(0, ggg), (1, hhh), (2, iii)]
2 Name3 (1, 222) [(0, ggg), (1, hhh), (2, iii)]
2 Name3 (2, 333) [(0, ggg), (1, hhh), (2, iii)]
In [321]: df = df.explode('id1e')
In [322]: df = df.explode('id2e')
In [323]: df
Out[323]:
Name id1e id2e
0 Name1 (0, 12032) (0, aaa)
0 Name1 (0, 12032) (1, bbb)
0 Name1 (0, 12032) (2, ccc)
0 Name1 (1, 444) (0, aaa)
0 Name1 (1, 444) (1, bbb)
0 Name1 (1, 444) (2, ccc)
0 Name1 (2, 555) (0, aaa)
0 Name1 (2, 555) (1, bbb)
0 Name1 (2, 555) (2, ccc)
1 Name2 (0, 666) (0, ddd)
1 Name2 (0, 666) (1, eee)
1 Name2 (0, 666) (2, fff)
1 Name2 (1, 51206) (0, ddd)
1 Name2 (1, 51206) (1, eee)
1 Name2 (1, 51206) (2, fff)
1 Name2 (2, 777) (0, ddd)
1 Name2 (2, 777) (1, eee)
1 Name2 (2, 777) (2, fff)
2 Name3 (0, 111) (0, ggg)
2 Name3 (0, 111) (1, hhh)
2 Name3 (0, 111) (2, iii)
2 Name3 (1, 222) (0, ggg)
2 Name3 (1, 222) (1, hhh)
2 Name3 (1, 222) (2, iii)
2 Name3 (2, 333) (0, ggg)
2 Name3 (2, 333) (1, hhh)
2 Name3 (2, 333) (2, iii)
In [324]: df['id1i'] = df.id1e.apply(lambda t:t[0])
In [325]: df
Out[325]:
Name id1e id2e id1i
0 Name1 (0, 12032) (0, aaa) 0
0 Name1 (0, 12032) (1, bbb) 0
0 Name1 (0, 12032) (2, ccc) 0
0 Name1 (1, 444) (0, aaa) 1
0 Name1 (1, 444) (1, bbb) 1
0 Name1 (1, 444) (2, ccc) 1
0 Name1 (2, 555) (0, aaa) 2
0 Name1 (2, 555) (1, bbb) 2
0 Name1 (2, 555) (2, ccc) 2
1 Name2 (0, 666) (0, ddd) 0
1 Name2 (0, 666) (1, eee) 0
1 Name2 (0, 666) (2, fff) 0
1 Name2 (1, 51206) (0, ddd) 1
1 Name2 (1, 51206) (1, eee) 1
1 Name2 (1, 51206) (2, fff) 1
1 Name2 (2, 777) (0, ddd) 2
1 Name2 (2, 777) (1, eee) 2
1 Name2 (2, 777) (2, fff) 2
2 Name3 (0, 111) (0, ggg) 0
2 Name3 (0, 111) (1, hhh) 0
2 Name3 (0, 111) (2, iii) 0
2 Name3 (1, 222) (0, ggg) 1
2 Name3 (1, 222) (1, hhh) 1
2 Name3 (1, 222) (2, iii) 1
2 Name3 (2, 333) (0, ggg) 2
2 Name3 (2, 333) (1, hhh) 2
2 Name3 (2, 333) (2, iii) 2
In [326]: df['id2i'] = df.id2e.apply(lambda t:t[0])
In [327]: df
Out[327]:
Name id1e id2e id1i id2i
0 Name1 (0, 12032) (0, aaa) 0 0
0 Name1 (0, 12032) (1, bbb) 0 1
0 Name1 (0, 12032) (2, ccc) 0 2
0 Name1 (1, 444) (0, aaa) 1 0
0 Name1 (1, 444) (1, bbb) 1 1
0 Name1 (1, 444) (2, ccc) 1 2
0 Name1 (2, 555) (0, aaa) 2 0
0 Name1 (2, 555) (1, bbb) 2 1
0 Name1 (2, 555) (2, ccc) 2 2
1 Name2 (0, 666) (0, ddd) 0 0
1 Name2 (0, 666) (1, eee) 0 1
1 Name2 (0, 666) (2, fff) 0 2
1 Name2 (1, 51206) (0, ddd) 1 0
1 Name2 (1, 51206) (1, eee) 1 1
1 Name2 (1, 51206) (2, fff) 1 2
1 Name2 (2, 777) (0, ddd) 2 0
1 Name2 (2, 777) (1, eee) 2 1
1 Name2 (2, 777) (2, fff) 2 2
2 Name3 (0, 111) (0, ggg) 0 0
2 Name3 (0, 111) (1, hhh) 0 1
2 Name3 (0, 111) (2, iii) 0 2
2 Name3 (1, 222) (0, ggg) 1 0
2 Name3 (1, 222) (1, hhh) 1 1
2 Name3 (1, 222) (2, iii) 1 2
2 Name3 (2, 333) (0, ggg) 2 0
2 Name3 (2, 333) (1, hhh) 2 1
2 Name3 (2, 333) (2, iii) 2 2
In [328]: df['id1'] = df.id1e.apply(lambda t: t[1])
In [329]: df['id2'] = df.id2e.apply(lambda t: t[1])
In [330]: df
Out[330]:
Name id1e id2e id1i id2i id1 id2
0 Name1 (0, 12032) (0, aaa) 0 0 12032 aaa
0 Name1 (0, 12032) (1, bbb) 0 1 12032 bbb
0 Name1 (0, 12032) (2, ccc) 0 2 12032 ccc
0 Name1 (1, 444) (0, aaa) 1 0 444 aaa
0 Name1 (1, 444) (1, bbb) 1 1 444 bbb
0 Name1 (1, 444) (2, ccc) 1 2 444 ccc
0 Name1 (2, 555) (0, aaa) 2 0 555 aaa
0 Name1 (2, 555) (1, bbb) 2 1 555 bbb
0 Name1 (2, 555) (2, ccc) 2 2 555 ccc
1 Name2 (0, 666) (0, ddd) 0 0 666 ddd
1 Name2 (0, 666) (1, eee) 0 1 666 eee
1 Name2 (0, 666) (2, fff) 0 2 666 fff
1 Name2 (1, 51206) (0, ddd) 1 0 51206 ddd
1 Name2 (1, 51206) (1, eee) 1 1 51206 eee
1 Name2 (1, 51206) (2, fff) 1 2 51206 fff
1 Name2 (2, 777) (0, ddd) 2 0 777 ddd
1 Name2 (2, 777) (1, eee) 2 1 777 eee
1 Name2 (2, 777) (2, fff) 2 2 777 fff
2 Name3 (0, 111) (0, ggg) 0 0 111 ggg
2 Name3 (0, 111) (1, hhh) 0 1 111 hhh
2 Name3 (0, 111) (2, iii) 0 2 111 iii
2 Name3 (1, 222) (0, ggg) 1 0 222 ggg
2 Name3 (1, 222) (1, hhh) 1 1 222 hhh
2 Name3 (1, 222) (2, iii) 1 2 222 iii
2 Name3 (2, 333) (0, ggg) 2 0 333 ggg
2 Name3 (2, 333) (1, hhh) 2 1 333 hhh
2 Name3 (2, 333) (2, iii) 2 2 333 iii
In [331]: df.drop('id1e id2e'.split(), axis=1, inplace=True)
In [332]: df
Out[332]:
Name id1i id2i id1 id2
0 Name1 0 0 12032 aaa
0 Name1 0 1 12032 bbb
0 Name1 0 2 12032 ccc
0 Name1 1 0 444 aaa
0 Name1 1 1 444 bbb
0 Name1 1 2 444 ccc
0 Name1 2 0 555 aaa
0 Name1 2 1 555 bbb
0 Name1 2 2 555 ccc
1 Name2 0 0 666 ddd
1 Name2 0 1 666 eee
1 Name2 0 2 666 fff
1 Name2 1 0 51206 ddd
1 Name2 1 1 51206 eee
1 Name2 1 2 51206 fff
1 Name2 2 0 777 ddd
1 Name2 2 1 777 eee
1 Name2 2 2 777 fff
2 Name3 0 0 111 ggg
2 Name3 0 1 111 hhh
2 Name3 0 2 111 iii
2 Name3 1 0 222 ggg
2 Name3 1 1 222 hhh
2 Name3 1 2 222 iii
2 Name3 2 0 333 ggg
2 Name3 2 1 333 hhh
2 Name3 2 2 333 iii
In [333]: df[df.id1.apply(lambda x: '120' in str(x))]
Out[333]:
Name id1i id2i id1 id2
0 Name1 0 0 12032 aaa
0 Name1 0 1 12032 bbb
0 Name1 0 2 12032 ccc
1 Name2 1 0 51206 ddd
1 Name2 1 1 51206 eee
1 Name2 1 2 51206 fff
In [334]: df = df[df.id1.apply(lambda x: '120' in str(x))]
In [335]: df[df.id1i == df.id2i]
Out[335]:
Name id1i id2i id1 id2
0 Name1 0 0 12032 aaa
1 Name2 1 1 51206 eee
In [336]: df[df.id1i == df.id2i]['id1 id2'.split()]
Out[336]:
id1 id2
0 12032 aaa
1 51206 eee
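The session above explodes the two columns independently, which yields every combination of positions within a row, and then keeps only the rows where `id1i == id2i`. A more compact sketch of the same positional pairing, using plain `zip` so that the intermediate cross product is never built, might look like this (it assumes both lists in a row have the same length):

```python
import pandas as pd

df = pd.DataFrame(dict(
    Name='Name1 Name2 Name3'.split(),
    id1=[['12032', '444', '555'], ['666', '51206', '777'], ['111', '222', '333']],
    id2=[['aaa', 'bbb', 'ccc'], ['ddd', 'eee', 'fff'], ['ggg', 'hhh', 'iii']],
))

rows = [
    (name, a, b)
    for name, l1, l2 in zip(df['Name'], df['id1'], df['id2'])
    for a, b in zip(l1, l2)      # pair items by position within the row
    if '120' in a                # keep only pairs whose id1 item contains '120'
]
out = pd.DataFrame(rows, columns=['Name', 'id1', 'id2'])
print(out)
```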
Here is a function that can be used with apply to iterate over the data and write the results to a DataFrame named output.
# construct an output df
output = pd.DataFrame(index=df.index, columns=df.columns)
output['Name'] = df['Name']

def findvalue(df, value):
    # note: with axis=1, `df` here is a single row (a Series), not the whole frame
    # check the words which contain the value
    inlist = [value in word for word in df['identifierOne']]
    try:
        # this will throw an error if True is not found
        index = inlist.index(True)
        # but if there is a True, write the correct things to `output`
        one = df['identifierOne'][index]
        two = df['identifierTwo'][index]
        output.loc[df.name, 'identifierOne'] = one
        output.loc[df.name, 'identifierTwo'] = two
    except ValueError:
        return
With this, you can apply the function like so:
lookfor = '120'
df.apply(findvalue, axis=1, value=lookfor)
The result (i.e. output):
Name identifierOne identifierTwo
0 Name1 12032 aaa
1 Name2 51206 eee
2 Name3 NaN NaN
# note that these are strings, all dtypes == object
This relies on heavy looping, so I guess it is not the fastest answer. But I think the logic is a bit more basic.
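A quick note: the inlist.index(True) operation only returns the index of the first True in the list. If you expect the value to occur more than once in a cell, you can use the following findvalue instead: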
def findvalue(df, value):
    # note: with axis=1, `df` here is a single row (a Series), not the whole frame
    # check the words which contain the value
    inlist = [value in word for word in df['identifierOne']]
    one = []
    two = []
    # now we explicitly check all of the booleans in `inlist`
    for i, boolean in enumerate(inlist):
        if boolean:
            one.append(df['identifierOne'][i])
            two.append(df['identifierTwo'][i])
    # only write to `output` if there is something to write
    if one:
        output.loc[df.name, 'identifierOne'] = one
        output.loc[df.name, 'identifierTwo'] = two
For the same example, the results are now in lists (of strings):
Name identifierOne identifierTwo
0 Name1 [12032] [aaa]
1 Name2 [51206] [eee]
2 Name3 NaN NaN
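Both findvalue variants write into a module-level output frame as a side effect. A variation that avoids the shared state is to have the function return a small Series per row and let apply assemble the result. The sketch below covers the first-match case only, and `find_first` is a hypothetical name that is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Name1', 'Name2', 'Name3'],
    'identifierOne': [['12032', '444', '555'],
                      ['666', '51206', '777'],
                      ['111', '222', '333']],
    'identifierTwo': [['aaa', 'bbb', 'ccc'],
                      ['ddd', 'eee', 'fff'],
                      ['ggg', 'hhh', 'iii']],
})

def find_first(row, value):
    """Return the first item containing `value` and its partner, or missing values."""
    for one, two in zip(row['identifierOne'], row['identifierTwo']):
        if value in one:
            return pd.Series({'identifierOne': one, 'identifierTwo': two})
    # No match in this row: return placeholders so every row has the same columns.
    return pd.Series({'identifierOne': None, 'identifierTwo': None})

# With axis=1, apply passes each row to find_first and stacks the returned Series.
found = df.apply(find_first, axis=1, value='120')
result = df[['Name']].join(found)
print(result)
```

Rows without a match keep missing values, which mirrors the NaN row in the output shown above; a `dropna()` on the result would remove them.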
You can do the following using apply, without any additional imports:
import pandas as pd
import numpy as np
df = pd.DataFrame([['Name1', ['12032', '444', '555'], ['aaa', 'bbb', 'ccc']],
                   ['Name2', ['666', '51206', '777'], ['ddd', 'eee', 'fff']],
                   ['Name3', ['111', '222', '333'], ['ggg', 'hhh', 'iii']]],
                  columns=['Name', 'identifierOne', 'identifierTwo'])
# this loops the items inside the series in the apply function
idx = df['identifierOne'].apply(lambda x: ''.join([str(x.index(y)) if '120' in str(y) else '' for y in x]))
rowindex = df[idx != ''].index
listindex = idx.iloc[rowindex].astype(int)
listindex.name = 'listindex'
subset = df[df.index.isin(rowindex)]
subset.index = subset.index.astype(int)
concat = pd.merge(subset, listindex, left_index=True, right_index=True)
concat['identifierOne'] = concat.apply(lambda x: x['identifierOne'][x['listindex']], axis=1)
concat['identifierTwo'] = concat.apply(lambda x: x['identifierTwo'][x['listindex']], axis=1)
This gives the result:
concat[['Name','identifierOne','identifierTwo']]
Name identifierOne identifierTwo
0 Name1 12032 aaa
1 Name2 51206 eee
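The approach above encodes the matching position as text by joining index digits, which only behaves as intended while each row has at most one match. A sketch of the same idea that records the first matching position as an integer instead follows; `first_match_pos` is a hypothetical helper name introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame([['Name1', ['12032', '444', '555'], ['aaa', 'bbb', 'ccc']],
                   ['Name2', ['666', '51206', '777'], ['ddd', 'eee', 'fff']],
                   ['Name3', ['111', '222', '333'], ['ggg', 'hhh', 'iii']]],
                  columns=['Name', 'identifierOne', 'identifierTwo'])

def first_match_pos(items, needle):
    # Position of the first item containing `needle`, or None if there is none.
    return next((i for i, item in enumerate(items) if needle in str(item)), None)

pos = df['identifierOne'].apply(lambda items: first_match_pos(items, '120'))
has_match = pos.notna()

# Keep only the rows with a match, then pick the item at the recorded position.
hits = df.loc[has_match, ['Name']].copy()
hits['identifierOne'] = [lst[int(i)] for lst, i
                         in zip(df.loc[has_match, 'identifierOne'], pos[has_match])]
hits['identifierTwo'] = [lst[int(i)] for lst, i
                         in zip(df.loc[has_match, 'identifierTwo'], pos[has_match])]
print(hits)
```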