python pandas: applying a for loop and the groupby function
I'm new to Python and not familiar with how to iterate over a groupby in pandas. I adapted the code below, and it works well for building a pandas DataFrame:
from itertools import groupby

import pandas as pd

i = ['J,Smith,200 G Ct,',
     'E,Johnson,200 G Ct,',
     'A,Johnson,200 G Ct,',
     'M,Simpson,63 F Wy,',
     'L,Diablo,60 N Blvd,',
     'H,Simpson,63 F Wy,',
     'B,Simpson,63 F Wy,']
dbn = []
dba = []
# Sort the rows by last name (then address), then group consecutive rows by address.
for z, g in groupby(
        sorted([l.split(',') for l in i], key=lambda x: x[1:]),
        lambda x: x[2:]):
    l = list(g)
    r = len(l)
    Address = ','.join(z)
    o = l[0]
    if r > 2:
        dbn.append('The ' + o[1] + " Family,")
        dba.append(Address)
    elif r > 1:
        dbn.append(o[0] + " and " + l[1][0] + ", " + o[1] + ",")
        dba.append(Address)
    else:
        dbn.append(o[0] + " " + o[1])
        # print(','.join(o))
        dba.append(Address)
Hdf = pd.DataFrame({'Address': dba, 'Name': dbn})
print(Hdf)
      Address                 Name
0  60 N Blvd,             L Diablo
1   200 G Ct,    E and A, Johnson,
2    63 F Wy,  The Simpson Family,
3   200 G Ct,              J Smith
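A side note on why the loop above sorts before grouping: `itertools.groupby` only merges items that sit next to each other, so the sort order decides which rows end up in the same group. Here the rows are sorted by last name and then grouped by address, which is why "200 G Ct" shows up twice in the output. The snippet below is a minimal sketch of that behaviour; the `addresses` list and the two result names are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample: the same address appears twice, but not adjacently.
addresses = ['200 G Ct', '63 F Wy', '200 G Ct']

# Without sorting, each run of equal neighbours becomes its own group.
unsorted_runs = [(key, len(list(group))) for key, group in groupby(addresses)]

# After sorting, equal values are adjacent and collapse into one group.
sorted_runs = [(key, len(list(group))) for key, group in groupby(sorted(addresses))]

print(unsorted_runs)
print(sorted_runs)
```

pandas' own `groupby` does not have this adjacency requirement, which is one reason the DataFrame version needs no explicit sort.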
If I start from a pandas DataFrame instead of the raw CSV data, how should I modify the for loop to produce the same result?
df = pd.DataFrame({'Name': ['J', 'E', 'A', 'M', 'L', 'H', 'B'],
                   'Lastname': ['Smith', 'Johnson', 'Johnson', 'Simpson', 'Diablo', 'Simpson', 'Simpson'],
                   'Address': ['200 G Ct', '200 G Ct', '200 G Ct', '63 F Wy', '60 N Blvd', '63 F Wy', '63 F Wy']})
First, create a helper function and group the data by Address and Lastname:
def helper(k, g):
    # k is the (Address, Lastname) group key; g is the group's sub-DataFrame.
    r = len(g)
    address, lastname = k
    if r > 2:
        lastname = 'The {} Family'.format(lastname)
    elif r > 1:
        lastname = ' and '.join(g['Name']) + ', ' + lastname
    else:
        lastname = g['Name'].squeeze() + ' ' + lastname
    return (address, lastname)

grouped = df.groupby(['Address', 'Lastname'])
Then create a generator that applies the helper function to each group:
vals = (helper(k, g) for k, g in grouped)
Then build the resulting DataFrame from it:
pd.DataFrame(vals, columns=['Address','Name'])
     Address                Name
0   200 G Ct    E and A, Johnson
1   200 G Ct             J Smith
2  60 N Blvd            L Diablo
3    63 F Wy  The Simpson Family
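Since the question is about iterating over a groupby, it may help to see what that iteration actually yields. The sketch below assumes the same sample data as above; `names_by_group` is a made-up name for illustration. When grouping by a list of two columns, each step of the loop gives a tuple key and the matching sub-DataFrame, which is exactly what `helper(k, g)` receives.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['J', 'E', 'A', 'M', 'L', 'H', 'B'],
    'Lastname': ['Smith', 'Johnson', 'Johnson', 'Simpson', 'Diablo', 'Simpson', 'Simpson'],
    'Address': ['200 G Ct', '200 G Ct', '200 G Ct', '63 F Wy', '60 N Blvd', '63 F Wy', '63 F Wy'],
})

names_by_group = {}
for key, group in df.groupby(['Address', 'Lastname']):
    # key is an (Address, Lastname) tuple; group holds only that key's rows.
    names_by_group[key] = list(group['Name'])

for key, names in names_by_group.items():
    print(key, names)
```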
Group the data by Address and Lastname, then build a new DataFrame that holds each group's size and a string joining the first two names in the group:
grouped = df.groupby(['Address', 'Lastname'])
res = grouped.apply(lambda x: pd.Series({'Len': len(x), 'Names': ' and '.join(x['Name'][:2])})).reset_index()
     Address  Lastname  Len    Names
0   200 G Ct   Johnson    2  E and A
1   200 G Ct     Smith    1        J
2  60 N Blvd    Diablo    1        L
3    63 F Wy   Simpson    3  M and H
Now just apply ordinary pandas transformations and delete the columns that are no longer needed:
# .ix was removed in pandas 1.0; .loc is the label-based replacement.
res.loc[res['Len'] > 2, 'Lastname'] = 'The ' + res['Lastname'] + ' Family'
res.loc[res['Len'] == 2, 'Lastname'] = res['Names'] + ', ' + res['Lastname']
res.loc[res['Len'] < 2, 'Lastname'] = res['Names'] + ' ' + res['Lastname']
del res['Len']
del res['Names']
     Address            Lastname
0   200 G Ct    E and A, Johnson
1   200 G Ct             J Smith
2  60 N Blvd            L Diablo
3    63 F Wy  The Simpson Family
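On recent pandas versions the same table could probably also be built without overwriting the Lastname column in place. The following is only a sketch of one possible variant, not the method from the answer above: it assumes pandas 0.25 or later (for named aggregation), and the names `summary`, `name` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['J', 'E', 'A', 'M', 'L', 'H', 'B'],
    'Lastname': ['Smith', 'Johnson', 'Johnson', 'Simpson', 'Diablo', 'Simpson', 'Simpson'],
    'Address': ['200 G Ct', '200 G Ct', '200 G Ct', '63 F Wy', '60 N Blvd', '63 F Wy', '63 F Wy'],
})

# One row per (Address, Lastname) group: its size and the first two names joined.
summary = (
    df.groupby(['Address', 'Lastname'])['Name']
      .agg(Len='size', Names=lambda s: ' and '.join(s.iloc[:2]))
      .reset_index()
)

# Start from the single-person form, then overwrite pairs and larger families.
name = summary['Names'] + ' ' + summary['Lastname']
name = name.mask(summary['Len'] == 2, summary['Names'] + ', ' + summary['Lastname'])
name = name.mask(summary['Len'] > 2, 'The ' + summary['Lastname'] + ' Family')

result = pd.DataFrame({'Address': summary['Address'], 'Name': name})
print(result)
```

Keeping the label in a separate `name` Series leaves `summary` untouched, so the three cases can be checked against the group sizes afterwards.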