Subsetting a dataframe in pandas according to column name values
Looping over columns and column values for subsetting pandas dataframe
I have a dataframe df as follows:
ID  color  finish  duration
A1  black  smooth  12
A2  white  matte   8
A3  blue   smooth  20
A4  green  matte   10
B1  black  smooth  12
B2  white  matte   8
B3  blue   smooth
B4  green          10
C1  black  smooth
C2  white  matte   8
C3  blue   smooth
C4  green          10
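To make the examples reproducible, the table above can be rebuilt as a dataframe. This is a minimal sketch that assumes the blank cells are missing values (None in finish, NaN in duration); the question does not say how they are actually stored.

```python
import numpy as np
import pandas as pd

# Sample data from the question. Blank cells are assumed to be missing values.
df = pd.DataFrame({
    "ID": ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4", "C1", "C2", "C3", "C4"],
    "color": ["black", "white", "blue", "green"] * 3,
    "finish": ["smooth", "matte", "smooth", "matte",
               "smooth", "matte", "smooth", None,
               "smooth", "matte", "smooth", None],
    "duration": [12, 8, 20, 10,
                 12, 8, np.nan, 10,
                 np.nan, 8, np.nan, 10],
})
print(df)
```

Because duration contains NaN, pandas stores that column as float, which is why values such as 12.0 appear when the frame is printed.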
I want to generate subsets of this dataframe based on certain conditions. For example, for color = black, finish = smooth, duration = 12, I get the following dataframe.
ID  color  finish  duration  score
A1  black  smooth  12        1
B1  black  smooth  12        1
For color = blue, finish = smooth, duration = 20, I get the following dataframe.
ID  color  finish  duration  score
A3  blue   smooth  20        1
B3  blue   smooth            0.666667
C3  blue   smooth            0.666667
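The score is calculated as the number of populated columns divided by the total number of columns. I want to do this in a loop over the pandas dataframe. The following code worked for me for two columns.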
import pandas as pd

gbl = globals()  # assumption: gbl refers to the module namespace
list2 = list(df['color'].dropna().unique())
list3 = list(df['finish'].dropna().unique())
df_final = pd.DataFrame()
for i in range(len(list2)):
    for j in range(len(list3)):
        print('Current Attribute Value:', list2[i], list3[j])
        # subset on color first, then on finish
        gbl["df_" + list2[i]] = df[df.color == list2[i]]
        gbl["df_" + list2[i] + list3[j]] = gbl["df_" + list2[i]].loc[gbl["df_" + list2[i]].finish == list3[j]].copy()
        gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df_final = pd.concat([df_final, gbl["df_" + list2[i] + list3[j]]], ignore_index=True)
However, I cannot loop over the column names. What I would like to do is:
lista = ['color', 'finish']
df_final = pd.DataFrame()
for a in range(len(lista)):
    for i in range(len(list2)):
        for j in range(len(list3)):
            print('Current Attribute Value:', lista[a], list2[i], lista[a+1], list3[j])
            gbl["df_" + list2[i]] = df[df.lista[a] == list2[i]]  # this line raises the error
            gbl["df_" + list2[i] + list3[j]] = gbl["df_" + list2[i]].loc[gbl["df_" + list2[i]].lista[a+1] == list3[j]]
            gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
            df_final = df_final.append(gbl["df_" + list2[i] + list3[j]], ignore_index=True)
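I get the obvious error:
AttributeError: 'DataFrame' object has no attribute 'lista'.
Does anyone know how to loop over both the column names and their values? Thanks in advance!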
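The AttributeError comes from attribute access: df.lista is read as a column literally named lista. Bracket indexing, df[lista[a]], looks up the column whose name is stored in the list. The sketch below loops over column names that way. The cut-down sample frame, the subsets dictionary, and the use of itertools.product are illustrative assumptions, not part of the original code.

```python
import itertools

import pandas as pd

# Cut-down sample frame (illustrative only).
df = pd.DataFrame({
    "ID": ["A1", "A2", "A3", "A4"],
    "color": ["black", "white", "blue", "green"],
    "finish": ["smooth", "matte", "smooth", "matte"],
    "duration": [12, 8, 20, 10],
})

lista = ["color", "finish"]  # column names to loop over

# Unique non-missing values per column, looked up by name with df[col].
values = [df[col].dropna().unique() for col in lista]

subsets = {}  # hypothetical container: one dataframe per value combination
for combo in itertools.product(*values):
    mask = pd.Series(True, index=df.index)
    for col, val in zip(lista, combo):
        mask &= df[col] == val  # bracket indexing, not df.col
    if mask.any():  # keep only combinations that occur in the data
        subsets[combo] = df[mask]

for combo, sub in subsets.items():
    print("Current Attribute Value:", combo)
    print(sub)
```

Keeping the subsets in a dictionary keyed by the value combination avoids writing dynamically named variables into globals(), and adding a third column only means extending lista.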
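I am not quite sure what you need, but consider using a list comprehension to build every combination of the lists, which avoids nested loops, and storing the results in a dictionary of dataframes. You can probably adjust the scorecalc() apply function to fit your needs: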
import pandas as pd

colorlist = list(df['color'].unique())
finishlist = list(df['finish'].unique())
durationlist = list(df['duration'].unique())

# ALL COMBINATIONS BETWEEN LISTS
allList = [(c, f, d) for c in colorlist for f in finishlist for d in durationlist]

def scorecalc(row):
    # row is the sub-dataframe for one (color, finish) group
    row['score'] = row['duration'].count()
    return row

dfList = []
dfDict = {}
for i in allList:
    # SUBSET DFS
    tempdf = df[(df['color'] == i[0]) & (df['finish'] == i[1]) & (df['duration'] == i[2])]
    if len(tempdf) > 0:  # FOR NON-EMPTY DFS
        print('Current Attribute Value:', i[0], i[1], i[2])
        # group_keys=False keeps the original row index on newer pandas versions
        tempdf = tempdf.groupby(['color', 'finish'], group_keys=False).apply(scorecalc)
        tempdf['score'] = tempdf['score'] / len(tempdf)
        print(tempdf)
        key = str(i[0]) + str(i[1]) + str(i[2])
        dfDict[key] = tempdf   # DICTIONARY OF DFS (USE pd.concat(list(dfDict.values())) FOR FINAL DF)
        dfList.append(tempdf)  # LIST OF DFS (USE pd.concat() FOR FINAL DF)
# Current Attribute Value: black smooth 12.0
#    ID  color  finish  duration  score
# 0  A1  black  smooth      12.0    1.0
# 4  B1  black  smooth      12.0    1.0
# Current Attribute Value: white matte 8.0
#    ID  color  finish  duration  score
# 1  A2  white  matte        8.0    1.0
# 5  B2  white  matte        8.0    1.0
# 9  C2  white  matte        8.0    1.0
# Current Attribute Value: blue smooth 20.0
#    ID  color  finish  duration  score
# 2  A3  blue   smooth      20.0    1.0
# Current Attribute Value: green matte 10.0
#    ID  color  finish  duration  score
# 3  A4  green  matte       10.0    1.0
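The output above contains only rows where every attribute is populated, so the score is always 1.0. If rows with blank cells should also appear, scored as populated columns divided by total columns as the question describes, one possible reading is to treat a blank cell as matching any value. The sketch below follows that assumption. The helper name subset_with_score and the wildcard rule are hypothetical, not something the question or the answer above confirms.

```python
import numpy as np
import pandas as pd

# Sample data from the question. Blank cells are assumed to be missing values.
df = pd.DataFrame({
    "ID": ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4", "C1", "C2", "C3", "C4"],
    "color": ["black", "white", "blue", "green"] * 3,
    "finish": ["smooth", "matte", "smooth", "matte",
               "smooth", "matte", "smooth", None,
               "smooth", "matte", "smooth", None],
    "duration": [12, 8, 20, 10,
                 12, 8, np.nan, 10,
                 np.nan, 8, np.nan, 10],
})

def subset_with_score(frame, conditions):
    """Return rows whose populated cells agree with `conditions`.

    Assumption: a missing cell matches any requested value.
    """
    cols = list(conditions)
    mask = pd.Series(True, index=frame.index)
    for col, val in conditions.items():
        mask &= frame[col].isna() | (frame[col] == val)
    out = frame[mask].copy()
    # score = populated condition columns / total condition columns
    out["score"] = out[cols].notna().sum(axis=1) / len(cols)
    return out

result = subset_with_score(df, {"color": "blue", "finish": "smooth", "duration": 20})
print(result)
```

For color = blue, finish = smooth, duration = 20 this returns A3 with a score of 1 and B3 and C3 with a score of 2/3, which matches the second example in the question. Under the same rule, color = black, finish = smooth, duration = 12 would also return C1 with a score of 2/3, which the first example in the question does not show, so the matching rule may need adjusting to the intended behaviour.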