Function to calculate the percentage each value has in a pandas column
I am taking part in the Titanic tutorials on Kaggle to learn pandas and machine learning.
Here is my kernel: https://www.kaggle.com/trenzalore888/titanic/titanic-learning
I want to create a function that takes two arguments, a DataFrame and a column name, and calculates the percentage of rows in each class (assuming the column is binary, i.e. 0 or 1).
I can do this hard-coded, i.e. so that it works specifically for the Titanic set, but I want a reusable function that I can use in the future.
Here is my failed attempt:
traintotal=(len(train.index))
testtotal=(len(test.index))
def Is_data_imbalanced(df, objectivecolumn):
    objectivecount= df.objectivecolumn[df.objectivecolumn > 0].sum()
    objectivecountpercentage=(objectivecount/traintotal)*100
    objectivecountrounded= np.ceil(objectivecountpercentage)
    return objectivecountrounded
Is_data_imbalanced(train,"Survived")
Unfortunately, I get an attribute error:
AttributeError: 'DataFrame' object has no attribute 'objectivecolumn'
Below is the hard-coded version that works:
traintotal=(len(train.index))
print("there are", traintotal,"rows in the train data")
testtotal=(len(test.index))
print("there are {} rows in the test data".format(testtotal))
Survialcount= train.Survived[train.Survived > 0].sum()
Survialcountpercentage=(Survialcount/traintotal)*100
print(Survialcountpercentage)
survivalcountrounded= np.ceil(Survialcountpercentage)
print(" ",survivalcountrounded,"percent survived")
Does anyone know how I can get this to work? The function seems to accept the first argument (df, for train) fine, but the second argument (the column name, standing in for .Survived) is not working.
Assuming it really is binary, all you need is:
def Is_data_imbalanced(df, objectivecolumn):
    return int(df[objectivecolumn].mean() * 100)
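The key change is the bracket lookup: df[objectivecolumn] uses the string held in the variable, whereas df.objectivecolumn looks for a column literally named "objectivecolumn", which is what raised the AttributeError. If the column might not be strictly 0/1, a slightly more general version can be sketched with value_counts(normalize=True). The function name class_percentages and the toy DataFrame below are made up for illustration and are not part of the original post or the Titanic data.

```python
import pandas as pd


def class_percentages(df, column):
    """Return the percentage of rows holding each distinct value in `column`."""
    # Bracket indexing resolves the column from the string stored in `column`;
    # attribute access (df.column) would look for a column named "column".
    return df[column].value_counts(normalize=True) * 100


# Hypothetical data for illustration only (not the Titanic set)
toy = pd.DataFrame({"Survived": [0, 1, 1, 0, 0]})

shares = class_percentages(toy, "Survived")
print(shares)

# For a 0/1 column the mean equals the share of ones,
# which is why the one-line answer works in the binary case.
print(toy["Survived"].mean() * 100)
```

Two small differences from the question's code are worth keeping in mind: int() truncates toward zero while np.ceil rounds up, and taking the mean of the column divides by the length of whatever DataFrame is passed in rather than by the global traintotal.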