[英]Calculating Frequency and % of frequency for all categorical variables in a data frame Python
I am new to python and I am working on a requirement to list all unique value in a categorical column along with frequency of each value and the % frequency of each value in a column and using a for loop to perform it on the complete dataset.我是 python 的新手,我正在处理一项要求,列出分类列中的所有唯一值以及每个值的频率和列中每个值的频率百分比,并使用 for 循环在完整的数据集上执行它。 Also I am not sure if I have to use pd.Series to append data into a dataframe as per the screenshot attached because the length of the columns are different based on the unique values in a column.此外,我不确定是否必须使用 pd.Series 根据附加的屏幕截图将数据附加到数据框中,因为列的长度因列中的唯一值而异。
Appreciate your help.感谢你的帮助。
The below is the code I tried to work out but I am not able to workout on the other columns for unique value and % of frequency and create it as a data frame so that I can export it to CSV下面是我尝试解决的代码,但我无法在其他列上锻炼以获得唯一值和频率百分比并将其创建为数据框,以便我可以将其导出为 CSV
Count_df = []
for item in df.columns:
Count_df_ = pd.DataFrame(df1[item].value_counts())
Count_df.append(Count_df_)
Count_dfdf = pd.DataFrame(Count_df)
Count_dfdf
Count_dfdf.to_csv(path_or_buf = Output + '_' + 'Count_.csv')
The input and Output expected is as below and the same is attached as an :预期的输入和输出如下,并附加为:
[Input data and expected Output ][1] [输入数据和预期输出][1]
Thanks in advance提前致谢
No magic.没有魔法。 Just append the output DataFrame column-by-column patiently.只需耐心地逐列附加输出数据帧。
Here I assume a 4-columned output in a single .csv
file.在这里,我假设单个.csv
文件中有 4 列输出。 Based on personal work experience, this format is more handy than separate files for spreadsheet softwares.根据个人工作经验,这种格式比电子表格软件的单独文件更方便。 However, separated output is also possible within in the loop.然而,在循环内也可以分离输出。
Code :代码:
import pandas as pd
# please provide copy-able sample data next time
df = pd.DataFrame(
data={
"Name": ["A", "B", "C", "C", "A", "F"],
"col2": [True, False, False, False, False, True],
"col3": [1, 2, 3, 1, 1, 3],
}
)
# Construct an empty dataframe with convenient column order.
# The ordering can be adjusted later on.
df_ans = pd.DataFrame(
data={
"var_name": [],
"var_count": [],
"var_freq": [],
"col_name": [],
}
)
# process each column
for col in df.columns:
# get variable name and count
df_col_count = df[col].value_counts().to_frame().reset_index()
# rename columns
df_col_count.columns = ["var_name", "var_count"]
# compute frequency
df_col_count["var_freq"] = df_col_count["var_count"] / df_col_count["var_count"].sum()
# append column name
df_col_count["col_name"] = col
# sort (optional)
# (1) by name
df_col_count.sort_values(by="var_name", inplace=True)
# (2) by descending frequency
# df_col_count.sort_values(by="var_freq", ascending=False, inplace=True)
# append
df_ans = df_ans.append(df_col_count)
# For separated CSV output, output here (and "col_name" can be removed)
#df_col_count.to_csv(f"/path/to/{col}_freq.csv")
# reorder columns
df_ans = df_ans[["col_name", "var_name", "var_count", "var_freq"]]
# reindex
df_ans.reset_index(drop=True, inplace=True)
# write csv
# df_ans.to_csv(f"/path/to/all_freq.csv")
Output输出
# Each column (variable) is sorted by name.
df_ans
Out[12]:
col_name var_name var_count var_freq
0 Name A 2.0 0.333333
1 Name B 1.0 0.166667
2 Name C 2.0 0.333333
3 Name F 1.0 0.166667
4 col2 False 4.0 0.666667
5 col2 True 2.0 0.333333
6 col3 1 3.0 0.500000
7 col3 2 1.0 0.166667
8 col3 3 2.0 0.333333
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.