Make frequency table of unique rows in pandas dataframe containing missing values
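I want to create a frequency table for a pandas dataframe that contains missing values. Here is a sample dataframe with missing values to illustrate my problem: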
import pandas as pd
import numpy as np
car_names = pd.DataFrame({'name' : ['Batmobile','Toyota Corolla','Bike',
'Bike','Batmobile'],
'hp': [1000,120,np.nan,np.nan,900]})
car_attr = pd.DataFrame({"name": ["Bike","Toyota Corolla"],
"color": ["blue","red"]})
cars = car_names.merge(car_attr,how='left',on='name')
             name      hp color
0       Batmobile  1000.0   NaN
1  Toyota Corolla   120.0   red
2            Bike     NaN  blue
3            Bike     NaN  blue
4       Batmobile   900.0   NaN
My desired output is a frequency table showing that one combination of values occurs twice:
             name      hp color  count
0            Bike     NaN  blue      2
1       Batmobile  1000.0   NaN      1
2  Toyota Corolla   120.0   red      1
3       Batmobile   900.0   NaN      1
I tried groupby().size(), but this method excludes rows with missing values (i.e., every row in my dataframe except the second):
cars.groupby(['name','hp','color']).size()
name            hp     color
Toyota Corolla  120.0  red      1
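Another approach I tried was converting the pandas dataframe to a list of lists (each row is a list) and using list.index() to count occurrences of the unique rows, but I ran into this strange error: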
my_rows = cars.values.tolist()
my_rows_dedup = cars.drop_duplicates().values.tolist()
for x in my_rows:
    print(x)
    print('Row index: ', my_rows.index(x),
          ' Unique Index: ', my_rows_dedup.index(x))
['Batmobile', 1000.0, nan]
Row index: 0 Unique Index: 0
['Toyota Corolla', 120.0, 'red']
Row index: 1 Unique Index: 1
['Bike', nan, 'blue']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-100-f17351883e95> in <module>
4 for x in my_rows:
5 print(x)
----> 6 print('Row index: ', my_rows.index(x), ' Unique Index: ', my_rows_dedup.index(x))
ValueError: ['Bike', nan, 'blue'] is not in list
This error makes no sense to me, because ['Bike', nan, 'blue'] is a value in my_rows_dedup.
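A likely explanation, sketched in plain Python: NaN never compares equal to NaN, and list comparison only treats two NaNs as equal when they are the very same object. It seems probable that `values.tolist()` produces a fresh float object for each NaN coming from the float64 `hp` column, while the NaN in the object-dtype `color` column is one shared object, which would explain why the Batmobile row is found and the Bike row is not. The helper `row_key` below is hypothetical (it is not part of pandas or of the question); it maps NaN to None so that rows can be counted with `collections.Counter`.

```python
import math
from collections import Counter

nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

print(nan_a == nan_b)      # NaN is never equal to NaN
print([nan_a] == [nan_a])  # same object: list comparison checks identity first
print([nan_a] == [nan_b])  # different objects: falls back to ==, which fails


def row_key(row):
    """Hypothetical helper: replace NaN with None so rows hash and compare sanely."""
    return tuple(
        None if isinstance(v, float) and math.isnan(v) else v
        for v in row
    )


# The rows of the example frame, each NaN being its own float object.
rows = [
    ["Batmobile", 1000.0, float("nan")],
    ["Toyota Corolla", 120.0, "red"],
    ["Bike", float("nan"), "blue"],
    ["Bike", float("nan"), "blue"],
    ["Batmobile", 900.0, float("nan")],
]

counts = Counter(row_key(r) for r in rows)
for key, count in counts.most_common():
    print(key, count)
```

With the NaNs normalized, the two Bike rows collapse into a single key, so the counter behaves like the desired frequency table.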
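Well, if you want to handle the null values, you can fill them with a placeholder first, group, and then replace the placeholder afterwards to convert it back to NaN.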
x = 'g8h.|$hTdo+jC9^@'
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
(cars.fillna(x).groupby(['name','hp','color']).size().reset_index()
     .rename(columns={0 : 'count'}).replace(x,np.nan))
             name      hp color  count
0       Batmobile   900.0   NaN      1
1       Batmobile  1000.0   NaN      1
2            Bike     NaN  blue      2
3  Toyota Corolla   120.0   red      1
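For some this may be overkill, but here is a function along the lines of DataNovice's answer. I generate a random string to fill the NA values and make sure that string does not occur in my dataset. I also create a percent column and display the result with some formatting in a Jupyter notebook (hidden index, comma-separated counts, percentages).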
import random
import string
import pandas as pd
import numpy as np
from IPython.display import display

# Returns a frequency table for df that includes missing values
# columns (optional): a column name or a list of column names
# n limits the output to the first n rows
def tidy_count(df, columns=None, count_col='n', return_df=False, n=None):
    # subset df by columns if specified
    if columns is not None:
        if not isinstance(columns, list):
            columns = [columns]
        df = df[columns]
    col_vals = df.columns.values  # all column names
    # Find all distinct values in the dataset for both data values
    # and column names, store in the list allvals
    pds_allvals = pd.Series(col_vals)
    for col in col_vals:
        pds_allvals = pd.concat([pds_allvals, df[str(col)]]).drop_duplicates()
    allvals = pds_allvals.tolist()
    filler_val = None
    # generate a random string for filler_val until it is
    # not something that occurs in our data
    loop_iters = 1
    while filler_val is None or filler_val in allvals:
        filler_val = ''.join(random.choices(string.digits + string.ascii_letters,
                                            k=np.random.randint(4, 12)))
        loop_iters += 1
        if loop_iters > 8:
            print('Error: Cannot find unique string to fill NA values.')
            break
    # reset count column name until it is unique among column names
    loop_iters = 1
    while count_col in col_vals:
        if count_col == 'n':
            count_col = count_col + 'n'
        else:
            count_col = count_col + '_'
        if loop_iters > 8:
            print('Error: Cannot find unique string for counter column.')
            break
        loop_iters += 1
    # fill NaN with the filler, count each combination, then restore NaN
    freq_table = (df.fillna(filler_val)
                    .groupby([str(x) for x in col_vals]).size()
                    .reset_index(name=count_col)
                    .replace(filler_val, np.nan)
                    .sort_values(count_col, ascending=False))
    freq_table['percent'] = freq_table[count_col] / freq_table[count_col].sum()
    # limit frequency table to first n rows if specified
    if n is not None:
        freq_table = freq_table.head(n)
    # Formatted pandas display for jupyter notebooks
    # (Styler.hide(axis='index') needs pandas >= 1.4; older versions use hide_index())
    display(freq_table.style.hide(axis='index')
            .format({count_col: '{:,}', 'percent': '{0:.1%}'}))
    if return_df:
        return freq_table
Example using a dataset from the seaborn package:
## Import a dataset from seaborn and add missing values at random
import seaborn as sns
import numpy as np
import pandas as pd
planets = sns.load_dataset("planets")
planets_method = planets.sample(frac=.40,random_state=42)
planets_year = planets.sample(frac=.20,random_state=84)
planets_method['method'] = np.nan
planets_year['year'] = np.nan
planets_combi = pd.concat([
    planets.drop(planets_method.index.union(planets_year.index)),
    planets_method, planets_year]).sort_index()
## Use the function above to create a frequency table
# Note that it is cut off at row 7 (n=7) for display purposes
tidy_count(planets_combi,['method','year'],n=7)
method           year  n    percent
Radial Velocity  nan   104  9.2%
Transit          nan   82   7.3%
nan              2011  70   6.2%
Transit          2012  56   5.0%
Radial Velocity  2011  52   4.6%
nan              2013  48   4.3%
nan              2010  47   4.2%
You can work around the NaN problem by replacing the NaNs while grouping and restoring them afterwards. Something like this might do:
# Fill NaN with a sentinel, group by all columns and count duplicate rows
# (size() is used because no columns are left to aggregate once every column is a key)
counts = cars.fillna(-1).groupby(cars.columns.tolist()).size()
# Turn the group keys back into regular columns
cars_no_dups = counts.reset_index(name="count")
# Restore the missing values
cars_no_dups[cars_no_dups==-1] = np.nan
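For what it's worth, the fill-and-restore approaches above can probably be avoided on newer pandas. The sketch below assumes pandas 1.1 or later, where `groupby` accepts `dropna=False` and keeps groups whose keys contain NaN. The frame is rebuilt by hand here rather than through the merge from the question, so treat it as an illustration, not a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (same values as the merged `cars`).
cars = pd.DataFrame({
    "name": ["Batmobile", "Toyota Corolla", "Bike", "Bike", "Batmobile"],
    "hp": [1000, 120, np.nan, np.nan, 900],
    "color": [np.nan, "red", "blue", "blue", np.nan],
})

# dropna=False keeps rows whose group key contains NaN (assumes pandas >= 1.1).
freq = (
    cars.groupby(["name", "hp", "color"], dropna=False)
        .size()
        .reset_index(name="count")
        .sort_values("count", ascending=False, ignore_index=True)
)
print(freq)
```

No sentinel value is needed this way, so there is no risk of the placeholder colliding with real data, and the `hp` column keeps its float dtype.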