How can I remove duplicates in a list and change their corresponding values in another list (by index position) to the mean?
Remove duplicates in one list and average corresponding list entries of another list
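I have a fairly large dataset, so I need to write something efficient. My data holds the release years of albums by various artists in one list, and the average song length of each album in another list.
As an example, here is some made-up data. Song lengths are in minutes.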
release_year=[2017,2017,2019,2020,2020,2021]
avg_songlength=[3,5,3,4,2,3]
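I want a dataset in which the duplicates are removed from the release_year list and, for each duplicated year, the song lengths are averaged again. So the result I want is:
years_without_duplicates=[2017,2019,2020,2021]
avg_length_of_year=[(3+5)/2,3,(4+2)/2,3]
I found that set() removes duplicates efficiently, but I don't know how to carry that over to the other list. Is there a simple way to do this?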
One option is to use itertools.groupby:
from itertools import groupby
from statistics import mean

release_year=[2017,2017,2019,2020,2020,2021]
avg_songlength=[3,5,3,4,2,3]

# groupby only merges adjacent equal keys, so the pairs are sorted first
years_without_duplicates, avg_length_of_year = zip(*(
    (k, mean(list(zip(*g))[1])) for k, g in
    groupby(sorted(zip(release_year, avg_songlength)),
            lambda x: x[0]))
)
years_without_duplicates, avg_length_of_year
# ((2017, 2019, 2020, 2021), (4, 3, 3, 3))
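The one-liner above can be hard to read, so here is a minimal sketch of the same groupby idea unrolled into a function. The name average_by_key is hypothetical (it is not from the answer), and the input is deliberately unsorted to show why the sort step matters:

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean


def average_by_key(keys, values):
    """Return (unique_keys, means), with the keys in ascending order."""
    # groupby only merges *adjacent* equal keys, so sort by key first.
    pairs = sorted(zip(keys, values), key=itemgetter(0))
    unique_keys, means = [], []
    for key, group in groupby(pairs, key=itemgetter(0)):
        unique_keys.append(key)
        means.append(mean(value for _, value in group))
    return unique_keys, means


# Unsorted made-up data, same values as in the question.
years, lengths = average_by_key(
    [2020, 2017, 2019, 2017, 2021, 2020],
    [4, 3, 3, 5, 3, 2],
)
print(years)    # [2017, 2019, 2020, 2021]
print(lengths)  # [4, 3, 3, 3]
```

Sorting costs O(n log n), so for very large inputs the dictionary-based approaches may be the cheaper choice.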
Or use collections.defaultdict:
from collections import defaultdict

out = defaultdict(lambda : [0, 0]) # sum / count
for year, sl in zip(release_year, avg_songlength):
    out[year][0] += sl # add length
    out[year][1] += 1 # increment counter of occurrences
d = {k: v[0]/v[1] for k,v in out.items()} # avg = sum / count
years_without_duplicates, avg_length_of_year = zip(*d.items())
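One detail worth knowing about the defaultdict version: a dict keeps keys in first-seen order (guaranteed since Python 3.7), so with unsorted input the years are not returned in ascending order. The sketch below illustrates this with reshuffled made-up data. The names totals, first_seen and by_year are illustrative only:

```python
from collections import defaultdict

# Same made-up values as in the question, but deliberately unsorted.
release_year = [2020, 2017, 2019, 2017, 2021, 2020]
avg_songlength = [4, 3, 3, 5, 3, 2]

totals = defaultdict(lambda: [0, 0])  # year -> [sum of lengths, count]
for year, length in zip(release_year, avg_songlength):
    totals[year][0] += length
    totals[year][1] += 1

# Keys come out in the order each year was first seen.
first_seen = {year: total / count for year, (total, count) in totals.items()}
# Sort the items if ascending years are needed.
by_year = dict(sorted(first_seen.items()))

print(list(first_seen))  # [2020, 2017, 2019, 2021]
print(by_year)           # {2017: 4.0, 2019: 3.0, 2020: 3.0, 2021: 3.0}
```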
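Here is a simple way to go about it in base Python. The idea is to store the years we have seen in a dictionary, keeping track of the total song length and the number of albums that contributed to that total. At the end we go over the keys of the dictionary and convert each entry to its average length. Using a dictionary also keeps the data more structured than two separate lists.
release_year=[2017,2017,2019,2020,2020,2021]
avg_songlength=[3,5,3,4,2,3]
year_averages = dict()
for year, length in zip(release_year, avg_songlength):
    if year in year_averages:
        year_averages[year][0] += length  # running total
        year_averages[year][1] += 1       # number of entries
    else:
        year_averages[year] = [length, 1]
year_averages = {year: lst[0]/lst[1] for year, lst in year_averages.items()}
print(year_averages)
Output: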
{2017: 4.0, 2019: 3.0, 2020: 3.0, 2021: 3.0}
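If the two parallel lists from the question are still needed, the dictionary can be split back apart. This is a small sketch that starts from the output shown above. Unlike unpacking with zip(*d.items()), it also works when the dictionary is empty:

```python
# Result of the dictionary approach, copied from the output above.
year_averages = {2017: 4.0, 2019: 3.0, 2020: 3.0, 2021: 3.0}

# Keys and values are iterated in the same order, so the lists stay aligned.
years_without_duplicates = list(year_averages)
avg_length_of_year = list(year_averages.values())

print(years_without_duplicates)  # [2017, 2019, 2020, 2021]
print(avg_length_of_year)        # [4.0, 3.0, 3.0, 3.0]
```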
Convert the data to a pandas DataFrame and use np.mean as the aggregation function:
import pandas as pd
import numpy as np
df = pd.DataFrame({"release_year":[2017,2017,2019,2020,2020,2021],"avg_song_length":[3,5,3,4,2,3]})
print(df)
print(df.groupby("release_year",as_index=False).agg(avg_length_of_year=("avg_song_length",np.mean)))
Here is a simple approach that uses one dictionary to store the sum of the values for each year and another to count how many values were added.
avg_dict = {}
count_dict = {}
for i in range(0, len(release_year)):
    if str(release_year[i]) in avg_dict:
        avg_dict[str(release_year[i])] = avg_dict[str(release_year[i])] + avg_songlength[i]
        count_dict[str(release_year[i])] = count_dict[str(release_year[i])] + 1
    else:
        avg_dict[str(release_year[i])] = avg_songlength[i]
        count_dict[str(release_year[i])] = 1
for key in avg_dict:
    avg_dict[key] = avg_dict[key] / count_dict[key]
print(avg_dict) # {'2017': 4.0, '2019': 3.0, '2020': 3.0, '2021': 3.0}
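The same two-dictionary idea can be written more compactly with dict.get, which also lets the years stay integers instead of being converted to strings. This is a sketch of a variant, not the answer's own code, and the function name mean_by_year is made up for illustration:

```python
def mean_by_year(years, lengths):
    """Average the lengths per year using one dict for sums and one for counts."""
    sums, counts = {}, {}
    for year, length in zip(years, lengths):
        # dict.get supplies 0 the first time a year is seen.
        sums[year] = sums.get(year, 0) + length
        counts[year] = counts.get(year, 0) + 1
    return {year: sums[year] / counts[year] for year in sums}


release_year = [2017, 2017, 2019, 2020, 2020, 2021]
avg_songlength = [3, 5, 3, 4, 2, 3]

result = mean_by_year(release_year, avg_songlength)
print(result)  # {2017: 4.0, 2019: 3.0, 2020: 3.0, 2021: 3.0}
```

Iterating with zip instead of range(len(...)) avoids the repeated index lookups and stops at the shorter list if the two inputs ever differ in length.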