How can I structure this dataset for analysis & visualization? (Some columns contain lists instead of single values - Python & Pandas)
Question: How can I improve either my method ("expand_traits", posted below) or the data structure I am trying to use? I estimate the runtime of my solution at a few hours, which suggests I went badly wrong somewhere: collecting all of the data takes about 10 minutes, but transforming it into something I can analyze may take a few hours.
I have collected some data that is essentially a Pandas DataFrame, where some columns in the table are a list of lists (technically stored as strings, so I evaluate them with ast.literal_eval(column), if that's relevant).
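For readers following along, here is a minimal sketch of that parsing step applied to a whole column at once. The toy frame and its values are assumptions made up for illustration; only ast.literal_eval and the "Traits" column name come from the post.

```python
import ast
import pandas as pd

# Toy frame (hypothetical data): the list column is stored as strings,
# as described in the question.
df = pd.DataFrame({
    "Summoner": ["name1", "name2"],
    "Traits": ["['7 Innovator', '2 Scholar']", "['1 Cuddly', '2 Scholar']"],
})

# Parse every string into a real Python list in one pass over the column.
df["Traits"] = df["Traits"].apply(ast.literal_eval)

print(type(df.loc[0, "Traits"]))
```

Parsing once up front means later steps can work with real lists instead of re-evaluating strings inside a loop.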
To explain the context a bit:
The data contains historical stats from the League of Legends TFT game mode. I want to group by each item in the lists and see how each one performs on average. I can only really think of doing this with tables, something like df.groupby(by='Trait').mean() to get the average win-rate for each trait, but I am open to other ideas.
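A rough sketch of what that group-by could look like once the data has one row per trait. The long-format frame is made-up sample data, and using "Rank" as the performance measure is an assumption, since the post does not show a win-rate column.

```python
import pandas as pd

# Hypothetical long-format data: one row per (match, trait) pair.
long_df = pd.DataFrame({
    "Rank":  [1, 1, 2, 2],
    "Trait": ["Innovator", "Scholar", "Chemtech", "Scholar"],
})

# Select the numeric column explicitly so text columns are not averaged.
avg_rank = long_df.groupby("Trait")["Rank"].mean()
print(avg_rank)
```

Selecting the column before calling mean() keeps the result to a single Series and avoids problems with non-numeric columns such as Summoner.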
Here is an example of the dataset:
| Rank | Summoner | Traits | Units |
|---|---|---|---|
| 1 | name1 | ['7 Innovator', '1 Transformer', '3 Enchanter', '2 Socialite', '2 Clockwork', '2 Scholar', '2 Scrap'] | ['Ezreal', 'Singed', 'Zilean', 'Taric', 'Heimerdinger', 'Janna', 'Orianna', 'Seraphine', 'Jayce'] |
| 2 | name2 | ['1 Cuddly', '1 Glutton', '5 Mercenary', '4 Bruiser', '6 Chemtech', '2 Scholar', '1 Socialite', '2 Twinshot'] | ['Illaoi', 'Gangplank', 'MissFortune', 'Lissandra', 'Zac', 'Urgot', 'DrMundo', 'TahmKench', 'Yuumi', 'Viktor'] |
The table has approximately 40,000 records (which doesn't sound like much), but my original idea was to "unpivot" the nested lists so that each item gets its own record.
My idea looks something like this:
| Summoner | Trait | Record_ID |
|---|---|---|
| name1 | 7 Innovator | id_1 |
| name1 | 1 Transformer | id_1 |
| ... | ... | ... |
| name2 | 1 Cuddly | id_2 |
| name2 | 1 Glutton | id_2 |
Because of the number of items in each list, this transformation will turn my ~40,000 records into a few hundred thousand.
Another thing to note is that because this transformation is specific to each column that contains lists, I would need to perform it separately (as far as I know) on each column. Here is the current code I am using to do this for the "Traits" column, which takes my computer around 35 minutes to complete (a pretty average PC: nothing crazy, roughly an Intel i5 with 16 GB of RAM).

    import ast
    import pandas as pd

    def expand_traits(traits_df):
        traits_df_expanded = pd.DataFrame()
        for i in range(len(traits_df)):
            # Traits are stored as strings, so parse each one into a list
            traits = ast.literal_eval(traits_df.Traits[i])
            for trait in traits:
                record = {
                    'Summoner': traits_df.Summoner[i],
                    'Trait': trait,
                    'match_id': str(traits_df.match_id[i])
                }
                traits_df_expanded = traits_df_expanded.append(record, ignore_index=True)
        return traits_df_expanded

Is this approach logical, or am I missing something here?
I can't imagine this being the optimal method, and I might also have gone wrong somewhere in my expand_traits method.
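One likely cause of the long runtime is that DataFrame.append returns a new frame on every call and copies all existing rows, so the loop does more work the larger the result gets (the method was also deprecated in pandas 1.4 and removed in 2.0). A minimal sketch of the same loop that collects plain dicts and builds the frame once is below; the function name expand_traits_fast and the sample data are hypothetical, chosen only for illustration.

```python
import ast
import pandas as pd

def expand_traits_fast(traits_df):
    """Same output shape as expand_traits, but builds the frame only once."""
    records = []
    for summoner, traits, match_id in zip(
        traits_df["Summoner"], traits_df["Traits"], traits_df["match_id"]
    ):
        # Traits are stored as strings, so parse each one into a list.
        for trait in ast.literal_eval(traits):
            records.append(
                {"Summoner": summoner, "Trait": trait, "match_id": str(match_id)}
            )
    # One DataFrame construction instead of one append per record.
    return pd.DataFrame(records, columns=["Summoner", "Trait", "match_id"])

# Hypothetical sample input mirroring the columns used in the question.
sample = pd.DataFrame({
    "Summoner": ["name1", "name2"],
    "Traits": ["['7 Innovator', '2 Scholar']", "['1 Cuddly', '1 Glutton']"],
    "match_id": [101, 102],
})
expanded = expand_traits_fast(sample)
print(expanded)
```

Appending to a Python list is cheap, so the cost of this version grows roughly in line with the number of output rows.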
Use explode (the Traits column must hold actual lists, so parse the strings with ast.literal_eval first):

    cols = ['Summoner', 'Traits', 'Record_ID']
    out = df.assign(Record_ID='id_' + df['Rank'].astype(str))[cols] \
            .explode('Traits', ignore_index=True) \
            .rename(columns={'Traits': 'Trait'})
    print(out)

    # Output:
       Summoner          Trait Record_ID
    0     name1    7 Innovator      id_1
    1     name1  1 Transformer      id_1
    2     name1    3 Enchanter      id_1
    3     name1    2 Socialite      id_1
    4     name1    2 Clockwork      id_1
    5     name1      2 Scholar      id_1
    6     name1        2 Scrap      id_1
    7     name2       1 Cuddly      id_2
    8     name2      1 Glutton      id_2
    9     name2    5 Mercenary      id_2
    10    name2       4 Bruiser      id_2
    11    name2     6 Chemtech      id_2
    12    name2      2 Scholar      id_2
    13    name2    1 Socialite      id_2
    14    name2     2 Twinshot      id_2

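The explode call above assumes the Traits column already holds lists. Since the question says the lists are stored as strings, a fuller sketch might parse them first, explode each list column into its own long table, and split values such as "7 Innovator" into a count and a name. The sample data, the frame names traits_long and units_long, and the "Count" column are all assumptions added for illustration.

```python
import ast
import pandas as pd

# Hypothetical raw data: list columns stored as strings, as in the question.
raw = pd.DataFrame({
    "Rank": [1, 2],
    "Summoner": ["name1", "name2"],
    "Traits": ["['7 Innovator', '2 Scholar']", "['6 Chemtech', '2 Scholar']"],
    "Units": ["['Ezreal', 'Jayce']", "['Zac', 'Viktor', 'Urgot']"],
})

# Turn the string columns into real lists so explode has something to expand.
for col in ["Traits", "Units"]:
    raw[col] = raw[col].apply(ast.literal_eval)

# The lists differ in length per column, so each gets its own long table.
traits_long = raw[["Rank", "Summoner", "Traits"]].explode("Traits", ignore_index=True)
units_long = raw[["Rank", "Summoner", "Units"]].explode("Units", ignore_index=True)

# Split "7 Innovator" into a numeric count and the trait name.
parts = traits_long["Traits"].str.split(" ", n=1, expand=True)
traits_long["Count"] = parts[0].astype(int)
traits_long["Trait"] = parts[1]
traits_long = traits_long.drop(columns="Traits")

print(traits_long)
print(units_long)
```

Keeping traits and units in separate long tables avoids pairing every trait with every unit, and the separate Count column makes it possible to group by trait name alone.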