How to structure this dataset for analysis and visualization? (Some columns contain lists rather than single values - Python & Pandas)

Question

Question: How can I improve either my method ("expand_traits", posted below) or the data structure I am trying to use? I estimate the runtime of my solution to be a few hours, which suggests I went very wrong somewhere, considering that it takes about 10 minutes to collect all of the data and possibly a few hours to transform it into something I can analyze.

I have collected some data that is essentially a Pandas DataFrame, where some columns hold lists. The lists are technically stored as strings, so I evaluate them with ast.literal_eval(column), in case that's relevant.

To explain the context a bit:

The data contains historical stats from the League of Legends TFT game mode. I want to group by each item in the lists and see how the items perform on average. I can only really think of doing this in terms of tables, something like df.groupby(by='Trait').mean() to get the average win-rate for each trait, but I am open to other ideas.

Here is an example of the dataset:

Rank	Summoner	Traits	Units
1	name1	['7 Innovator', '1 Transformer', '3 Enchanter', '2 Socialite', '2 Clockwork', '2 Scholar', '2 Scrap']	['Ezreal', 'Singed', 'Zilean', 'Taric', 'Heimerdinger', 'Janna', 'Orianna', 'Seraphine', 'Jayce']
2	name2	['1 Cuddly', '1 Glutton', '5 Mercenary', '4 Bruiser', '6 Chemtech', '2 Scholar', '1 Socialite', '2 Twinshot']	['Illaoi', 'Gangplank', 'MissFortune', 'Lissandra', 'Zac', 'Urgot', 'DrMundo', 'TahmKench', 'Yuumi', 'Viktor']

The table has approximately 40,000 records in total (which doesn't sound like much), and my original idea was to "unpivot" the nested lists so that each list item gets its own record.

My idea looks something like this:

Summoner	Trait	Record_ID
name1	7 Innovator	id_1
name1	1 Transformer	id_1
...	...	...
name2	1 Cuddly	id_2
name2	1 Glutton	id_2

Because of the number of items in each list, this transformation will turn my ~40,000 records into a few hundred thousand.

Another thing to note is that this transformation is specific to each column that contains lists, so (as far as I know) I would need to perform it separately on each of those columns. Here is the code I am currently using to do this for the "Traits" column. It takes my computer around 35 minutes to complete (a pretty average PC, nothing crazy, roughly an Intel i5 with 16 GB of RAM).

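import ast
import pandas as pd

def expand_traits(traits_df):
    traits_df_expanded = pd.DataFrame()
    for i in range(len(traits_df)):
        # Traits are stored as strings, so parse each one into a list
        traits = ast.literal_eval(traits_df.Traits[i])
        for trait in traits:
            record = {
                'Summoner': traits_df.Summoner[i],
                'Trait': trait,
                'match_id': str(traits_df.match_id[i])
                }
            # one new row per trait, appended to the growing frame
            traits_df_expanded = traits_df_expanded.append(record, ignore_index=True)
    return traits_df_expanded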

Is this approach logical, or am I missing something here?

I can't imagine this is the optimal method, and I may also have gone wrong somewhere in my expand_traits method.
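
A likely cause of the long runtime is the append call inside the inner loop: each call builds a new DataFrame containing every row collected so far, so the total work grows roughly quadratically with the number of output rows. (DataFrame.append was also deprecated in pandas 1.4 and removed in 2.0.) The following is a minimal sketch, not the asker's code, that keeps the same loop but collects plain dicts and builds the frame once. The function name expand_traits_fast and the sample values are made up for illustration.

```python
import ast
import pandas as pd

def expand_traits_fast(traits_df):
    """Hypothetical variant of expand_traits that builds the frame once."""
    records = []
    # itertuples avoids a label lookup such as traits_df.Traits[i] per row
    for row in traits_df.itertuples(index=False):
        # Traits is assumed to be a string like "['7 Innovator', ...]"
        for trait in ast.literal_eval(row.Traits):
            records.append({
                'Summoner': row.Summoner,
                'Trait': trait,
                'match_id': str(row.match_id),
            })
    # a single DataFrame construction instead of one append per trait
    return pd.DataFrame(records, columns=['Summoner', 'Trait', 'match_id'])

# Tiny made-up sample; the match_id values are placeholders.
sample = pd.DataFrame({
    'Summoner': ['name1', 'name2'],
    'Traits': ["['7 Innovator', '1 Transformer']", "['1 Cuddly', '1 Glutton']"],
    'match_id': [1, 2],
})
expanded = expand_traits_fast(sample)
print(expanded)
```

Appending to a Python list is cheap, and the DataFrame is only constructed once at the end, so this version should scale roughly linearly with the number of traits.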

Answer 1

Use explode:

cols = ['Summoner', 'Traits', 'Record_ID']
out = df.assign(Record_ID='id_' + df['Rank'].astype(str))[cols] \
        .explode('Traits', ignore_index=True) \
        .rename(columns={'Traits': 'Trait'})
print(out)

# Output:
   Summoner          Trait Record_ID
0     name1    7 Innovator      id_1
1     name1  1 Transformer      id_1
2     name1    3 Enchanter      id_1
3     name1    2 Socialite      id_1
4     name1    2 Clockwork      id_1
5     name1      2 Scholar      id_1
6     name1        2 Scrap      id_1
7     name2       1 Cuddly      id_2
8     name2      1 Glutton      id_2
9     name2    5 Mercenary      id_2
10    name2      4 Bruiser      id_2
11    name2     6 Chemtech      id_2
12    name2      2 Scholar      id_2
13    name2    1 Socialite      id_2
14    name2     2 Twinshot      id_2
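
Note that explode only splits real list objects. The question says the lists are stored as strings, so they probably need to be parsed first. Below is a minimal sketch under that assumption, which also separates the leading count from the trait name and computes a per-trait average. The sample data is made up, the Count column is a hypothetical addition, and Rank is used here as a stand-in for whatever performance metric the real dataset holds.

```python
import ast
import pandas as pd

# Made-up sample mirroring the question's table; Traits is stored as a string.
df = pd.DataFrame({
    'Rank': [1, 2],
    'Summoner': ['name1', 'name2'],
    'Traits': ["['7 Innovator', '2 Scholar']", "['1 Cuddly', '2 Scholar']"],
})

# Parse each string into a real list so explode has something to split.
df['Traits'] = df['Traits'].apply(ast.literal_eval)

# One row per (record, trait) pair.
long_df = (
    df.explode('Traits', ignore_index=True)
      .rename(columns={'Traits': 'Trait'})
)

# Split "7 Innovator" into a numeric count and the trait name (assumed format).
parts = long_df['Trait'].str.split(' ', n=1, expand=True)
long_df['Count'] = parts[0].astype(int)
long_df['Trait'] = parts[1]

# Average rank per trait, regardless of how many units of the trait were active.
avg_rank = long_df.groupby('Trait')['Rank'].mean()
print(avg_rank)
```

The same parse-then-explode step can be repeated for the Units column. Keeping the two long tables separate, rather than exploding both columns in one frame, avoids multiplying every trait by every unit.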

Answer 1
2 Accepted 2021-12-20 20:17:46
