How can I structure this dataset for analysis & visualization? (Some columns contain lists instead of single values - Python & Pandas)

Question: How can I improve either my method ("expand_traits", posted below) or the data structure I am trying to use? I estimate the runtime of my solution at a few hours, which suggests I went badly wrong somewhere: collecting all of the data takes about 10 minutes, yet transforming it into something I can analyze may take hours.

I have collected some data that is essentially a Pandas DataFrame, where some columns are effectively a list of lists (each cell holds a list, technically stored as a string, so I parse it with ast.literal_eval(column), if that's relevant).
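
A minimal sketch of that parsing step, assuming the list columns arrive as strings such as "['7 Innovator', '1 Transformer']". The sample frame and the names raw and parsed are made up for illustration:

```python
import ast

import pandas as pd

# Hypothetical sample: the list column is stored as its string representation.
raw = pd.DataFrame({
    "Summoner": ["name1", "name2"],
    "Traits": ["['7 Innovator', '1 Transformer']", "['1 Cuddly', '1 Glutton']"],
})

# Parse each string into a real Python list once, up front.
parsed = raw.assign(Traits=raw["Traits"].apply(ast.literal_eval))

print(type(parsed.loc[0, "Traits"]))
```

Parsing the whole column once means later steps can work with real lists instead of calling ast.literal_eval row by row.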

To explain the context a bit:

The data contains historical stats from the League of Legends TFT game mode. I want to group by each item in the lists and see how the items perform on average. I can only really think of doing this with tables, something like df.groupby(by='Trait').mean() to get the average win-rate for each trait, but I am open to other ideas.

Here is an example of the dataset:

|   | Rank | Summoner | Traits | Units |
|---|------|----------|--------|-------|
| 1 | 1 | name1 | ['7 Innovator', '1 Transformer', '3 Enchanter', '2 Socialite', '2 Clockwork', '2 Scholar', '2 Scrap'] | ['Ezreal', 'Singed', 'Zilean', 'Taric', 'Heimerdinger', 'Janna', 'Orianna', 'Seraphine', 'Jayce'] |
| 2 | 2 | name2 | ['1 Cuddly', '1 Glutton', '5 Mercenary', '4 Bruiser', '6 Chemtech', '2 Scholar', '1 Socialite', '2 Twinshot'] | ['Illaoi', 'Gangplank', 'MissFortune', 'Lissandra', 'Zac', 'Urgot', 'DrMundo', 'TahmKench', 'Yuumi', 'Viktor'] |

The table has approximately 40,000 records (which doesn't sound like much), but my original idea was to "unpivot" the nested lists so that each item gets its own record.

My idea looks something like:

| Summoner | Trait | Record_ID |
|----------|-------|-----------|
| name1 | 7 Innovator | id_1 |
| name1 | 1 Transformer | id_1 |
| ... | ... | ... |
| name2 | 1 Cuddly | id_2 |
| name2 | 1 Glutton | id_2 |

Because of the number of items in each list, this transformation will turn my ~40,000 records into a few hundred thousand.
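
The size of the unpivoted table can be checked before doing the transformation by summing the list lengths. A small sketch, assuming the column already holds parsed lists; the sample data is hypothetical:

```python
import pandas as pd

# Hypothetical sample with already-parsed list columns.
df = pd.DataFrame({
    "Summoner": ["name1", "name2"],
    "Traits": [["7 Innovator", "1 Transformer", "3 Enchanter"],
               ["1 Cuddly", "1 Glutton"]],
})

# One output row per list item, so the row count is the sum of list lengths.
items_per_row = df["Traits"].map(len)
expected_rows = int(items_per_row.sum())
print(expected_rows)  # 5
```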

Another thing to note: because this transformation is specific to each column that contains lists, I would need to perform it separately (as far as I know) on each such column. Here is the code I am currently using for the "Traits" column, which takes my computer around 35 minutes to complete (a fairly average PC, nothing crazy, roughly an Intel i5 with 16 GB of RAM).

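import ast

import pandas as pd

def expand_traits(traits_df):
    traits_df_expanded = pd.DataFrame()
    for i in range(len(traits_df)):
        # Traits is stored as a string, so parse it into a list first
        traits = ast.literal_eval(traits_df.Traits[i])
        for trait in traits:
            record = {
                'Summoner': traits_df.Summoner[i],
                'Trait': trait,
                'match_id': str(traits_df.match_id[i])
            }
            # DataFrame.append was removed in pandas 2.0; pd.concat is the
            # equivalent call (it still copies the whole frame every time)
            traits_df_expanded = pd.concat(
                [traits_df_expanded, pd.DataFrame([record])], ignore_index=True)
    return traits_df_expanded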

Is this approach logical, or am I missing something here?

I can't imagine this is the optimal method. I may also have gone wrong somewhere in my expand_traits method.
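
On where the time likely goes: growing a DataFrame one row at a time inside the loop copies the accumulated frame on every iteration, so the total work grows roughly quadratically with the number of output rows. A sketch that keeps the same loop logic but collects plain dicts and builds the frame once; the name expand_traits_fast and the sample data are illustrative, not part of the original code:

```python
import ast

import pandas as pd


def expand_traits_fast(traits_df):
    """Same output columns as expand_traits, but builds the frame once."""
    records = []
    # itertuples avoids repeated label-based lookups on each column.
    for row in traits_df.itertuples(index=False):
        for trait in ast.literal_eval(row.Traits):
            records.append({
                "Summoner": row.Summoner,
                "Trait": trait,
                "match_id": str(row.match_id),
            })
    return pd.DataFrame(records, columns=["Summoner", "Trait", "match_id"])


# Hypothetical sample in the same string-encoded format as the question.
sample = pd.DataFrame({
    "Summoner": ["name1", "name2"],
    "Traits": ["['7 Innovator', '1 Transformer']", "['1 Cuddly']"],
    "match_id": [101, 102],
})
expanded = expand_traits_fast(sample)
print(expanded)
```

Appending to a Python list is cheap, and the single pd.DataFrame(...) call at the end allocates the result only once.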

Use explode (if the Traits column still holds strings, convert it to lists with ast.literal_eval first):

cols = ['Summoner', 'Traits', 'Record_ID']
out = df.assign(Record_ID='id_' + df['Rank'].astype(str))[cols] \
        .explode('Traits', ignore_index=True) \
        .rename(columns={'Traits': 'Trait'})
print(out)

# Output:
   Summoner          Trait Record_ID
0     name1    7 Innovator      id_1
1     name1  1 Transformer      id_1
2     name1    3 Enchanter      id_1
3     name1    2 Socialite      id_1
4     name1    2 Clockwork      id_1
5     name1      2 Scholar      id_1
6     name1        2 Scrap      id_1
7     name2       1 Cuddly      id_2
8     name2      1 Glutton      id_2
9     name2    5 Mercenary      id_2
10    name2      4 Bruiser      id_2
11    name2     6 Chemtech      id_2
12    name2      2 Scholar      id_2
13    name2    1 Socialite      id_2
14    name2     2 Twinshot      id_2
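
To connect this back to the analysis goal, here is a hedged end-to-end sketch: parse the string columns, explode each list column into its own long table, and aggregate per trait. The sample frame, the Count column, and the choice of averaging Rank are assumptions for illustration, not part of the original data or answer:

```python
import ast

import pandas as pd

# Hypothetical sample in the question's format: lists stored as strings.
df = pd.DataFrame({
    "Rank": [1, 2],
    "Summoner": ["name1", "name2"],
    "Traits": ["['7 Innovator', '2 Scholar']", "['6 Chemtech', '2 Scholar']"],
    "Units": ["['Ezreal', 'Jayce']", "['Zac', 'Urgot', 'Viktor']"],
})

# explode needs real lists, so parse the string columns first.
for col in ["Traits", "Units"]:
    df[col] = df[col].apply(ast.literal_eval)

# Explode each list column into its own long table, since traits and units
# are independent lists of different lengths.
traits = df[["Rank", "Summoner", "Traits"]].explode("Traits", ignore_index=True)
units = df[["Rank", "Summoner", "Units"]].explode("Units", ignore_index=True)

# Split "7 Innovator" into a numeric count and the trait name.
traits[["Count", "Trait"]] = traits["Traits"].str.split(" ", n=1, expand=True)
traits["Count"] = traits["Count"].astype(int)

# Average Rank per trait name.
avg_rank = traits.groupby("Trait")["Rank"].mean()
print(avg_rank)
```

Keeping one long table per list column avoids pairing every trait with every unit, and each table can still be joined back to the original rows through Rank, Summoner, or a record ID.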
