
Pandas Dataframe grouping / combining columns?

I'm new to Pandas, and I'm having a horrible time figuring out datasets. I have a CSV file that I've read in with pandas.read_csv as dogData, and it looks as follows:

[image: screenshot of the input data]

The column names are dog breeds, the first row [0] gives the size of the dogs, and below that there's a bunch of numerical values. The very first column has a string description that I need to keep, but it isn't relevant to the question. The last column for each size category contains separate "Average" values. (Note that pandas changed the duplicate "Average" columns to "Average.1", "Average.2" and so on, to keep the column names unique.)

Basically, I want to "group" by the first row: all "small" dog values should be averaged, except the "small" average column, and so on. The result would look something like this:

[image: screenshot of the desired result]

The existing "Average" columns should not be included in the new average being calculated, and they don't need to be altered at all. All "small" breed values should be averaged, all "medium" breed values should be averaged, and so on (the actual file is much larger than the sample I showed here).

There's no guarantee the breeds won't change, and no guarantee the sizes will remain the same or always be included ("Small" could be left out, for example).

EDIT: After Joe Ferndz's comment, I've updated my code and have something slightly closer to working, but actually adding the columns is still giving me trouble.

dogData = pd.read_csv("dogdata.csv", header=[0,1])
dogData.columns = dogData.columns.map("_".join)

totalVal = ""
count = 0

for col in dogData:
    if "Unnamed" in col:
        continue  # to skip starting columns
    if "Average" not in col:
        totalVal += dogData[col]
        count += 1
    else:
        # this is where I'd calculate average, then reset count and totalVal
        # right now, because the addition isn't working, I haven't figured that out
        break

print(totalVal)

Now, this code is technically getting the correct values, but it won't let me add them numerically (which is why totalVal is a string right now). It gives me a string of concatenated numbers, the correct numbers, but it won't let me convert them to floats to actually add them.

I've tried doing float(dogData[col]) on the totalVal addition line, but it gives me a TypeError: cannot convert the series to <class float>

I've tried keeping it as a string, putting "," between the numbers, then calling totalVal.split(",") to separate them before converting and adding, but that doesn't work either: AttributeError: 'Series' has no attribute 'split'

These errors make sense to me and I understand why they happen, but I don't know what the correct method is. dogData[col] gives me all the values for every row at once, which is what I want, but I don't know how to store that and add to it in the next iteration of the loop.
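A minimal sketch of one way to keep the loop and still add numerically: start the accumulator at the number 0 instead of an empty string. Adding a Series to 0 yields a Series, and every later addition is then element-wise, row by row. The small frame below is a hypothetical stand-in built from the "Small" columns of the sample data, not the real file.

```python
import pandas as pd

# Hypothetical stand-in for the flattened dogData columns
# (values taken from the Small group of the sample data).
dogData = pd.DataFrame({
    "Corgi_Small": [1, 2, 2, 1],
    "Yorkie_Small": [3, 2, 1, 4],
    "Pug_Small": [3, 4, 5, 4],
    "Average_Small": [3, 4, 3, 2],
})

totalVal = 0  # numeric start value: 0 + Series gives a Series
count = 0

for col in dogData:
    if "Average" in col:
        continue  # leave the existing Average column out of the sum
    totalVal = totalVal + dogData[col]  # element-wise addition, one value per row
    count += 1

rowMeans = totalVal / count
print(rowMeans.round(2).tolist())  # [2.33, 2.67, 2.67, 3.0]
```

float() accepts a single value rather than a whole Series, which is why the conversion attempt fails. If a column has been read in as text, pd.to_numeric(dogData[col]) or dogData[col].astype(float) converts the whole column at once.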

Here's a copy/pastable sample of the data:

,Corgi,Yorkie,Pug,Average,Average,Dalmation,German Shepherd,Average,Great Dane,Average
,Small,Small,Small,Small,Medium,Large,Large,Large,Very Large,Very Large
Words,1,3,3,3,2.4,3,5,7,7,7
Words1,2,2,4,4,2.2,4,4,6,8,8
Words2,2,1,5,3,2.5,5,3,8,9,6
Words3,1,4,4,2,2.7,6,6,5,6,9

You have to do a few tricks to get this to work.

Step 1: Read the CSV file and use the first two rows as the header. This creates a MultiIndex for the columns.

Step 2: Join the two header levels together with a separator, say an underscore (_).

Step 3: Rename the specific columns as per your requirement, e.g. S-Average, M-Average, ....

Step 4: Find out how many columns have dog name + Small.

Step 5: Compute the value for Small. Per your requirement, that is sum(columns with Small) / count(columns with Small).

Steps 6, 7: Do the same for Large.

Steps 8, 9: Do the same for Very Large.

This will give you the final list. If you want the columns in a specific order, you can change the order.

Step 10: Change the column order of the dataframe.

import pandas as pd

# Steps 1-2: read the first two rows as a two-level header (first column becomes
# the index), then flatten the levels, e.g. ('Corgi', 'Small') -> 'Corgi_Small'
df = pd.read_csv('abc.txt', header=[0, 1], index_col=0)
df.columns = df.columns.map('_'.join)

# Step 3: rename the existing Average columns so they are not averaged again
df.rename(columns={'Average_Small': 'S-Average',
                   'Average_Medium': 'M-Average',
                   'Average_Large': 'L-Average',
                   'Average_Very Large': 'Very L-Average'}, inplace=True)

# Steps 4-5: average the breed columns ending in '_Small', then drop them
idx = [i for i, x in enumerate(df.columns) if x.endswith('_Small')]
if idx:
    df['Small'] = (df.iloc[:, idx].sum(axis=1) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

# Steps 6-7: same for Large
idx = [i for i, x in enumerate(df.columns) if x.endswith('_Large')]
if idx:
    df['Large'] = (df.iloc[:, idx].sum(axis=1) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

# Steps 8-9: same for Very Large
idx = [i for i, x in enumerate(df.columns) if x.endswith('_Very Large')]
if idx:
    df['Very_Large'] = (df.iloc[:, idx].sum(axis=1) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

# Step 10: reorder the columns (raises KeyError if any listed column is missing)
df = df[['Small', 'S-Average', 'M-Average', 'L-Average', 'Very L-Average', 'Large', 'Very_Large']]

print(df)

The output of this will be:

        Small  S-Average  M-Average  ...  Very L-Average  Large  Very_Large
Words    2.33          3        2.4  ...               7    4.0         7.0
Words1   2.67          4        2.2  ...               8    4.0         8.0
Words2   2.67          3        2.5  ...               6    4.0         9.0
Words3   3.00          2        2.7  ...               9    6.0         6.0
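Since the question notes that sizes may change or be missing, a variation that does not hard-code the size names may be worth considering. The following is a sketch of an alternative, not the method used above: it keeps the two-level header as a MultiIndex, sets the existing Average columns aside, and averages the remaining breed columns per size label. The inline `sample` string (the data from the question), the `_mean` suffix and the `Average_` prefix are assumptions made for illustration.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
sample = """,Corgi,Yorkie,Pug,Average,Average,Dalmation,German Shepherd,Average,Great Dane,Average
,Small,Small,Small,Small,Medium,Large,Large,Large,Very Large,Very Large
Words,1,3,3,3,2.4,3,5,7,7,7
Words1,2,2,4,4,2.2,4,4,6,8,8
Words2,2,1,5,3,2.5,5,3,8,9,6
Words3,1,4,4,2,2.7,6,6,5,6,9
"""

# Two header rows -> MultiIndex columns of (breed, size); first column is the index.
df = pd.read_csv(io.StringIO(sample), header=[0, 1], index_col=0)

# Boolean mask marking the pre-existing Average columns (level 0 of the header).
is_avg = df.columns.get_level_values(0) == "Average"

# Average the breed columns per size (level 1). Transposing lets groupby work on
# rows; sort=False keeps the sizes in their order of first appearance.
breeds = df.loc[:, ~is_avg]
computed = breeds.T.groupby(level=1, sort=False).mean().T.round(2)
computed.columns = [f"{size}_mean" for size in computed.columns]

# Keep the existing averages untouched, labelled by size only.
existing = df.loc[:, is_avg].droplevel(0, axis=1).add_prefix("Average_")

result = pd.concat([computed, existing], axis=1)
print(result)
```

Because the size labels are read from the header rather than listed in the code, a size with no breed columns (Medium in this sample) simply produces no computed column, and its existing Average column is carried over unchanged.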
