Aggregate values of same name pandas dataframe columns to single column

I have multiple CSV files that were produced by tokenizing code. These files contain keywords in both uppercase and lowercase. I would like to merge all those files into a single dataframe in which each keyword appears once, in lowercase, with the values of its uppercase and lowercase columns summed. What would you suggest to get the result below?

Initial DF:

+---+---+----+-----+
| a | b |  A |  B  |
+---+---+----+-----+
| 1 | 2 |  3 |   1 |
| 2 | 1 |  3 |   1 |
+---+---+----+-----+

Result:

+---+---+
| a | b |
+---+---+
| 4 | 3 |
| 5 | 2 |
+---+---+

I don't have access to the raw data from which the CSV files were created, so I cannot correct this at an earlier step. So far I have tried mapping .lower() over the headers of the dataframe that I create, but that returns separate columns with the same name, like so:

(Image: the dataframe after applying .lower() to the headers)
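The behaviour described here can be reproduced with a minimal sketch, assuming the example frame from the question. Lowercasing the headers only renames them, so pandas keeps both columns under the same label:

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({"a": [1, 2], "b": [2, 1], "A": [3, 3], "B": [1, 1]})

# Lowercasing the headers only renames the columns; nothing is merged
df.columns = df.columns.str.lower()

print(list(df.columns))  # ['a', 'b', 'a', 'b']
# Selecting a duplicated label returns a DataFrame with both columns
print(df["a"].shape)     # (2, 2)
```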

Using pandas is not essential. I have thought of converting the CSV files to dictionaries and then trying the above procedure (which turned out to be much more complicated than I thought), or using lists. Also, group by does not do the job, as it will remove non-duplicate column names. Any approach is welcome.

The solution below should do it:

import pandas as pd
import numpy as np 

np.random.seed(seed=1902)

test_df = pd.DataFrame({
    # some ways to create random data
    'a': np.random.randint(9, size=5),
    'b': np.random.randint(9, size=5),
    'A': np.random.randint(9, size=5),
    'B': np.random.randint(9, size=5),
    'c': np.random.randint(9, size=5),
})

sum_df = test_df.copy()
# unique lowercase names, sorted so the output column order is deterministic
columns_to_keep = sorted({name.lower() for name in test_df.columns})

for column_name in columns_to_keep:
    # the lowercase name and its all-uppercase variant, keeping only those present
    mutual_columns = [column_name, column_name.upper()]
    mutual_columns = [value for value in mutual_columns if value in test_df.columns]
    sum_df[column_name] = test_df[mutual_columns].sum(axis=1)

sum_df = sum_df[columns_to_keep]
print("original is:\n", test_df)
print("sum is:\n", sum_df)

producing

original is:
    a  b  A  B  c
0  2  5  7  2  4
1  1  6  2  3  1
2  0  4  2  4  3
3  6  5  5  7  4
4  1  0  2  7  5

sum is:
     a   b  c
0   9   7  4
1   3   9  1
2   2   8  3
3  11  12  4
4   3   7  5

Basically, for each lowercase name, make a list of the matching columns to sum (the name itself and its uppercase counterpart, whichever of the two exist in the dataframe) and sum along the rows over those columns only.
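As a possible alternative (not part of the answer above), the same row-wise sum can be sketched by transposing the frame and grouping its row labels by their lowercase form. This assumes all columns are numeric; the column `c` is a hypothetical addition to the question's example, included to show that a name without a duplicate is kept:

```python
import pandas as pd

# Question's example frame plus a hypothetical column 'c' with no duplicate
df = pd.DataFrame({
    "a": [1, 2], "b": [2, 1],
    "A": [3, 3], "B": [1, 1],
    "c": [7, 8],
})

# After transposing, column names become row labels; groupby calls
# str.lower on each label, so 'a' and 'A' land in the same group.
result = df.T.groupby(str.lower).sum().T

print(result)
```

Transposing copies the data, so for very wide or very large frames the explicit loop may be the cheaper option.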

Code:

You could iterate through the columns, summing those that have the same lowercase representation:

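def sumDupeColumns(df):
    """Return dataframe with columns with the same lowercase spelling summed."""

    # Get unique lowercase column headers. Converted to a list because a set
    # may be rejected for `columns` by newer pandas versions; the set order is
    # arbitrary, so the column order of the result can vary between runs.
    columns = list(set(map(str.lower, df.columns)))
    # Create new (zero-initialised) dataframe for output, sharing df's index
    # so that the additions below align row by row
    df1 = pd.DataFrame(data=np.zeros((len(df), len(columns))),
                       index=df.index, columns=columns)

    # Sum matching columns
    for col in df.columns:
        df1[col.lower()] += df[col]

    return df1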

Example:

import pandas as pd
import numpy as np

np.random.seed(seed=42)

# Generate DataFrame with random int input and 'duplicate' columns to sum
df = pd.DataFrame(columns=['a','A','b','B','Cc','cC','d','eEe','eeE','Eee'],
                  data=np.random.randint(9, size=(5,10)))

df = sumDupeColumns(df)

>>> print(df)

     d   eee   cc     a     b
0  6.0  14.0  8.0   9.0  11.0
1  7.0  10.0  5.0  14.0   7.0
2  3.0  14.0  8.0   5.0   8.0
3  3.0  17.0  7.0   8.0  12.0
4  0.0  11.0  9.0   5.0   9.0
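To connect this back to the multiple CSV files in the question, here is a rough sketch of one way the pieces might fit together. The file contents are hypothetical in-memory stand-ins, `lower_and_sum` is an illustrative helper name, and it assumes that a keyword missing from a file should count as 0:

```python
import io
import pandas as pd

# Hypothetical stand-ins for the tokenizer CSV files
csv_texts = [
    "a,b,A\n1,2,3\n",
    "B,c\n4,5\n",
]

def lower_and_sum(frame):
    """Sum columns whose names differ only by letter case (illustrative helper)."""
    return frame.T.groupby(str.lower).sum().T

frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]

# Stack the per-file results; keywords absent from a file become NaN,
# which is treated as 0 here (an assumption)
merged = (
    pd.concat([lower_and_sum(f) for f in frames], ignore_index=True)
    .fillna(0)
    .astype(int)
)

print(merged)
```

With real files, `io.StringIO(text)` would be replaced by the file paths passed to `pd.read_csv`.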

