Aggregate values of same-name pandas DataFrame columns into a single column
I have multiple CSV files that were produced by tokenizing code. These files contain keywords in both uppercase and lowercase. I would like to merge all of those files into a single DataFrame that contains every unique keyword in lowercase, with the values of the case-variant columns summed. What would you suggest to get the result below?
Initial DF:
+---+---+----+-----+
| a | b | A | B |
+---+---+----+-----+
| 1 | 2 | 3 | 1 |
| 2 | 1 | 3 | 1 |
+---+---+----+-----+
Result:
+---+---+
| a | b |
+---+---+
| 4 | 3 |
| 5 | 2 |
+---+---+
I don't have access to the raw data from which the CSV files were created, so I cannot correct this at an earlier step. So far I have tried mapping .lower() onto the headers of the DataFrame that I create, but that returns separate columns with the same name instead of a single combined column.
Using pandas is not essential. I have thought of converting the CSV files to dictionaries and then trying the above procedure (which turned out to be much more complicated than I thought), or using lists. Also, group by does not seem to do the job, as it appears to remove non-duplicate column names. Any approach is welcome.
The solution below should do it:
import pandas as pd
import numpy as np

np.random.seed(seed=1902)
test_df = pd.DataFrame({
    # some random data to work with
    'a': np.random.randint(9, size=5),
    'b': np.random.randint(9, size=5),
    'A': np.random.randint(9, size=5),
    'B': np.random.randint(9, size=5),
    'c': np.random.randint(9, size=5),
})

sum_df = test_df.copy()
# unique lowercase names, kept in order of first appearance
columns_to_keep = list(dict.fromkeys(name.lower() for name in test_df.columns))
for column_name in columns_to_keep:
    # the lowercase name and its uppercase counterpart, whichever exist
    mutual_columns = [column_name, column_name.upper()]
    mutual_columns = [value for value in mutual_columns if value in test_df.columns]
    sum_df[column_name] = test_df[mutual_columns].sum(axis=1)
sum_df = sum_df[columns_to_keep]

print("original is:\n", test_df)
print("sum is:\n", sum_df)
producing:
original is:
   a  b  A  B  c
0  2  5  7  2  4
1  1  6  2  3  1
2  0  4  2  4  3
3  6  5  5  7  4
4  1  0  2  7  5
sum is:
    a   b  c
0   9   7  4
1   3   9  1
2   2   8  3
3  11  12  4
4   3   7  5
Basically, for each lowercase name, build a list of the columns to sum (the column itself plus its uppercase counterpart, whichever of the two exist) and sum row by row across only those columns.
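As a more compact variant of the same idea, the sketch below groups the transposed frame by the lowercase form of each column label and sums each group. This is an alternative technique, not the method used in the answer above. It assumes every column label is a string. The data is the example from the question, and the extra column `c` is a hypothetical addition to show that a column with no uppercase counterpart is kept.

```python
import pandas as pd

# Example data from the question, plus a hypothetical unpaired column 'c'
df = pd.DataFrame({
    'a': [1, 2],
    'b': [2, 1],
    'A': [3, 3],
    'B': [1, 1],
    'c': [7, 8],
})

# After transposing, the column labels become the index, so groupby can
# apply str.lower to each label and sum the rows that share a result.
# Transposing back restores the original orientation.
result = df.T.groupby(str.lower).sum().T

print(result)
```

Because the grouping key is the fully lowercased label, this also merges mixed-case variants such as `Cc` and `cC`, which a check against `name.upper()` alone would not catch.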
You could iterate through the columns, summing those that share the same lowercase representation:
import pandas as pd
import numpy as np

def sumDupeColumns(df):
    """Return dataframe with columns with the same lowercase spelling summed."""
    # Get list of unique lowercase column headers
    # (the order is arbitrary because it comes from a set)
    columns = list(set(map(str.lower, df.columns)))
    # Create new (zero-initialised) dataframe for output, sharing df's index
    df1 = pd.DataFrame(data=np.zeros((len(df), len(columns))),
                       columns=columns, index=df.index)
    # Sum matching columns
    for col in df.columns:
        df1[col.lower()] += df[col]
    return df1

np.random.seed(seed=42)
# Generate DataFrame with random int input and 'duplicate' columns to sum
df = pd.DataFrame(columns=['a', 'A', 'b', 'B', 'Cc', 'cC', 'd', 'eEe', 'eeE', 'Eee'],
                  data=np.random.randint(9, size=(5, 10)))
df = sumDupeColumns(df)
>>> print(df)
     d   eee   cc     a     b
0  6.0  14.0  8.0   9.0  11.0
1  7.0  10.0  5.0  14.0   7.0
2  3.0  14.0  8.0   5.0   8.0
3  3.0  17.0  7.0   8.0  12.0
4  0.0  11.0  9.0   5.0   9.0
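Since the question says pandas is not essential and mentions converting the CSV files to dictionaries, here is a minimal standard-library sketch of that route. It assumes every cell holds an integer count and that rows from the different files are simply collected one after another. The helper name `sum_case_insensitive` and the two in-memory "files" are hypothetical stand-ins for illustration, not part of the original post.

```python
import csv
import io
from collections import Counter


def sum_case_insensitive(csv_file):
    """Return one dict per CSV row, with values summed under lowercase keys."""
    rows = []
    for row in csv.DictReader(csv_file):
        totals = Counter()
        for key, value in row.items():
            # 'a' and 'A' both accumulate into 'a'
            totals[key.lower()] += int(value)
        rows.append(dict(totals))
    return rows


# Hypothetical in-memory stand-ins for two of the CSV files
file_1 = io.StringIO("a,b,A,B\n1,2,3,1\n2,1,3,1\n")
file_2 = io.StringIO("a,C,c\n4,1,1\n")

merged_rows = []
for handle in (file_1, file_2):
    merged_rows.extend(sum_case_insensitive(handle))

print(merged_rows)
# [{'a': 4, 'b': 3}, {'a': 5, 'b': 2}, {'a': 4, 'c': 2}]
```

With real files, each `io.StringIO(...)` would be replaced by an open file handle. Keywords missing from a given file are simply absent from that file's row dicts.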