
How to merge multiple raw input CSVs with pandas containing similar columns with slightly different names?

I wrote some code to combine multiple CSVs that are read with pandas and appended to one combined CSV.

The issue I have is that the CSV files are delivered by multiple parties (monthly) and often differ in their column names, while they essentially contain the same information. For instance:

CSV 1

| ID | Instance number |
| -------- | -------------- |
| 1 | 401421 |
| 2 | 420138 |

CSV 2

| ID | Instance NO |
| -------- | -------------- |
| 1 | 482012 |
| 2 | 465921 |

This results in two columns in the combined file, Instance number and Instance NO, unless I rename the column beforehand, whereas the idea is to process all files automatically without any manual intervention.

The solution that should work is to use combine_first or fillna, but next time the column may be entered as, e.g., Instance No/number.
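For reference, the combine_first approach mentioned above can be sketched as follows. This is only an illustration: the `combined` frame is made-up data modeled on the two example CSVs, and it assumes the variant column names are already known.

```python
import pandas as pd

# Hypothetical combined frame: each source file filled only its own
# variant of the instance column, leaving the other one empty (NaN).
combined = pd.DataFrame({
    "ID": [1, 2, 1, 2],
    "Instance number": [401421, 420138, None, None],
    "Instance NO": [None, None, 482012, 465921],
})

# Keep "Instance number" where present, otherwise fall back to "Instance NO".
combined["Instance number"] = combined["Instance number"].combine_first(
    combined["Instance NO"]
)
combined = combined.drop(columns="Instance NO")
```

Note that the merged column ends up as floats here, because the missing values force a float dtype before the columns are coalesced.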

Since improving data delivery isn't an option, is there any smart way to solve issues like this without having to write out all possible variations and remap them to one leading column?

Thanks in advance!

I think you first need a dictionary of all possible names, to which you can quickly add a new variant whenever you encounter one, and then use it to rename the columns. For example:

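# Map each canonical column name to the variants seen in incoming files.
general_dict = {'SLNO': ['Sl No', 'SNo']}

# all_df is the DataFrame whose columns need to be standardized.
col_list = all_df.columns.to_list()
rename_dict = {}

for col in col_list:
    for key, val in general_dict.items():
        if col in val:
            rename_dict[col] = key
            break  # stop at the first canonical name listing this variant

all_df.rename(columns=rename_dict, inplace=True)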
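One way to keep such a dictionary short is to normalize names before the lookup, so that differences in case, spacing, or punctuation do not each need their own entry. The following is a rough sketch under that assumption; the `normalize` helper, the `lookup` table, and the sample `all_df` are hypothetical and not part of the answer above. It will not catch genuinely different wording such as "number" versus "NO", which still has to be listed explicitly.

```python
import re
import pandas as pd

general_dict = {"SLNO": ["Sl No", "SNo"]}

def normalize(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Flat lookup from normalized variant (and normalized canonical name)
# to the canonical name.
lookup = {
    normalize(alias): key
    for key, aliases in general_dict.items()
    for alias in aliases + [key]
}

# Hypothetical input whose header is not listed verbatim in general_dict.
all_df = pd.DataFrame({"sl_no": [1, 2], "Name": ["a", "b"]})

rename_dict = {
    col: lookup[normalize(col)]
    for col in all_df.columns
    if normalize(col) in lookup
}
all_df = all_df.rename(columns=rename_dict)
```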

The short answer is no, as you're asking the computer to think for itself. You do, however, have multiple options to deal with common scenarios.

If the column order and positions are fixed, you can pass header=0, names=['ID', 'Instance'] to pd.read_csv to ignore the headers sent in the file and use your own known names instead.
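A minimal sketch of that option, assuming every file has exactly two columns in the same order. The in-memory `csv_1` and `csv_2` buffers stand in for the delivered files, and `CANONICAL_COLUMNS` is an illustrative name.

```python
import io
import pandas as pd

CANONICAL_COLUMNS = ["ID", "Instance"]

# In-memory stand-ins for two delivered files with differing headers.
csv_1 = io.StringIO("ID,Instance number\n1,401421\n2,420138\n")
csv_2 = io.StringIO("ID,Instance NO\n1,482012\n2,465921\n")

# header=0 marks the first row as a header to be discarded;
# names= supplies the column names to use in its place.
frames = [
    pd.read_csv(src, header=0, names=CANONICAL_COLUMNS)
    for src in (csv_1, csv_2)
]
combined = pd.concat(frames, ignore_index=True)
```

This only works while the column positions stay fixed; a file with reordered columns would be read without complaint but with mislabeled data.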

You can also create a config file that maps all possible wrong header names to the right one.
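Such a config file could look roughly like the sketch below. The JSON layout, the `CONFIG_TEXT` stand-in for a file's contents, and the `load_alias_map` and `standardize_columns` helpers are all assumptions made for illustration. Raising on unmapped headers is a design choice, so that a new variant is noticed and added to the config rather than silently producing an extra column.

```python
import json
import pandas as pd

# Stand-in for the contents of a hypothetical JSON config file:
# canonical name -> list of header variants seen so far.
CONFIG_TEXT = """
{
  "ID": ["ID"],
  "Instance number": ["Instance number", "Instance NO", "Instance No/number"]
}
"""

def load_alias_map(config_text):
    """Invert {canonical: [aliases]} into {alias: canonical}."""
    config = json.loads(config_text)
    return {
        alias: canonical
        for canonical, aliases in config.items()
        for alias in aliases
    }

def standardize_columns(df, alias_map):
    """Rename known aliases; fail loudly on headers the config lacks."""
    unknown = [col for col in df.columns if col not in alias_map]
    if unknown:
        raise ValueError(f"Unmapped column names: {unknown}")
    return df.rename(columns=alias_map)

alias_map = load_alias_map(CONFIG_TEXT)
df = pd.DataFrame({"ID": [1, 2], "Instance NO": [482012, 465921]})
standardized = standardize_columns(df, alias_map)
```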

If the columns are in the same order in all the files, you could try it like this:

  1. Pre-define the column names up front.
  2. Assign those column names to every file as soon as it is read, then concatenate the dataframes.
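import pandas as pd

data1 = pd.read_csv('data/data1.csv')
data2 = pd.read_csv('data/data2.csv')

# Overwrite the delivered headers with the pre-defined names.
data1.columns = ['A', 'B', 'C']
data2.columns = ['A', 'B', 'C']

# Stack the rows of both files into one dataframe.
combined = pd.concat([data1, data2], axis=0)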

