How to merge multiple raw input CSVs with pandas containing similar columns with slightly different names?
I wrote some code to combine multiple CSVs that are read with pandas and appended to one combined CSV.
The issue I have is that the CSV files are delivered monthly by multiple parties and often differ in their column names, even though they contain essentially the same information. For instance:
CSV 1

| ID | Instance number |
| -------- | -------------- |
| 1 | 401421 |
| 2 | 420138 |
CSV 2

| ID | Instance NO |
| -------- | -------------- |
| 1 | 482012 |
| 2 | 465921 |
This results in two columns in the combined file, Instance number and Instance NO, unless I rename the column beforehand, whereas the idea is to process all files automatically without manual intervention.
A solution that should work is to use combine_first or fillna, but next time the column may be delivered as, e.g., Instance No/number.
Since improving data delivery isn't an option, is there any smart way to solve issues like this without having to write out all possible variations and remap them to one leading column?
Thanks in advance!
I think you first need a dictionary of all possible names, which you can quickly extend whenever you get a new one, and then use it to rename the columns. For example:
# Map each canonical name to the variants seen so far.
general_dict = {'SLNO': ['Sl No', 'SNo']}

# all_df is the DataFrame read from one incoming file.
rename_dict = {}
for col in all_df.columns.to_list():
    for key, val in general_dict.items():
        if col in val:
            rename_dict[col] = key
            break
all_df.rename(columns=rename_dict, inplace=True)
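One way to keep such a dictionary short is to normalise the headers before the lookup, so that differences in case, spacing and punctuation do not each need their own entry. The following is only a sketch: the names `ALIASES`, `normalise` and `canonicalise` are hypothetical, the frames reuse the values from the question, and the third frame is a made-up row added to show the `Instance No/number` variant.

```python
import re
import pandas as pd

# Canonical name -> normalised variants seen so far (illustrative entries).
ALIASES = {
    "ID": ["id"],
    "Instance number": ["instancenumber", "instanceno", "instancenonumber"],
}

def normalise(name):
    """Lower-case a header and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", str(name).lower())

# Reverse lookup: normalised variant -> canonical name.
LOOKUP = {variant: canon for canon, variants in ALIASES.items() for variant in variants}

def canonicalise(df):
    """Rename known columns; raise on unknown ones so new variants get noticed."""
    unknown = [c for c in df.columns if normalise(c) not in LOOKUP]
    if unknown:
        raise KeyError(f"Unmapped column names: {unknown}")
    return df.rename(columns={c: LOOKUP[normalise(c)] for c in df.columns})

csv1 = pd.DataFrame({"ID": [1, 2], "Instance number": [401421, 420138]})
csv2 = pd.DataFrame({"ID": [1, 2], "Instance NO": [482012, 465921]})
# Made-up row, only here to exercise a third header variant.
csv3 = pd.DataFrame({"id": [3], "Instance No/number": [400001]})

combined = pd.concat([canonicalise(d) for d in (csv1, csv2, csv3)], ignore_index=True)
print(combined)
```

Raising on an unmapped header is a design choice: a genuinely new variant still needs a human to add it, but the failure is loud instead of silently producing an extra column.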
The short answer is no, as you're asking the computer to think for itself. You do, however, have multiple options to deal with common scenarios.
If the column order and/or positions are fixed, you can pass header=0, names=['ID', 'Instance'] to read_csv to ignore the headers sent in the file and use known names instead.
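As a small illustration of that option, the sketch below reads an in-memory stand-in for a delivered file (the `io.StringIO` object replaces a real path and is an assumption made so the example runs on its own):

```python
import io
import pandas as pd

# Stand-in for a delivered file; io.StringIO replaces a real path here.
raw = io.StringIO("ID,Instance NO\n1,482012\n2,465921\n")

# header=0 marks the first row as the header so it is discarded,
# and names= supplies the column names to use instead.
df = pd.read_csv(raw, header=0, names=["ID", "Instance"])
print(df)
```

Without `header=0`, passing `names` alone would treat the original header line as a data row.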
You can also create a config file that maps all possible wrong header names to the right one.
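A config-driven mapping could look roughly like this. The JSON layout and the file name `column_map.json` are assumptions for illustration; the text is inlined here so the sketch runs without a file on disk.

```python
import json
import pandas as pd

# Contents of a hypothetical column_map.json, inlined so the sketch runs as-is.
CONFIG_TEXT = """
{
  "Instance NO": "Instance number",
  "Instance No/number": "Instance number"
}
"""

# Wrong header name -> right header name.
column_map = json.loads(CONFIG_TEXT)

df = pd.DataFrame({"ID": [1, 2], "Instance NO": [482012, 465921]})
# rename ignores keys that are not present, so one shared map can serve every file.
df = df.rename(columns=column_map)
print(df.columns.tolist())
```

Keeping the map outside the code means a new header variant can be handled by editing the config file rather than the script.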
If the columns are in the same order in all the files, you could try something like this:
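import pandas as pd

data1 = pd.read_csv('data/data1.csv')
data2 = pd.read_csv('data/data2.csv')
# Overwrite the headers by position so both frames share the same names.
data1.columns = ['A', 'B', 'C']
data2.columns = ['A', 'B', 'C']
# Stack the rows of both frames.
pd.concat([data1, data2], axis=0)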
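The same positional approach can be applied to any number of files in a loop. In this sketch the `EXPECTED` list and the in-memory sources are assumptions standing in for the real target names and file paths; the column-count check is added because positional renaming silently mislabels data if a file has an extra or missing column.

```python
import io
import pandas as pd

EXPECTED = ["ID", "Instance number"]  # assumed target names, in file order

# Stand-ins for the delivered files, using the values from the question.
sources = [
    io.StringIO("ID,Instance number\n1,401421\n2,420138\n"),
    io.StringIO("ID,Instance NO\n1,482012\n2,465921\n"),
]

frames = []
for src in sources:
    df = pd.read_csv(src)
    # Positional renaming is only safe if the column count matches.
    if len(df.columns) != len(EXPECTED):
        raise ValueError(f"Expected {len(EXPECTED)} columns, got {list(df.columns)}")
    df.columns = EXPECTED
    frames.append(df)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```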