
Concatenate data in CSV files with overlapping data in columns

I have a couple of CSV files that contain vaccine data, such as these:

File 1

Entity,Code,Date,people_vaccinated
Wisconsin,,2021-01-12,125895
Wisconsin,,2021-01-13,125895
Wisconsin,,2021-01-14,135841
Wisconsin,,2021-01-15,151387
Wisconsin,,2021-01-19,188144
Wisconsin,,2021-01-20,193461
Wisconsin,,2021-01-21,204746
Wisconsin,,2021-01-22,221067
Wisconsin,,2021-01-23,241512
Wisconsin,,2021-01-24,260664
Wyoming,,2021-01-12,13577
Wyoming,,2021-01-13,14406
Wyoming,,2021-01-14,17310
Wyoming,,2021-01-15,19931
Wyoming,,2021-01-19,24788
Wyoming,,2021-01-20,25841
Wyoming,,2021-01-21,25841
Wyoming,,2021-01-22,29993
Wyoming,,2021-01-23,32746
Wyoming,,2021-01-24,35868

File 2

Entity,Code,Date,people_fully_vaccinated
Wisconsin,,2021-01-12,11343
Wisconsin,,2021-01-13,11343
Wisconsin,,2021-01-15,17108
Wisconsin,,2021-01-19,23641
Wisconsin,,2021-01-20,27312
Wisconsin,,2021-01-21,32268
Wisconsin,,2021-01-22,37901
Wisconsin,,2021-01-23,42229
Wisconsin,,2021-01-24,45641
Wyoming,,2021-01-12,2116
Wyoming,,2021-01-13,2559
Wyoming,,2021-01-15,2803
Wyoming,,2021-01-19,3242
Wyoming,,2021-01-20,3441
Wyoming,,2021-01-21,3441
Wyoming,,2021-01-22,4515
Wyoming,,2021-01-23,4773
Wyoming,,2021-01-24,4895

Not all the data (specifically dates going with locations) overlaps, but for the ones that do, how would I combine the last column? I'm guessing using pandas would be best, but I don't want to get stuck messing with a bunch of nested loops.

If you want to merge file1 with file2 while keeping only the records that appear in file1, use a left join:

import pandas as pd
# file1_df and file2_df are the DataFrame objects loaded from file1 and file2 (e.g. with pd.read_csv)
merged_df = pd.merge(file1_df, file2_df, how='left', on=['Entity', 'Code', 'Date'])

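Note: if you are familiar with set operations, you can get a right outer join, left outer join, inner join, or full outer join by changing the how parameter ('right', 'left', 'inner', 'outer') in the function call above. For example, how='inner' keeps only the rows whose Entity, Code, and Date appear in both files.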
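To make the effect of the how parameter concrete, here is a minimal sketch that uses a few Wisconsin rows from the question, loaded from inline strings instead of files. The variable names (file1_csv, file2_csv, keys, left_df, inner_df) are only illustrative. It relies on pandas matching missing key values with each other during a merge, so the empty Code column does not stop rows from pairing up.

```python
import io
import pandas as pd

# A small subset of the rows shown in the question (Wisconsin only).
file1_csv = """Entity,Code,Date,people_vaccinated
Wisconsin,,2021-01-12,125895
Wisconsin,,2021-01-13,125895
Wisconsin,,2021-01-14,135841
Wisconsin,,2021-01-15,151387
"""
file2_csv = """Entity,Code,Date,people_fully_vaccinated
Wisconsin,,2021-01-12,11343
Wisconsin,,2021-01-13,11343
Wisconsin,,2021-01-15,17108
"""

file1_df = pd.read_csv(io.StringIO(file1_csv))
file2_df = pd.read_csv(io.StringIO(file2_csv))
keys = ["Entity", "Code", "Date"]

# Left join: every row of file1 is kept; dates missing from file2 get NaN.
left_df = pd.merge(file1_df, file2_df, how="left", on=keys)

# Inner join: only rows whose keys appear in both files are kept.
inner_df = pd.merge(file1_df, file2_df, how="inner", on=keys)

print(left_df)
print(inner_df)
```

In this subset, 2021-01-14 exists only in file1, so the left join keeps that row with a missing people_fully_vaccinated value, while the inner join leaves it out.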

import pandas as pd
data1 = pd.read_csv('file1.csv')  # path of file1
data2 = pd.read_csv('file2.csv')  # path of file2
# this approach sums a single shared value column, so give file2's column the same name as file1's
data2 = data2.rename(columns={'people_fully_vaccinated': 'people_vaccinated'})
data1['Code'] = data1['Code'].fillna(0)  # replace NaN with 0 so groupby does not drop these rows
data2['Code'] = data2['Code'].fillna(0)  # replace NaN with 0 so groupby does not drop these rows
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows of both files instead
combined_data = pd.concat([data1, data2], ignore_index=True)
# group by location, code, and date, then add up the counts that share the same key
result = combined_data.groupby(['Entity', 'Code', 'Date'], as_index=False)['people_vaccinated'].sum()
print(result)

    Entity      Code Date        people_vaccinated
0   Wisconsin   0.0 12-01-2021  137238
1   Wisconsin   0.0 13-01-2021  137238
2   Wisconsin   0.0 14-01-2021  135841
3   Wisconsin   0.0 15-01-2021  168495
4   Wisconsin   0.0 19-01-2021  211785
5   Wisconsin   0.0 20-01-2021  220773
6   Wisconsin   0.0 21-01-2021  237014
7   Wisconsin   0.0 22-01-2021  258968
8   Wisconsin   0.0 23-01-2021  283741
9   Wisconsin   0.0 24-01-2021  306305
10  Wyoming     0.0 12-01-2021  15693
11  Wyoming     0.0 13-01-2021  16965
12  Wyoming     0.0 14-01-2021  17310
13  Wyoming     0.0 15-01-2021  22734
14  Wyoming     0.0 19-01-2021  28030
15  Wyoming     0.0 20-01-2021  29282
16  Wyoming     0.0 21-01-2021  29282
17  Wyoming     0.0 22-01-2021  34508
18  Wyoming     0.0 23-01-2021  37519
19  Wyoming     0.0 24-01-2021  40763
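The fillna(0) step matters because groupby drops rows whose key is NaN by default, and the Code column is empty in both files. As a hedged alternative, pandas 1.1 and later accept dropna=False in groupby, which keeps those rows without changing the data. The sketch below uses a few Wyoming rows from the question; the names data1, data2, keys, dropped, and kept are illustrative, and the rename of people_fully_vaccinated is an assumption made only so that both frames share one value column to sum.

```python
import io
import pandas as pd

# A few Wyoming rows from the question, loaded from inline strings.
data1 = pd.read_csv(io.StringIO(
    "Entity,Code,Date,people_vaccinated\n"
    "Wyoming,,2021-01-12,13577\n"
    "Wyoming,,2021-01-14,17310\n"
))
data2 = pd.read_csv(io.StringIO(
    "Entity,Code,Date,people_fully_vaccinated\n"
    "Wyoming,,2021-01-12,2116\n"
))

# Assumption for this sketch: treat both counts as one column so they can be summed.
data2 = data2.rename(columns={"people_fully_vaccinated": "people_vaccinated"})
combined = pd.concat([data1, data2], ignore_index=True)
keys = ["Entity", "Code", "Date"]

# Default behaviour: every row has Code = NaN, so every group is dropped.
dropped = combined.groupby(keys, as_index=False)["people_vaccinated"].sum()

# dropna=False keeps NaN as a valid group key, so no fillna is needed.
kept = combined.groupby(keys, as_index=False, dropna=False)["people_vaccinated"].sum()

print(len(dropped))
print(kept)
```

With dropna=False the Code column stays empty in the result instead of showing 0.0, which may be preferable if 0 could be mistaken for a real code.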
