
Subtract multiple columns between two dataframes with different shapes based on multiple columns

I'm looking at the following three datasets from JHU:

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv

They all have the following form:

 'Province/State'   'Country/Region'   'Lat'   'Long'   '1/22/20'   '1/23/20' ...
        NaN               Italy          x       y          0           0

I want to calculate the number of active cases per province, country, and day using the formula active = confirmed - (recovered + deaths).

Previously the datasets had the same shape, so I could do the following:

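df_active = df_confirmed.copy()
# The first four columns are identifiers; columns 4 onward hold the daily counts
df_active.iloc[:, 4:] = df_confirmed.iloc[:, 4:] - (df_recovered.iloc[:, 4:] + df_deaths.iloc[:, 4:])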

Now they no longer cover the same countries, and they do not always have the same number of date columns.

So I need to do the following:

1) Determine which date columns all 3 DFs have in common.

2) Where the province and country columns match, compute active = confirmed - (recovered + deaths).

For point 1) I can do the following:

# Append each DataFrame's column count (shape[1]) to a list
df_shape_list = []
df_shape_list.append(df_confirmed.shape[1])
...
min_common_columns = min(df_shape_list)

So I need to subtract columns 4:min_common_columns, but how do I do that only where the province and country columns match across all 3 DFs?
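As a side note on point 1), equal column counts do not guarantee that the date labels themselves are the same. A minimal sketch that compares the labels instead, assuming three toy frames (`a`, `b`, `c`) and the helper names `id_cols` and `common_dates`, none of which come from the original post:

```python
import pandas as pd

id_cols = ["Province/State", "Country/Region", "Lat", "Long"]

# Hypothetical empty frames that differ only in which date columns they carry
a = pd.DataFrame(columns=id_cols + ["1/22/20", "1/23/20", "1/24/20"])
b = pd.DataFrame(columns=id_cols + ["1/22/20", "1/23/20"])
c = pd.DataFrame(columns=id_cols + ["1/23/20", "1/24/20"])

# Keep the date labels of `a` that also appear in `b` and `c`, in `a`'s order
common_dates = [col for col in a.columns
                if col not in id_cols and col in b.columns and col in c.columns]

print(common_dates)
```

Here only 1/23/20 is present in all three frames, even though `b` and `c` have the same number of columns.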

Consider using melt to transform the wide data into long format, then merge on location and date. Then run the needed formula:

from functools import reduce
import pandas as pd

df_confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                           "csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

df_deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                        "csse_covid_19_time_series/time_series_covid19_deaths_global.csv")

df_recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                           "csse_covid_19_time_series/time_series_covid19_recovered_global.csv")


# MELT EACH DF IN LIST COMPREHENSION
df_list = [df.melt(id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long'],
                   var_name = 'Date', value_name = val) 
           for df, val in zip([df_confirmed, df_deaths, df_recovered], 
                              ['confirmed', 'deaths', 'recovered'])]

# CHAIN MERGE
df_long = reduce(lambda x,y: pd.merge(x, y, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date']),
                 df_list)

# SIMPLE ARITHMETIC
df_long['active'] = df_long['confirmed'] - (df_long['recovered'] + df_long['deaths'])
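To see why this handles mismatched shapes, here is a small sketch on made-up data, assuming pandas' default inner join in merge. The toy frames and the names `keys`, `frames`, and `toy_long` are hypothetical: `recovered` is missing country B and the 1/24/20 column, so only the rows common to all three frames survive.

```python
from functools import reduce
import pandas as pd

keys = ["Province/State", "Country/Region", "Lat", "Long"]

# Hypothetical toy data; 'recovered' lacks country B and the 1/24/20 column
confirmed = pd.DataFrame({"Province/State": [None, None], "Country/Region": ["A", "B"],
                          "Lat": [1.0, 2.0], "Long": [3.0, 4.0],
                          "1/22/20": [10, 20], "1/23/20": [15, 25], "1/24/20": [18, 30]})
deaths = pd.DataFrame({"Province/State": [None, None], "Country/Region": ["A", "B"],
                       "Lat": [1.0, 2.0], "Long": [3.0, 4.0],
                       "1/22/20": [1, 0], "1/23/20": [2, 1], "1/24/20": [3, 1]})
recovered = pd.DataFrame({"Province/State": [None], "Country/Region": ["A"],
                          "Lat": [1.0], "Long": [3.0],
                          "1/22/20": [3], "1/23/20": [5]})

frames = {"confirmed": confirmed, "deaths": deaths, "recovered": recovered}

# Melt each frame to long format, naming the value column after the frame
long_frames = [df.melt(id_vars=keys, var_name="Date", value_name=name)
               for name, df in frames.items()]

# An inner join keeps only location/date pairs present in every frame
toy_long = reduce(lambda left, right: pd.merge(left, right, on=keys + ["Date"], how="inner"),
                  long_frames)
toy_long["active"] = toy_long["confirmed"] - (toy_long["recovered"] + toy_long["deaths"])

print(toy_long)
```

Only country A on 1/22/20 and 1/23/20 remains. Passing how="outer" instead would keep the unmatched rows, with NaN in the missing value columns.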

Output (sorted by active, descending):

df_long.sort_values(['active'], ascending=False).head(10)

#       Province/State Country/Region      Lat     Long     Date  confirmed  deaths  recovered  active
# 15229            NaN             US  37.0902 -95.7129  3/27/20     101657    1581        869   99207
# 14998            NaN             US  37.0902 -95.7129  3/26/20      83836    1209        681   81946
# 15141            NaN          Italy  43.0000  12.0000  3/27/20      86498    9134      10950   66414
# 14767            NaN             US  37.0902 -95.7129  3/25/20      65778     942        361   64475
# 14910            NaN          Italy  43.0000  12.0000  3/26/20      80589    8215      10361   62013
# 14679            NaN          Italy  43.0000  12.0000  3/25/20      74386    7503       9362   57521
# 14448            NaN          Italy  43.0000  12.0000  3/24/20      69176    6820       8326   54030
# 14536            NaN             US  37.0902 -95.7129  3/24/20      53740     706        348   52686
# 15205            NaN          Spain  40.0000  -4.0000  3/27/20      65719    5138       9357   51224
# 14217            NaN          Italy  43.0000  12.0000  3/23/20      63927    6077       7024   50826
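If the wide layout with one column per date is still needed, the long result can be reshaped back. A rough sketch, using four rows copied from the output above in a hypothetical frame `toy`. Filling the missing Province/State with an empty string is an assumption made here so that no index level contains NaN during the reshape.

```python
import pandas as pd

# Four (country, date, active) rows taken from the output above
toy = pd.DataFrame({
    "Province/State": [None, None, None, None],
    "Country/Region": ["US", "US", "Italy", "Italy"],
    "Date": ["3/26/20", "3/27/20", "3/26/20", "3/27/20"],
    "active": [81946, 99207, 62013, 66414],
})

# Replace missing provinces so the index levels contain no NaN
toy["Province/State"] = toy["Province/State"].fillna("")

# Parse the dates so the resulting columns sort chronologically
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%y")

# One row per location, one column per date
wide = (toy.set_index(["Province/State", "Country/Region", "Date"])["active"]
           .unstack("Date"))

print(wide)
```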
