Subtract multiple columns between two dataframes with different shapes based on multiple columns
I'm looking at the following three datasets from JHU:
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv
They have the following form:
'Province/State' 'Country/Region' 'Lat' 'Long' '1/22/20' '1/23/20' ...
NaN Italy x y 0 0
I want to calculate the number of active cases per province, country, and day using the formula active = confirmed - (recovered + deaths).
Previously, the datasets had the same shape, so I could do the following:
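df_active = df_confirmed.copy()
# columns 4 onward hold the per-date counts; the first four are Province/State, Country/Region, Lat, Long
df_active.iloc[:, 4:] = df_confirmed.iloc[:, 4:] - (df_recovered.iloc[:, 4:] + df_deaths.iloc[:, 4:])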
Now they do not contain data on the same countries, and do not always have the same number of date columns.
So I need to do the following:
1) Determine which date columns all 3 DataFrames have in common.
2) Where the province and country columns match, compute active = confirmed - (recovered + deaths).
For point 1) I can do the following:
# collect the column count (shape[1]) of each DataFrame in a list
df_shape_list = []
df_shape_list.append(df_confirmed.shape[1])
...
min_common_columns = min(df_shape_list)
So I need to subtract columns 4:min_common_columns, but how do I do that only where the province and country columns match across all 3 DataFrames?
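The two steps listed above can also be sketched directly in wide format. This is a minimal illustration rather than a definitive solution: the three frames below are small made-up stand-ins for the JHU files, the helper `to_indexed` is a hypothetical name, and the sketch assumes each (province, country) pair occurs at most once per file. It intersects the column labels instead of comparing column counts, so it does not rely on the date columns appearing in the same positions.

```python
import numpy as np
import pandas as pd

keys = ["Province/State", "Country/Region"]

# Toy stand-ins for the three JHU files (made-up numbers, not real data).
df_confirmed = pd.DataFrame({
    "Province/State": [np.nan, np.nan, "Hubei"],
    "Country/Region": ["Italy", "Spain", "China"],
    "Lat": [43.0, 40.0, 30.9], "Long": [12.0, -4.0, 112.2],
    "1/22/20": [10, 5, 100], "1/23/20": [20, 8, 150], "1/24/20": [30, 9, 200],
})
df_deaths = pd.DataFrame({
    "Province/State": [np.nan, "Hubei"],
    "Country/Region": ["Italy", "China"],
    "Lat": [43.0, 30.9], "Long": [12.0, 112.2],
    "1/22/20": [1, 10], "1/23/20": [2, 15], "1/24/20": [3, 20],
})
df_recovered = pd.DataFrame({
    "Province/State": [np.nan, "Hubei"],
    "Country/Region": ["Italy", "China"],
    "Lat": [43.0, 30.9], "Long": [12.0, 112.2],
    "1/22/20": [0, 5], "1/23/20": [1, 10],  # one date column fewer
})

def to_indexed(df):
    """Hypothetical helper: index by (province, country), keep only date columns."""
    out = df.copy()
    # Replace missing provinces with "" so they behave as ordinary labels.
    out["Province/State"] = out["Province/State"].fillna("")
    return out.set_index(keys).drop(columns=["Lat", "Long"])

c, d, r = (to_indexed(df) for df in (df_confirmed, df_deaths, df_recovered))

# Step 1: date columns present in all three frames.
common_dates = c.columns.intersection(d.columns).intersection(r.columns)
# Step 2: (province, country) pairs present in all three frames.
common_rows = c.index.intersection(d.index).intersection(r.index)

# Restrict each frame to the shared rows and columns, then apply the formula.
c, d, r = (x.reindex(index=common_rows, columns=common_dates) for x in (c, d, r))
df_active = c - (r + d)
print(df_active)
```

In this toy data, Spain and the 1/24/20 column are dropped because they are not present in all three frames.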
Consider melt to transform the wide data into long format, then merge on location and date. Then run the needed formula:
from functools import reduce
import pandas as pd

df_confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                           "csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
df_deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                        "csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
df_recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                           "csse_covid_19_time_series/time_series_covid19_recovered_global.csv")

# MELT EACH DF IN LIST COMPREHENSION
df_list = [df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                   var_name='Date', value_name=val)
           for df, val in zip([df_confirmed, df_deaths, df_recovered],
                              ['confirmed', 'deaths', 'recovered'])]

# CHAIN MERGE
df_long = reduce(lambda x, y: pd.merge(x, y, on=['Province/State', 'Country/Region', 'Lat', 'Long', 'Date']),
                 df_list)

# SIMPLE ARITHMETIC
df_long['active'] = df_long['confirmed'] - (df_long['recovered'] + df_long['deaths'])
Output (sorted by active descending):
df_long.sort_values(['active'], ascending=False).head(10)
# Province/State Country/Region Lat Long Date confirmed deaths recovered active
# 15229 NaN US 37.0902 -95.7129 3/27/20 101657 1581 869 99207
# 14998 NaN US 37.0902 -95.7129 3/26/20 83836 1209 681 81946
# 15141 NaN Italy 43.0000 12.0000 3/27/20 86498 9134 10950 66414
# 14767 NaN US 37.0902 -95.7129 3/25/20 65778 942 361 64475
# 14910 NaN Italy 43.0000 12.0000 3/26/20 80589 8215 10361 62013
# 14679 NaN Italy 43.0000 12.0000 3/25/20 74386 7503 9362 57521
# 14448 NaN Italy 43.0000 12.0000 3/24/20 69176 6820 8326 54030
# 14536 NaN US 37.0902 -95.7129 3/24/20 53740 706 348 52686
# 15205 NaN Spain 40.0000 -4.0000 3/27/20 65719 5138 9357 51224
# 14217 NaN Italy 43.0000 12.0000 3/23/20 63927 6077 7024 50826
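To see how the melt-then-merge approach handles frames of different shapes without downloading anything, here is a small offline sketch. The three frames are made-up stand-ins for the JHU files (the numbers are not real data), and the final conversion of Date to a datetime is an optional extra step, not part of the approach above. Because `pd.merge` defaults to an inner join, only locations and dates present in all three frames survive.

```python
from functools import reduce

import numpy as np
import pandas as pd

id_cols = ["Province/State", "Country/Region", "Lat", "Long"]

# Toy stand-ins for the three JHU files (made-up numbers, not real data).
confirmed = pd.DataFrame({
    "Province/State": [np.nan, np.nan, "Hubei"],
    "Country/Region": ["Italy", "Spain", "China"],
    "Lat": [43.0, 40.0, 30.9], "Long": [12.0, -4.0, 112.2],
    "1/22/20": [10, 5, 100], "1/23/20": [20, 8, 150], "1/24/20": [30, 9, 200],
})
deaths = pd.DataFrame({
    "Province/State": [np.nan, "Hubei"],
    "Country/Region": ["Italy", "China"],
    "Lat": [43.0, 30.9], "Long": [12.0, 112.2],
    "1/22/20": [1, 10], "1/23/20": [2, 15], "1/24/20": [3, 20],
})
recovered = pd.DataFrame({
    "Province/State": [np.nan, "Hubei"],
    "Country/Region": ["Italy", "China"],
    "Lat": [43.0, 30.9], "Long": [12.0, 112.2],
    "1/22/20": [0, 5], "1/23/20": [1, 10],  # one date column fewer
})

# Wide -> long: one row per (location, date), one value column per frame.
df_list = [
    df.melt(id_vars=id_cols, var_name="Date", value_name=name)
    for df, name in zip([confirmed, deaths, recovered],
                        ["confirmed", "deaths", "recovered"])
]

# Inner merge (the pd.merge default) keeps only rows found in every frame.
df_long = reduce(lambda x, y: pd.merge(x, y, on=id_cols + ["Date"]), df_list)

df_long["active"] = df_long["confirmed"] - (df_long["recovered"] + df_long["deaths"])

# Optional: parse the date labels so that sorting by Date is chronological.
df_long["Date"] = pd.to_datetime(df_long["Date"], format="%m/%d/%y")

print(df_long.sort_values(["Country/Region", "Date"]))
```

Here Spain (missing from two frames) and 1/24/20 (missing from the recovered frame) drop out of the result. If unmatched rows should be kept instead, passing how="outer" to `pd.merge` would retain them with NaN in the missing value columns.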