I want to track the number of new COVID-19 cases at each establishment of a company, as a daily time series. I'd like to see how new cases can be tracked in real time with a clear EDA plot. I tried matplotlib to make a histogram for each company on one page, but couldn't get it right. Can anyone point out how to fix this?
reproducible data:
The reproducible COVID-19 tracking time series data is in this gist. In this data, est refers to the establishment code, so each company may have multiple establishment codes.
my attempt
Here is my attempt with seaborn and matplotlib:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import timedelta, datetime
bigdf = pd.read_csv("coviddf.csv")
markers = {"new_case_sum": "s", "est_company": "X"}
for t in bigdf.company.unique():
    grouped = bigdf[bigdf.company==t]
    res = grouped.groupby(['run_date','county-state', 'company'])['new'].sum().unstack().reset_index('run_date')
    f, axes = plt.subplots(nrows=len(bigdf.company), ncols= 1, figsize=(20, 7), squeeze=False)
    for j in range(len(bigdf.company)):
        p = sns.scatterplot('run_date', 'new', data=res, hue='company', markers=markers, style='cats', ax=axes[j, 0])
        p.set_title(f'Threshold: {t}\n{pt}')
        p.set_xlim(data['run_date'].min() - timedelta(days=60), data['run_date'].max() + timedelta(days=60))
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
But I couldn't get a correct plot. I think the data aggregation is right, but I seem to be passing the wrong data attributes when rendering the plot. Can anyone point out my mistake, or suggest a better approach?
desired plot
Ideally, I want to render a plot with a structure like this (the attached plot is just a reference from another site):
Can anyone suggest how to fix my approach above, or a better way to make a time series plot for COVID tracking? Thanks.
update:
In my attempt, I tried to aggregate the number of new cases over all establishments in each company and then make a line chart or histogram. How can we make a line chart that shows the confirmed, death, and new cases for every establishment (the est column) in each company over time, in a one-page plot?
Use sns.FacetGrid and sns.barplot.
Each figure will be a company, and each column will be a barplot for each est.
The x-axis is run_date. I added extra data so there would be two dates.
hue will be the type of val: new, confirmed, and death.
.stack is used on the groupby result to stack new, confirmed, and death into one column.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for tight_layout() and show() below
# load and clean data
df = pd.read_csv("https://gist.githubusercontent.com/jerry-shad/318595505684ea4248a6cc0949788d33/raw/31bbeb08f329b4b96605b8f2a48f6c74c3e0b594/coviddf.csv")
df.drop(columns=['Unnamed: 0'], inplace=True)  # drop this extra column
df.run_date = pd.to_datetime(df.run_date)  # set run_date to a datetime format
# plot: one FacetGrid per company, one column per establishment
for company, d in df.groupby('company'):
    # aggregate per date and establishment, then stack the three measures into long form
    data = d.groupby(['run_date','county-state', 'company', 'est'], as_index=True).agg({'new': 'sum', 'confirmed': 'sum', 'death': 'sum'}).stack().reset_index().rename(columns={'level_4': 'type', 0: 'val'})
    # display(data)  # if you're not using Jupyter, change display to print
    # print('\n')
    print(f'{company}')
    grid = sns.FacetGrid(data, col='est', sharex=False, sharey=False, height=5, col_wrap=4)
    grid.map(sns.barplot, 'run_date', 'val', 'type', order=data.run_date.dt.date.unique(), hue_order=data['type'].unique())
    grid.add_legend()
    grid.set_xticklabels(rotation=90)
    grid.set(yscale='log')
    plt.tight_layout()
    plt.show()
groupby example for Vergin:
    run_date    county-state      company  est  type         val
0   2020-08-30  ColfaxNebraska    Vergin   86M  new            2
1   2020-08-30  ColfaxNebraska    Vergin   86M  confirmed    718
2   2020-08-30  ColfaxNebraska    Vergin   86M  death          5
3   2020-08-30  FordKansas        Vergin   86K  new            0
4   2020-08-30  FordKansas        Vergin   86K  confirmed   2178
5   2020-08-30  FordKansas        Vergin   86K  death         10
6   2020-08-30  FresnoCalifornia  Vergin   354  new            0
7   2020-08-30  FresnoCalifornia  Vergin   354  confirmed  23932
8   2020-08-30  FresnoCalifornia  Vergin   354  death        239
9   2020-08-30  MorganColorado    Vergin   86R  new            1
10  2020-08-30  MorganColorado    Vergin   86R  confirmed    711
11  2020-08-30  MorganColorado    Vergin   86R  death         48
12  2020-08-30  ParmerTexas       Vergin   86E  new            1
13  2020-08-30  ParmerTexas       Vergin   86E  confirmed    381
14  2020-08-30  ParmerTexas       Vergin   86E  death          7
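The reshaping that produces a long-form table like the one above can be tried in isolation. This is a minimal sketch on a made-up toy frame: the values and the names toy and long_df are assumptions for illustration, and only the column names mirror the gist. It names the stacked level with rename_axis instead of renaming level_4 afterwards, which is a small variation on the answer's approach.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the gist.
toy = pd.DataFrame({
    "run_date": pd.to_datetime(["2020-08-30", "2020-08-30", "2020-08-31", "2020-08-31"]),
    "company": ["A", "A", "A", "A"],
    "est": ["e1", "e2", "e1", "e2"],
    "new": [2, 0, 3, 1],
    "confirmed": [10, 20, 13, 21],
    "death": [1, 0, 1, 0],
})

# Sum each measure per date/company/establishment, then stack the three
# measure columns into a single 'val' column labelled by 'type'.
long_df = (
    toy.groupby(["run_date", "company", "est"])[["new", "confirmed", "death"]]
    .sum()
    .stack()
    .rename_axis(["run_date", "company", "est", "type"])
    .reset_index(name="val")
)
print(long_df)
```

Each input row becomes three output rows, one per measure, which is the shape that the hue argument of a seaborn plot expects.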
import pandas as pd
import plotly.express as px
# load and clean data
df = pd.read_csv("https://gist.githubusercontent.com/jerry-shad/318595505684ea4248a6cc0949788d33/raw/31bbeb08f329b4b96605b8f2a48f6c74c3e0b594/coviddf.csv")
df.drop(columns=['Unnamed: 0'], inplace=True) # drop this extra column
df.run_date = pd.to_datetime(df.run_date) # set run_date to a datetime format
# convert to long form
dfl = df.set_index(['company', 'est', 'latitude', 'longitude'])[['confirmed', 'new', 'death']].stack().reset_index().rename(columns={'level_4': 'type', 0: 'vals'})
# plot
fig = px.scatter_geo(dfl,
                     lon='longitude',
                     lat='latitude',
                     color="type",  # which column to use to set the color of markers
                     hover_name="company",  # column added to hover information
                     size="vals",  # size of markers
                     projection="albers usa")
fig.show()
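The update in the question asks for a line chart rather than bars. One possible sketch, using plain matplotlib, draws one subplot per est for a given company with one line per measure. The toy data and the helper name plot_company_lines are hypothetical and not part of the original answer; with the real data you would pass df after the loading and cleaning steps shown earlier.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with made-up values; only the column names mirror the gist.
toy = pd.DataFrame({
    "run_date": pd.to_datetime(["2020-08-30", "2020-08-30", "2020-08-31", "2020-08-31"]),
    "company": ["A", "A", "A", "A"],
    "est": ["e1", "e2", "e1", "e2"],
    "new": [2, 0, 3, 1],
    "confirmed": [10, 20, 13, 21],
    "death": [1, 0, 1, 0],
})

def plot_company_lines(df, company, metrics=("new", "confirmed", "death")):
    """Hypothetical helper: one subplot per establishment, one line per metric."""
    sub = df[df["company"] == company]
    ests = sorted(sub["est"].unique())
    fig, axes = plt.subplots(1, len(ests), figsize=(5 * len(ests), 4), squeeze=False)
    for ax, est in zip(axes[0], ests):
        # Sum each metric per date for this establishment.
        daily = sub[sub["est"] == est].groupby("run_date")[list(metrics)].sum()
        for m in metrics:
            ax.plot(daily.index, daily[m], marker="o", label=m)
        ax.set_title(f"{company} / {est}")
        ax.set_xlabel("run_date")
        ax.tick_params(axis="x", rotation=90)
    axes[0][0].set_ylabel("cases")
    axes[0][0].legend()
    fig.tight_layout()
    return fig, axes

fig, axes = plot_company_lines(toy, "A")
```

Calling plt.show() (or fig.savefig with a file name of your choice) displays the result. Because confirmed counts are usually much larger than new or death counts, a log scale via ax.set_yscale('log') may be worth trying, as in the bar plot version.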