Highlight data gaps (NaN) in Matplotlib Scatter Plot

I am plotting some time-based data from pandas in matplotlib (possibly tens of thousands of rows), and I would like to highlight periods where the data contain NaNs. My idea was to use axvspan to draw red boxes on the plot that start and stop where the data gaps are. I considered drawing a vertical line with axvline at each NaN, but that could create thousands of objects on the plot and make the resulting PNG slow to write, so axvspan seems more appropriate. Where I am stuck is finding the start and stop indices of each group of NaNs.

The code below is not my actual code, just a basic mockup to show what I am trying to achieve.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
print(df)

# Code to find the start index and stop index of the groups of NaNs
# results in a list which contains lists of each gap's start and stop datetime
gaps = []

plt.plot(df.index, df['col'])

for gap in gaps: 
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)

plt.show()

The result would look something like the mockup below: [image: mockup]

Other suggestions for visualizing the gaps would also be appreciated, such as a straight line in a different color connecting the data across each gap, perhaps using some sort of fillna.

To find the start and stop indices of the groups of NaNs, first create a boolean variable that marks where col is NaN. With this variable you can find the rows where there is a transition between valid and NaN values. This is done with shift (to offset the series by one row) and ne, which together compare each row with the previous one and flag where the value changes. Applying cumsum to those flags then assigns a distinct group number to each contiguous run of valid or NaN values.

Now, using only the rows with NaN values ( df[is_nan] ), use groupby with n_groups so that each gap forms its own group. Next, apply aggregate to return a single tuple with the start and end timestamps of each group. The DateOffset extends each rectangle to the adjacent valid points, matching the desired image output. Finally, select ['col'] from the DataFrame returned by aggregate and use .values to get the tuples as an array that can be iterated like a list.

...
...
df = df.set_index('idx')
print(df)

# Code to find the start index and stop index of the groups of NaNs
is_nan = df['col'].isna()
n_groups = is_nan.ne(is_nan.shift()).cumsum()
gap_list = df[is_nan].groupby(n_groups).aggregate(
    lambda x: (
        x.index[0] + pd.DateOffset(days=-1),
        x.index[-1] + pd.DateOffset(days=+1)
    )
)["col"].values

# results in an array which contains tuples of each gap's start and stop datetime
gaps = gap_list

plt.plot(df.index, df['col'], marker='o')
plt.xticks(df.index, rotation=45)

for gap in gaps:
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)

plt.grid()
plt.show()

[image: plot_nan_gaps]
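The grouping steps described in this answer can also be traced one at a time. The following is only a sketch under a few assumptions: it uses a fixed start date and a Series named s instead of the question's DataFrame, and it replaces the lambda-based aggregate with min/max on the timestamps, which is a variation rather than the answer's exact method. The one-day widening assumes daily sampling, as in the mockup.

```python
import numpy as np
import pandas as pd

# Mockup data with a fixed start date (an assumption, for reproducibility)
days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
s = pd.Series(data, index=days, name='col')

# Step 1: boolean mask, True where the value is missing
is_nan = s.isna()

# Step 2: a new group number starts wherever the mask differs from the previous row
group_id = is_nan.ne(is_nan.shift()).cumsum()

# Step 3: keep only the NaN rows and take the first/last timestamp of each group
timestamps = s.index.to_series()
bounds = timestamps[is_nan].groupby(group_id[is_nan]).agg(['min', 'max'])

# Step 4: widen each gap by one sample (one day here) to reach the neighboring valid points
one_day = pd.Timedelta(days=1)
gaps = list(zip(bounds['min'] - one_day, bounds['max'] + one_day))
```

Each element of gaps is a (start, stop) pair of timestamps, so it can be passed to plt.axvspan(start, stop, ...) in the same loop as above.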

We can use fill_between to highlight areas. However, it is much easier to define the regions where data exist than the regions where they are missing, because shading only the NaN positions would leave the shaded area short of the neighboring data points. So we simply highlight the entire plotting area, then paint over the regions where data exist in white, then plot the line:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt


days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')


fig, ax = plt.subplots()
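# Shade the full time range between the data minimum and maximum
ax.fill_between(df.index, df.col.min(), df.col.max(), facecolor="lightblue", alpha=0.5)
# Paint white over the stretches where data exist, leaving only the gaps shaded
ax.fill_between(df.index, df.col.min(), df.col.max(), where=np.isfinite(df.col), facecolor="white", alpha=1)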
ax.plot(df.index, df.col)

ax.xaxis.set_tick_params(rotation=45)
plt.tight_layout()
plt.show()

Sample output:

[image]

You can loop through the boolean values given by df['col'].isna() and compare each value to the previous one to pick out the timestamps where gaps start and stop. Here is an example based on your code sample, with the plot generated by the pandas plotting function:

import numpy as np               # v 1.19.2
import pandas as pd              # v 1.2.3
import matplotlib.pyplot as plt  # v 3.3.4

days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame(dict(col=data), index=days)

ax = df.plot(y='col', marker='.', figsize=(8,4))

# Generate lists of start and stop timestamps for gaps in the time series,
# assuming that the first and last data points are not NaNs
is_nan = df['col'].isna().to_numpy()  # compute the mask once, as a plain array
starts, stops = [], []
for idx in range(1, len(is_nan)):
    if is_nan[idx] and not is_nan[idx-1]:
        # gap begins: record the last valid timestamp before it
        starts.append(df.index[idx-1])
    elif not is_nan[idx] and is_nan[idx-1]:
        # gap ends: record the first valid timestamp after it
        stops.append(df.index[idx])

# Plot red vertical spans for gaps in time series
for start, stop in zip(starts, stops): 
    ax.axvspan(start, stop, facecolor='r', alpha=0.3)

plt.show()

[image: time interval]
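For long series, the same transitions can be located without a Python-level loop. The sketch below is one possible vectorized variant, not part of the answer above: the helper name nan_gap_bounds is hypothetical, and it returns the first and last NaN timestamp of each run rather than the neighboring valid points. Padding the mask with False on both sides also handles gaps that touch the start or end of the series.

```python
import numpy as np
import pandas as pd

def nan_gap_bounds(series):
    """Return (first_nan, last_nan) timestamp pairs for each run of NaNs.

    Hypothetical helper for illustration only.
    """
    mask = series.isna().to_numpy()
    # Pad with False so runs touching either edge still produce a start and a stop
    padded = np.concatenate(([False], mask, [False]))
    change = np.diff(padded.astype(np.int8))
    first = np.flatnonzero(change == 1)       # positions where a NaN run begins
    last = np.flatnonzero(change == -1) - 1   # positions where a NaN run ends
    return list(zip(series.index[first], series.index[last]))

days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
s = pd.Series(data, index=days)

bounds = nan_gap_bounds(s)
```

The resulting pairs can be fed to ax.axvspan in the same way as the starts and stops lists.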

In the end I took a little from each of the provided answers; thanks for the feedback. Building the list of starts and stops was very slow for real-world data (tens to hundreds of thousands of rows). Since I didn't need a numerical answer, just a visual one, I did it using matplotlib alone with the following code:

# is_nan is assumed to be the boolean mask of the gaps, e.g. data.isna()
ax[i].fill_between(data.index, 0, (is_nan*data.max()), color='r', step='mid', linewidth=0)
ax[i].plot(data.index, data, color='b', linestyle='-', marker=',', label=ylabel)

The fill_between call creates the shaded blocks where the NaNs are. Multiplying the mask by data.max() makes the blocks reach from 0 up to the top of the data. step='mid' squares off the sides, and linewidth=0 hides the red line where the mask is 0 (not NaN).
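If the blocks should cover the full height of the axes regardless of the data range (for example when the data contain negative values), one option is to give the y values in axes coordinates. The sketch below is a variation on the code above, not the original poster's code: it uses a single Axes and the mockup data with a fixed start date, and passes transform=ax.get_xaxis_transform() so that 0 and 1 mean the bottom and top of the axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Mockup data with a fixed start date (an assumption, for reproducibility)
days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
s = pd.Series(data, index=days)
mask = s.isna()

fig, ax = plt.subplots()

# x is in data coordinates, y in axes coordinates (0 = bottom, 1 = top),
# so the mask itself (0.0 or 1.0) gives full-height blocks at the NaN positions
shade = ax.fill_between(
    s.index, 0, mask.astype(float),
    step='mid', color='r', alpha=0.3, linewidth=0,
    transform=ax.get_xaxis_transform(),
)
ax.plot(s.index, s, color='b', marker='.')
```

As in the code above, this creates a single filled artist however many gaps there are, which keeps it fast for long series.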

The technical posts on this site follow the CC BY-SA 4.0 license. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.