I have data for many countries over a period of time (2001-2003). It looks something like this:
index | year | country | inflation | GDP |
---|---|---|---|---|
1 | 2001 | AFG | nan | 48 |
2 | 2002 | AFG | nan | 49 |
3 | 2003 | AFG | nan | 50 |
4 | 2001 | CHI | 3.0 | nan |
5 | 2002 | CHI | 5.0 | nan |
6 | 2003 | CHI | 7.0 | nan |
7 | 2001 | USA | nan | 220 |
8 | 2002 | USA | 4.0 | 250 |
9 | 2003 | USA | 2.5 | 280 |
I want to drop countries in case there is no data (ie values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work by filtering all values that have nan in one of (inflation, GDP):
(
df.groupby(['country'])
.filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note, if you have more than two columns you can work on a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
You can also try this:
# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()
# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[ group_by.index.isin(indexes), : ]['country'])
# keep the rows contains the countries
df = df.drop(df[~df.country.isin(final_countries)].index)
You could reshape the data frame from long to wide, drop nulls, and then convert back to wide.
To convert from long to wide, you can use pivot functions . See this question too .
Here's code for dropping nulls, after its reshaped:
df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null
To convert back to long, you can use pd.melt.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.