I have two questions about DataFrames.
Question 1: I need to read a CSV file as a DataFrame. I would like to read only the data for 2005 to 2015, and I need to set 'Country Name' as the index. Here is my code:
import pandas as pd
import numpy as np
GPD = (pd.read_csv('world_bank.csv', skiprows=4, index_col='Country Name', usecols=(0,50,51,52,53,54,55,56,57,58,59))
.replace({"Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran","Hong Kong SAR, China": "Hong Kong"}))
But I get this error:
TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'
How do I correct my code?
Question 2: I need to calculate the average GDP over the last 10 years for each country. This should return a Series named avgGDP with 15 countries and their average GDP, sorted in descending order.
I have no idea how to do this.
Could you help with these two questions? Thanks.
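DataFrame.replace works on cell values and does not touch the index, so consider replacing the Country Name values while it is still a column, then setting it as the index: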
GDP = pd.read_csv('world_bank.csv', skiprows=4, usecols=[0] + list(range(50,60)))
GDP['Country Name'] = GDP['Country Name'].replace(
    to_replace=["Korea, Rep.", "Iran, Islamic Rep.", "Hong Kong SAR, China"],
    value=["South Korea", "Iran", "Hong Kong"])
GDP = GDP.set_index('Country Name')
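The same idea can be tried without the CSV. The following is only a minimal sketch on a tiny in-memory frame: the column layout and the GDP figures are made up for illustration, and only the country renames come from the question.

```python
import pandas as pd

# Tiny stand-in for the CSV; the GDP figures are made up for illustration.
GDP = pd.DataFrame({
    "Country Name": ["Korea, Rep.", "Iran, Islamic Rep.", "Japan"],
    "2014": [1.0, 2.0, 3.0],
    "2015": [1.5, 2.5, 3.5],
})

# Rename only within the 'Country Name' column, leaving the numeric columns untouched.
GDP["Country Name"] = GDP["Country Name"].replace(
    {"Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran"}
)

# Only now move the cleaned column into the index.
GDP = GDP.set_index("Country Name")
print(GDP.index.tolist())
```

Restricting replace to the one string column means the float columns are never compared against strings.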
As for the average GDP, consider melting the dataframe into long format with melt and then running groupby. However, these steps require Country Name to be a named column and not the index (so set_index is commented out):
# GDP = GDP.set_index('Country Name')
mGDP = pd.melt(GDP, id_vars='Country Name', var_name='Year', value_name='GDP')
# Named aggregation; the dict-renaming form agg({'AvgGDP': 'mean'}) no longer works since pandas 1.0
mGDP = mGDP.groupby(['Country Name'])['GDP'].agg(AvgGDP='mean').sort_values('AvgGDP', ascending=False)
# AvgGDP
# Country Name
# World 6.735456e+13
# High income 4.600995e+13
# OECD members 4.518469e+13
# Post-demographic dividend 4.298146e+13
# IDA & IBRD total 2.208337e+13
# Europe & Central Asia 2.155501e+13
# Low & middle income 2.135025e+13
# Middle income 2.104455e+13
# IBRD only 2.038699e+13
# European Union 1.748398e+13
# North America 1.723858e+13
# East Asia & Pacific 1.719137e+13
# Upper middle income 1.653071e+13
# United States 1.562297e+13
# Late-demographic dividend 1.441396e+13
# Euro area 1.282312e+13
# East Asia & Pacific (excluding high income) 8.697876e+12
# East Asia & Pacific (IDA & IBRD countries) 8.681259e+12
# Early-demographic dividend 8.418026e+12
# China 6.934351e+12
# Latin America & Caribbean 5.146274e+12
# Japan 4.959066e+12
Some cleanup is still required, since the result includes continent and region GDP totals and not just countries.
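One possible way to do that cleanup is to drop the aggregate rows by label. This is only a sketch: the aggregates list below is a hypothetical, incomplete selection taken from the output above and would need to be extended for the real file, and the small frame stands in for the averaged result.

```python
import pandas as pd

# Small stand-in for the averaged result (values copied from the output shown above).
avg = pd.DataFrame(
    {"AvgGDP": [6.735456e13, 4.600995e13, 1.562297e13, 6.934351e12, 4.959066e12]},
    index=pd.Index(
        ["World", "High income", "United States", "China", "Japan"],
        name="Country Name",
    ),
)

# Hypothetical, incomplete list of aggregate rows; extend it for the real file.
aggregates = ["World", "High income", "OECD members", "European Union"]

# Keep only rows whose label is not an aggregate, then reduce to a named Series.
avgGDP = (
    avg.loc[~avg.index.isin(aggregates), "AvgGDP"]
    .sort_values(ascending=False)
    .rename("avgGDP")
)
print(avgGDP)
```

On the full data, appending .head(15) to the chain would give the 15 largest entries that remain after filtering.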
You can do this your way, but I would recommend that you use the pandas_datareader package for this. You can follow its installation guide to install it. Once you've got it installed, you can use the following to get what you want:
from pandas_datareader import wb
import pandas as pd
# Get 2-character ISO country names
all_countries = [x for x in wb.country_codes if len(x) == 2]
df_list = []
for country in all_countries:
    try:
        df_list.append(wb.download(indicator='NY.GDP.MKTP.CD', country=country, start=2005, end=2015, errors='ignore'))
    except ValueError:
        # Skip codes for which no data could be downloaded
        pass
GDP = pd.concat(df_list)
print(GDP.head())
# NY.GDP.MKTP.CD
# country year
# Andorra 2015 NaN
# 2014 NaN
# 2013 3.248925e+09
# 2012 3.146152e+09
# 2011 3.427023e+09
avgGDP = GDP.groupby(level=0).mean().sort_values(by='NY.GDP.MKTP.CD', ascending=False)
print(avgGDP.head())
# NY.GDP.MKTP.CD
# country
# United States 1.539304e+13
# China 6.511771e+12
# Japan 4.923866e+12
# Germany 3.471651e+12
# United Kingdom 2.704811e+12
You can still do it your way, but you may need to do a bit more work:
GDP1 = pd.read_csv('world_bank.csv',
                   skiprows=4,
                   index_col='Country Name',
                   usecols=(0,50,51,52,53,54,55,56,57,58,59))
# Replace names while 'Country Name' is a column, then restore it as the index
GDP1 = GDP1.reset_index().replace({'Country Name': {"Korea, Rep.": "South Korea",
                                                    "Iran, Islamic Rep.": "Iran",
                                                    "Hong Kong SAR, China": "Hong Kong"}}).set_index('Country Name')
# Reshape to long format: one row per (country, year)
GDP1 = pd.melt(GDP1.reset_index(),
               id_vars='Country Name',
               var_name='Year',
               value_name='GDP').set_index(['Country Name','Year'])
avgGDP1 = GDP1.groupby(level=0).mean().sort_values(by='GDP', ascending=False)
print(avgGDP1.head())
# GDP
# Country Name
# World 6.735456e+13
# High income 4.600995e+13
# OECD members 4.518469e+13
# Post-demographic dividend 4.298146e+13
# IDA & IBRD total 2.208337e+13
As you can see, it's much harder to get the output by downloading the file directly, so the pandas_datareader package is really handy here.
I hope this helps.