
How to use pd.read_csv to filter columns in Pandas

I have two questions about DataFrames.

Question 1: I need to read a CSV file into a DataFrame (the original file was linked here). I would like to read only the data for 2005 to 2015, and I need to set 'Country Name' as the index. Here is my code:

import pandas as pd
import numpy as np
GDP = (pd.read_csv('world_bank.csv', skiprows=4, index_col='Country Name', usecols=(0,50,51,52,53,54,55,56,57,58,59))
  .replace({"Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran","Hong Kong SAR, China": "Hong Kong"}))

But I get this error:

TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'

How do I correct my code?

Question 2: I need to calculate the average GDP over the last 10 years for each country. This should return a Series named avgGDP with 15 countries and their average GDP, sorted in descending order.

I have no idea how to approach this.

Could you help with these two questions? Thanks.

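The error most likely comes from calling replace on the whole DataFrame: pandas then tries to match the string keys against the float64 year columns as well. Consider replacing the Country Name values first and then setting that column as the index: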

GDP = pd.read_csv('world_bank.csv', skiprows=4, usecols=[0] + list(range(50,60)))

GDP['Country Name'] = GDP['Country Name'].replace(to_replace=["Korea, Rep.", "Iran, Islamic Rep.", "Hong Kong SAR, China"],
                                                  value=["South Korea", "Iran", "Hong Kong"])    
GDP = GDP.set_index('Country Name')
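As a rough illustration of the same idea on data you can run without the real file, the sketch below builds a tiny stand-in CSV in memory. The CSV contents, the two junk lines, and the helper name keep_column are all made up for this example. It also selects the year columns by name through a callable passed to usecols, which may be less fragile than counting column positions:

```python
import io
import pandas as pd

# Hypothetical miniature of the file: two junk lines, then the header row.
csv_text = """junk line 1
junk line 2
Country Name,Country Code,2004,2005,2006
"Korea, Rep.",KOR,1.0,2.0,3.0
"Iran, Islamic Rep.",IRN,4.0,5.0,6.0
Japan,JPN,7.0,8.0,9.0
"""

# Year labels to keep, as strings because CSV headers are read as strings.
wanted_years = {str(year) for year in range(2005, 2007)}

def keep_column(name):
    """Return True for the country column and the wanted year columns."""
    return name == "Country Name" or name in wanted_years

GDP = pd.read_csv(io.StringIO(csv_text), skiprows=2, usecols=keep_column)

# Replace only within the string column, so no float column is compared to str.
GDP["Country Name"] = GDP["Country Name"].replace(
    {"Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran"}
)
GDP = GDP.set_index("Country Name")
print(GDP)
```

With the real file you would keep skiprows=4 and widen the year range; the callable is only evaluated against the header names, so the numeric positions no longer matter.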

As for the average GDP, consider reshaping the DataFrame into long format with melt and then running groupby. However, these steps require Country Name to be a regular column rather than the index (so the set_index call is commented out):

# GDP = GDP.set_index('Country Name')

mGDP = pd.melt(GDP, id_vars='Country Name', var_name='Year', value_name='GDP')

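# Named aggregation (pandas >= 0.25); the dict-renaming form agg({'AvgGDP': 'mean'}) was removed in pandas 1.0
mGDP = mGDP.groupby(['Country Name'])['GDP'].agg(AvgGDP='mean').sort_values('AvgGDP', ascending=False)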

#                                                           AvgGDP
# Country Name                                                    
# World                                               6.735456e+13
# High income                                         4.600995e+13
# OECD members                                        4.518469e+13
# Post-demographic dividend                           4.298146e+13
# IDA & IBRD total                                    2.208337e+13
# Europe & Central Asia                               2.155501e+13
# Low & middle income                                 2.135025e+13
# Middle income                                       2.104455e+13
# IBRD only                                           2.038699e+13
# European Union                                      1.748398e+13
# North America                                       1.723858e+13
# East Asia & Pacific                                 1.719137e+13
# Upper middle income                                 1.653071e+13
# United States                                       1.562297e+13
# Late-demographic dividend                           1.441396e+13
# Euro area                                           1.282312e+13
# East Asia & Pacific (excluding high income)         8.697876e+12
# East Asia & Pacific (IDA & IBRD countries)          8.681259e+12
# Early-demographic dividend                          8.418026e+12
# China                                               6.934351e+12
# Latin America & Caribbean                           5.146274e+12
# Japan                                               4.959066e+12

Some cleanup is still required, since the result contains continent and region GDP totals and not just countries.
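One possible way to do that cleanup is to drop the aggregate rows by label and then take the 15 largest averages as a Series named avgGDP. This is only a sketch: the few values are copied from the output above rather than recomputed, and the aggregates list is a hand-written assumption that would need to cover every regional and income-group label in the real file.

```python
import pandas as pd

# Illustrative rows copied from the output above (not the full result).
avg = pd.DataFrame(
    {"AvgGDP": [6.735456e13, 4.600995e13, 1.562297e13, 6.934351e12, 4.959066e12]},
    index=pd.Index(
        ["World", "High income", "United States", "China", "Japan"],
        name="Country Name",
    ),
)

# Assumption: a hand-maintained list of non-country labels to exclude.
aggregates = ["World", "High income"]

# errors="ignore" skips labels that happen not to be present.
countries_only = avg.drop(index=aggregates, errors="ignore")

# nlargest returns the values sorted in descending order; with fewer than
# 15 rows it simply returns all of them.
avgGDP = countries_only["AvgGDP"].rename("avgGDP").nlargest(15)
print(avgGDP)
```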

You can do this your way, but I would recommend using the pandas_datareader package instead. You can follow the installation guide on its web page. Once it is installed, you can use the following to get what you want:

from pandas_datareader import wb
import pandas as pd


# Get the 2-character ISO country codes
all_countries = [x for x in wb.country_codes if len(x) == 2]

df_list = []

for country in all_countries:
    try:
        df_list.append(wb.download(indicator='NY.GDP.MKTP.CD', country = country, start=2005, end=2015, errors='ignore'))
    except ValueError:
        pass


GDP = pd.concat(df_list)

print(GDP.head())

#               NY.GDP.MKTP.CD
# country year                
# Andorra 2015             NaN
#         2014             NaN
#         2013    3.248925e+09
#         2012    3.146152e+09
#         2011    3.427023e+09

avgGDP = GDP.groupby(level=0).mean().sort_values(by='NY.GDP.MKTP.CD', ascending = False)

print(avgGDP.head())

#                 NY.GDP.MKTP.CD
# country                       
# United States     1.539304e+13
# China             6.511771e+12
# Japan             4.923866e+12
# Germany           3.471651e+12
# United Kingdom    2.704811e+12
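To see what groupby(level=0).mean() does with the missing years shown for Andorra, here is a small offline sketch. The numbers are toy values chosen for easy arithmetic, not real GDP figures, and the variable names toy and avg_series are made up for the example. By default mean skips NaN entries, so a country with gaps is averaged over the years that do have data:

```python
import numpy as np
import pandas as pd

# Toy frame with the same (country, year) MultiIndex layout as the download.
index = pd.MultiIndex.from_tuples(
    [
        ("Andorra", "2015"),
        ("Andorra", "2014"),
        ("Andorra", "2013"),
        ("Japan", "2015"),
        ("Japan", "2014"),
    ],
    names=["country", "year"],
)
toy = pd.DataFrame({"NY.GDP.MKTP.CD": [np.nan, np.nan, 3.0, 4.0, 6.0]}, index=index)

# Average per country (index level 0); NaN values are ignored by mean().
avg = toy.groupby(level=0).mean().sort_values(by="NY.GDP.MKTP.CD", ascending=False)

# Pull out the single column as a Series named avgGDP, as the question asks.
avg_series = avg["NY.GDP.MKTP.CD"].rename("avgGDP")
print(avg_series)
```

Whether averaging over fewer years is acceptable depends on the assignment; calling dropna on the frame first would instead remove the incomplete rows explicitly.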

You can still do it your way, but you may need to do a bit more work:

GDP1 = pd.read_csv('world_bank.csv', 
    skiprows=4, 
    index_col ='Country Name', 
    usecols=(0,50,51,52,53,54,55,56,57,58,59))

GDP1 = GDP1.reset_index().replace({'Country Name': {"Korea, Rep.": "South Korea", 
    "Iran, Islamic Rep.": "Iran",
    "Hong Kong SAR, China": "Hong Kong"}}).set_index('Country Name')

GDP1 = pd.melt(GDP1.reset_index(), 
    id_vars= 'Country Name', 
    var_name='Year', 
    value_name='GDP').set_index(['Country Name','Year'])

avgGDP1 = GDP1.groupby(level=0).mean().sort_values(by='GDP', ascending = False)

print(avgGDP1.head())

#                                     GDP
# Country Name                           
# World                      6.735456e+13
# High income                4.600995e+13
# OECD members               4.518469e+13
# Post-demographic dividend  4.298146e+13
# IDA & IBRD total           2.208337e+13

As you can see, getting the output from the downloaded file takes more work, so the pandas_datareader package is really handy here.

I hope this helps.


 