
Slicing dataframe with subset of columns

I'm a beginner in Python and am trying to assign a DataFrame containing only a subset of its columns (slicing?). I have two methods that I think should both work, but only one does, and I'm trying to understand why. Method 1 works, but method 2 raises KeyError: ('Name', 'Cost').

Method 1:

import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
columns_to_keep = ['Name','Cost']
df = df[columns_to_keep]

Method 2:

import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
columns_to_keep = ['Name','Cost']
df = df['Name','Cost']

As far as I can see, both assign df using the same set of columns. Why doesn't method 2 work?

This comes from how advanced indexing works in NumPy/pandas. The NumPy documentation describes the trigger as follows:

Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool)

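Note that in method 2, df = df['Name','Cost'] is the same as df = df[('Name','Cost')], so the selection object is a tuple. In NumPy terms that is basic indexing; pandas treats the tuple as a single column label, and since no column is labelled ('Name', 'Cost'), it raises the KeyError.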

In Python, x[(exp1, exp2, ..., expN)] is equivalent to x[exp1, exp2, ..., expN]; the latter is just syntactic sugar for the former.
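A quick way to check this equivalence is a small class whose `__getitem__` simply returns whatever key it receives. The class name `ShowKey` is made up for this illustration and is not part of pandas or NumPy.

```python
class ShowKey:
    """Hypothetical helper: hands back the key passed to the subscript."""

    def __getitem__(self, key):
        return key


x = ShowKey()
comma_key = x['Name', 'Cost']      # bare commas inside the brackets
tuple_key = x[('Name', 'Cost')]    # explicit tuple
list_key = x[['Name', 'Cost']]     # list

print(type(comma_key).__name__, comma_key)   # tuple ('Name', 'Cost')
print(comma_key == tuple_key)                # True
print(type(list_key).__name__, list_key)     # list ['Name', 'Cost']
```

The first two subscripts hand the object the same tuple; only the doubled brackets hand it a list.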

You need to put the column names in an array or a list (as in your method 1), not a tuple, to trigger the advanced indexing that selects multiple columns in one go:

>>> df = df[['Name','Cost']] # also df[np.array(['Name','Cost'])] works
>>> df
          Name  Cost
Store 1  Chris  22.5
Store 1  Kevyn   2.5
Store 2  Vinod   5.0
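For comparison, here is a minimal sketch that puts both behaviours side by side on the same data. It builds the frame from a dict rather than from three Series purely for brevity, and the variable names (`tuple_lookup_failed`, `subset`, `subset_loc`) are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

# A tuple key is looked up as ONE column label, which does not exist here.
try:
    df['Name', 'Cost']
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

# A list key selects several columns at once.
subset = df[['Name', 'Cost']]

# Same selection spelled with .loc: all rows, the listed columns.
subset_loc = df.loc[:, ['Name', 'Cost']]

print(tuple_lookup_failed)
print(list(subset.columns))
```

A tuple key is still useful when the columns form a MultiIndex, where a tuple really is a single column label; for a flat column index like this one, use a list.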
