
Pandas iterate function through data frame

I'm relatively new to Python and functions. I'm attempting to iterate the following function through each row of a dataframe and append the computed result for each row to a new column:

def manhattan_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

For reference, this is the dataframe I'm testing on:

import pandas as pd

entries = [
    {'age1': '2', 'age2': '2'},
    {'age1': '12', 'age2': '12'},
    {'age1': '5', 'age2': '50'}
]

df = pd.DataFrame(entries)

# The values arrive as strings, so convert both columns to integers
df['age1'] = df['age1'].astype(str).astype(int)
df['age2'] = df['age2'].astype(str).astype(int)

I've seen the answers to "How to iterate over rows in a DataFrame in Pandas?" and have got as far as this:

import itertools
for index, row in df.iterrows():

    df['distance']=df.apply(lambda row: manhattan_distance(row['age1'], row['age2']), axis=1)

Which returns the following:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-aa6a21cd1de9> in <module>()
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:    manhattan_distance(row['age1'], row['age2']), axis=1)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in   apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4852                         f, axis,
   4853                         reduce=reduce,
-> 4854                         ignore_failures=ignore_failures)
   4855             else:
   4856                 return self._apply_broadcast(f, axis)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4948             try:
   4949                 for i, v in enumerate(series_gen):
-> 4950                     results[i] = func(v)
   4951                     keys.append(v.name)
   4952             except Exception as e:

<ipython-input-42-aa6a21cd1de9> in <lambda>(row)
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:     manhattan_distance(row['age1'], row['age2']), axis=1)

<ipython-input-36-74da75398c4c> in manhattan_distance(x, y)
      1 def manhattan_distance(x,y):
      2 
----> 3   return sum(abs(a-b) for a,b in zip(x,y))
      4  #   return sum(abs(a-b) for a,b in map(lambda x: zip(a,b)))

TypeError: ('zip argument #1 must support iteration', 'occurred at index 0')

Based on other responses to the question I referred to above, I amended the zip statement in my function to use map() (the amended line appears at the end of the traceback below) and reran the same loop:

import itertools
for index, row in df.iterrows():

    df['distance']=df.apply(lambda row: manhattan_distance(row['age1'], row['age2']), axis=1)

The above returns this:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-44-aa6a21cd1de9> in <module>()
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:   manhattan_distance(row['age1'], row['age2']), axis=1)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4852                         f, axis,
   4853                         reduce=reduce,
-> 4854                         ignore_failures=ignore_failures)
   4855             else:
   4856                 return self._apply_broadcast(f, axis)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4948             try:
   4949                 for i, v in enumerate(series_gen):
-> 4950                     results[i] = func(v)
   4951                     keys.append(v.name)
   4952             except Exception as e:

<ipython-input-44-aa6a21cd1de9> in <lambda>(row)
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:  manhattan_distance(row['age1'], row['age2']), axis=1)

<ipython-input-43-5daf167baf5f> in manhattan_distance(x, y)
      2 
      3 #  return sum(abs(a-b) for a,b in zip(x,y))
----> 4    return sum(abs(a-b) for a,b in map(lambda x: zip(a,b)))

TypeError: ('map() must have at least two arguments.', 'occurred at index 0')

If this is the right approach to take, I'm unclear what my map() arguments need to be for the function to work.
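Both errors appear to share one cause. With axis=1, row['age1'] and row['age2'] are single integers, while zip() needs iterables, hence "zip argument #1 must support iteration". The map() attempt fails earlier still, because map() takes a function plus at least one iterable and here it only receives the lambda. The outer iterrows() loop is also not needed, since apply() already visits every row. A minimal sketch that keeps the original two-argument function and passes it small groups of columns instead of scalars follows; the names x_cols and y_cols are hypothetical and only there for illustration.

```python
import pandas as pd

def manhattan_distance(x, y):
    # Original two-argument version: expects two iterables of coordinates.
    return sum(abs(a - b) for a, b in zip(x, y))

df = pd.DataFrame({'age1': [2, 12, 5], 'age2': [2, 12, 50]})

# Calling the function with two scalars reproduces the TypeError,
# because zip() cannot iterate over a single integer.
try:
    manhattan_distance(df.loc[0, 'age1'], df.loc[0, 'age2'])
    scalar_call_failed = False
except TypeError:
    scalar_call_failed = True

# Hypothetical column groupings: each list holds the coordinates of one point.
x_cols = ['age1']
y_cols = ['age2']

# row[x_cols] is a one-element Series, which zip() can iterate over.
# No iterrows() loop is needed around this call.
df['distance'] = df.apply(
    lambda row: manhattan_distance(row[x_cols], row[y_cols]),
    axis=1,
)
print(df)
```

With more coordinates per point, extending x_cols and y_cols (in matching order) should be enough; the function itself stays unchanged.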

import numpy as np
import pandas as pd

entries = [
{'age1':'2', 'age2':'2'},
{'age1':'12', 'age2': '12'},
{'age1':'5', 'age2': '50'}
]

df = pd.DataFrame(entries)
df['age1'] = df['age1'].astype(str).astype(int)
df['age2'] = df['age2'].astype(str).astype(int)

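def manhattan_distance(row):
    # https://en.wikipedia.org/wiki/Taxicab_geometry#Formal_definition
    # With axis=1, apply() passes one row at a time, so row['age1'] and
    # row['age2'] are scalars and no zip() is needed. np.sum() is a no-op
    # for a single pair of values.
    return np.sum(abs(row['age1']-row['age2']))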

df['distance'] = df.apply(manhattan_distance, axis=1)
print(df)
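As a possible alternative to apply(), the same column can be computed with vectorized column arithmetic, which usually scales better on large frames. This is a sketch rather than part of the answer above; the x_cols and y_cols lists are hypothetical names showing how the idea might extend to points with several coordinates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age1': [2, 12, 5], 'age2': [2, 12, 50]})

# One coordinate per point: subtract the columns and take the absolute value.
df['distance'] = (df['age1'] - df['age2']).abs()

# Several coordinates per point (hypothetical groupings, in matching order):
# subtract the two blocks element-wise, then sum absolute differences per row.
x_cols = ['age1']
y_cols = ['age2']
diff = df[x_cols].to_numpy() - df[y_cols].to_numpy()
df['distance_nd'] = np.abs(diff).sum(axis=1)

print(df)
```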


 