How to iterate efficiently over a Pandas DataFrame with numpy.vectorize?

I'm trying to iterate over a Pandas DataFrame, passing each row as a parameter to a function. I tried this:

def vectorize_df(df, hg):
   print(hg + str(df['tweets_id']) + df['tokenized_text'])

df = pd.DataFrame.from_records(belongs_node, columns=['tweets_id','tokenized_text'])
vfunct = numpy.vectorize(vectorize_df)
vfunct(df, "#Python")

The problem is that when I do this, the df parameter receives the value from 'tweets_id' instead of the whole row. Thanks a lot :)

When you define a function to be vectorized:

  • each column should be a separate parameter,
  • you should call it passing the corresponding columns,
  • "other" parameters (not taken from the source array) should be marked as "excluded" parameters.
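As a rough illustration of the third point, the sketch below uses a hypothetical function, count_hits, with a list-valued keywords parameter; neither name comes from the question. NumPy documents excluded as accepting keyword names or positional indices, so an argument excluded by name is probably best passed by keyword (or excluded by its position instead).

```python
import numpy as np

# Hypothetical helper: counts how many words of `text` appear in `keywords`.
def count_hits(text, keywords):
    return sum(word in keywords for word in text.split())

# `keywords` is not taken from the source array, so it is marked as excluded.
vcount = np.vectorize(count_hits, excluded=['keywords'])

texts = np.array(['aaa bbb', 'ccc ddd eee'])

# One call per element of `texts`; the full keyword list goes to each call.
hits = vcount(texts, keywords=['aaa', 'ccc', 'eee'])
print(hits)  # [1 2]
```

Without the exclusion, NumPy would be expected to try broadcasting the three-element list against the two-element array, which should fail with a shape error.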

Another detail is that a vectorized function should not print anything; it should return a value: the result of processing the parameters from the current source row.
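A small sketch of why this matters, using two hypothetical functions (print_only and returning) that are not part of the question. A function without a return statement yields None for every element. In addition, the NumPy documentation notes that, unless otypes is given, the function is called once more on the first element to determine the output type, so side effects such as printing can appear more often than there are rows.

```python
import numpy as np

calls = []

def print_only(value):
    # Stands in for a side effect such as print(); nothing is returned.
    calls.append(value)

def returning(value):
    # Returns the processed value for the current element.
    return f'row {value}'

ids = np.array([101, 102, 103])

nothing = np.vectorize(print_only)(ids)   # object array holding only None
labels = np.vectorize(returning)(ids)     # array of strings

print(nothing)
print(labels)
print(len(calls))  # may exceed len(ids) because of the extra probing call
```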

So you could, e.g., proceed as follows:

  1. Define your function as:

     def myFunct(col1, col2, hg):
         return f'{hg} / {col1} / {col2}'

    Don't use the word vectorize in the name of the function. For now it is an "ordinary" function. It will be vectorized in a moment.

  2. Create the vectorized function:

     vfunct = np.vectorize(myFunct, excluded=['hg'])

  3. And finally call it:

     vfunct(df.tweets_id, df.tokenized_text, '#Python')

The result, for my sample data, is:

array(['#Python / 101 / aaa bbb ccc ddd',
       '#Python / 102 / eee fff ggg hhh iii jjj',
       '#Python / 103 / kkk lll mmm nnn ooo ppp'], dtype='<U39')
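For reference, the three steps can be put together into one runnable sketch. The sample rows are an assumption, reconstructed from the output shown above, since the original belongs_node data is not given. Here hg is passed by keyword so that the name-based exclusion applies; with a scalar string, the positional call from step 3 gives the same values.

```python
import numpy as np
import pandas as pd

# Assumed sample data, reconstructed from the printed result above.
df = pd.DataFrame({
    'tweets_id': [101, 102, 103],
    'tokenized_text': ['aaa bbb ccc ddd',
                       'eee fff ggg hhh iii jjj',
                       'kkk lll mmm nnn ooo ppp'],
})

# Step 1: an ordinary function, one parameter per column plus the extra one.
def myFunct(col1, col2, hg):
    return f'{hg} / {col1} / {col2}'

# Step 2: vectorize it, excluding the parameter that is not a column.
vfunct = np.vectorize(myFunct, excluded=['hg'])

# Step 3: call it with the corresponding columns.
result = vfunct(df.tweets_id, df.tokenized_text, hg='#Python')
print(result)
```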

It is up to you what you do with this result. You may, e.g., set it as a new column of your source DataFrame.
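A minimal sketch of that last suggestion, assuming a two-row sample frame and a hypothetical column name, tagged; neither comes from the question.

```python
import numpy as np
import pandas as pd

# Assumed sample data for illustration only.
df = pd.DataFrame({
    'tweets_id': [101, 102],
    'tokenized_text': ['aaa bbb', 'ccc ddd'],
})

def myFunct(col1, col2, hg):
    return f'{hg} / {col1} / {col2}'

vfunct = np.vectorize(myFunct, excluded=['hg'])

# Store the returned array as a new column ('tagged' is a made-up name).
df['tagged'] = vfunct(df.tweets_id, df.tokenized_text, hg='#Python')
print(df)
```

The returned array has one element per row, in row order, so it can be assigned directly without any index alignment.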


 