In Python/Pandas, what is the most efficient way to apply a custom function to a column of a dataframe where the input includes strings?

I have a very large DataFrame in which one column contains numbers and another contains text. I want to create a third column from the number column, the text column, and a complex custom function, in the most efficient way.

According to this source, the most efficient way is NumPy vectorization.

(Below is simplified example code to show what I tried and where I am stuck. The actual custom function is quite complex, but it does take numerical columns and text columns as input. With this simplified code I want to understand how to apply functions that take strings as input to entire columns.)

This works flawlessly, so far so good:

import pandas as pd

def fun_test1(no1, no2):
    # Both arguments are whole NumPy arrays, so + adds them elementwise.
    res = no1 + no2
    return res

Test1 = pd.DataFrame({'no1':[1, 2, 3],
                      'no2':[1, 2, 3]})

Test1['result'] = fun_test1(Test1['no1'].values, Test1['no2'].values)

    no1 no2 result
0   1   1   2
1   2   2   4
2   3   3   6

This, however, does not work, and this is where I am stuck:

def fun_test2(no1, text):
    if text == 'one':
        no2 = 1
    elif text == 'two':
        no2 = 2
    elif text == 'three':
        no2 = 3
    res = no1 + no2
    return res

Test2 = pd.DataFrame({'no1':[1, 2, 3],
                      'text':['one', 'two', 'three']})

Test2['result'] = fun_test2(Test2['no1'].values, Test2['text'].values)

ValueError                                Traceback (most recent call last)
<ipython-input-30-a8f100d7d4bd> in <module>()
----> 1 Test2['result'] = fun_test2(Test2['no1'].values, Test2['text'].values)

<ipython-input-27-8347aa91d765> in fun_test2(no1, text)
      1 def fun_test2(no1, text):
----> 2     if text == 'one':
      3         no2 = 1
      4     elif text == 'two':
      5         no2 = 2

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I have tried more variations, but ultimately I cannot get NumPy vectorization to work with string inputs.

What am I doing wrong?

If NumPy vectorization does not work with strings, what would be the next most efficient method?
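
A minimal sketch of why the `if text == 'one':` line raises the error shown above: comparing a NumPy array with a string is done elementwise and yields a boolean array, and `if` cannot reduce an array of several values to a single True or False. The array `text` mirrors the example column; the names `mask` and `error_message` are introduced here only for illustration.

```python
import numpy as np

# Same content as Test2['text'].values in the example (object dtype).
text = np.array(['one', 'two', 'three'], dtype=object)

# Elementwise comparison: one boolean per row, not a single bool.
mask = text == 'one'
print(mask)  # [ True False False]

# `if mask:` asks for one truth value for the whole array, which is ambiguous.
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

So the function is written for one scalar row, while `.values` hands it the whole column at once. Either the function must be called per row, or the branching must be rewritten as array operations.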

def fun_test2(no1, text, idx):
    if text[idx] == 'one':
        no2 = 1
    elif text[idx] == 'two':
        no2 = 2
    elif text[idx] == 'three':
        no2 = 3
    res = no1[idx] + no2
    return res

Test2 = pd.DataFrame({'no1':[1, 2, 3],
                      'text':['one', 'two', 'three']})

Test2['result'] = [fun_test2(Test2['no1'].values, Test2['text'].values, i) for i in range(Test2.shape[0])]

Output:

>>> Test2
   no1   text  result
0    1    one       2
1    2    two       4
2    3  three       6

Or go back to the traditional way, which gives the same output:

def fun_test2(no1, text):
    if text == 'one':
        no2 = 1
    elif text == 'two':
        no2 = 2
    elif text == 'three':
        no2 = 3
    res = no1 + no2
    return res

Test2 = pd.DataFrame({'no1':[1, 2, 3],
                      'text':['one', 'two', 'three']})

Test2['result'] = [fun_test2(Test2['no1'].values[i], Test2['text'].values[i]) for i in range(Test2.shape[0])]
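
Both versions above still call the function once per row. When the string branch is a plain lookup, as in this simplified example, it can possibly be expressed as a column-wide operation instead. The sketch below assumes that is the case; the dictionary `word_to_number` and the intermediate `no2` Series are hypothetical names that stand in for the `if`/`elif` chain, and the real, more complex function may not reduce to a lookup this cleanly.

```python
import pandas as pd

Test2 = pd.DataFrame({'no1': [1, 2, 3],
                      'text': ['one', 'two', 'three']})

# Hypothetical lookup table replacing the if/elif chain in fun_test2.
word_to_number = {'one': 1, 'two': 2, 'three': 3}

# Series.map translates the whole text column in one call;
# strings missing from the dictionary would become NaN.
no2 = Test2['text'].map(word_to_number)

# Column-wise addition, no Python-level loop over rows.
Test2['result'] = Test2['no1'] + no2
print(Test2)
```

For branches that are not a simple lookup, `numpy.select` accepts a list of boolean conditions and a matching list of values, which covers the same `if`/`elif` shape on whole columns.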


