How to use numpy vectorization for multiple datasets, and then call a function?

I have a dataset that contains a name and a date. I need to compare it with other datasets that also have a name and a date, and call another function if the name appears in them. In the example I just mocked a return value, which would be assigned to a new column in the dataframe. But I couldn't find out how. Here's what I did so far (I need to use numpy vectorization):

def getName(name, date, df1, df2):
    if name  == df1['NAME'].values:
       return name
    if name  == df2['NAME'].values:
       return 'HEY'

df = pd.DataFrame({
    "NAME": ["JOE", "CHRIS", "AARON"],
    "DATE": [10, 20, 30]
})
df1 = pd.DataFrame({
    "NAME": ["JOE", "JASON", "GUS"],
    "DATE": [10, 20, 30]
})

df2 = pd.DataFrame({
    "NAME": ["STEPHEN", "CHRIS", "AARON"],
    "DATE": [10, 20, 30]
})

df['NAME_'] = getName(df['NAME'].values, df['DATE'].values, df1, df2)

The output should be:

df = 
NAME DATE NAME_
JOE   10   JOE
CHRIS 20   HEY
AARON 30   HEY

You are testing equality with the == operator, which does not do what you want here: name is compared with df1['NAME'].values element by element, so the result is a boolean array rather than a single True or False, and an if statement cannot use it (for an array with more than one element, NumPy raises a ValueError about an ambiguous truth value). I think you want to test whether name is in a column. You can do this with a construct like if name in df1['NAME'].values.
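
To make the difference concrete, here is a minimal sketch that uses only df1 from the question; the variable names elementwise and is_member are mine, chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "NAME": ["JOE", "JASON", "GUS"],
    "DATE": [10, 20, 30]
})

name = "JOE"

# == against an array compares element by element: one boolean per row.
elementwise = name == df1["NAME"].values

# `in` asks whether any element matches and gives back a single bool,
# which is what an if statement needs.
is_member = name in df1["NAME"].values

print(elementwise)
print(is_member)
```

Only the second form can be used directly as the condition of an if statement.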

But even if you fix the function, you can't call getName just once and get the result you are looking for. Typically, you would use apply so that the function is called for every row of df, for example df.apply(lambda row: getName(row['NAME'], row['DATE'], df1, df2), axis=1). But this isn't vectorization, as apply is a loop behind the scenes.
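
For reference, a row-by-row version might look like the sketch below. It assumes the membership fix described above and adds an explicit return None for names found in neither dataset, which the question does not specify. The date argument is passed through but unused, as in the question's mock.

```python
import pandas as pd

def getName(name, date, df1, df2):
    # `in` yields a single bool, so each branch is a valid if condition.
    if name in df1["NAME"].values:
        return name
    if name in df2["NAME"].values:
        return "HEY"
    return None  # assumption: no match means no value

df = pd.DataFrame({"NAME": ["JOE", "CHRIS", "AARON"], "DATE": [10, 20, 30]})
df1 = pd.DataFrame({"NAME": ["JOE", "JASON", "GUS"], "DATE": [10, 20, 30]})
df2 = pd.DataFrame({"NAME": ["STEPHEN", "CHRIS", "AARON"], "DATE": [10, 20, 30]})

# axis=1 hands each row to the lambda; this is a Python-level loop,
# not vectorization.
df["NAME_"] = df.apply(
    lambda row: getName(row["NAME"], row["DATE"], df1, df2), axis=1
)
print(df)
```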

So perhaps you could use join:

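# Label each lookup table with the value that should land in NAME_
df1['NAME_'] = df1['NAME']
df2['NAME_'] = 'HEY'
# Stack both lookups and index them by NAME (df1 and df2 share no names here)
df3 = pd.concat([df1, df2]).set_index('NAME')
df.join(df3['NAME_'], on='NAME', how='left')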

Output

    NAME  DATE NAME_
0    JOE    10   JOE
1  CHRIS    20   HEY
2  AARON    30   HEY
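
If you specifically want NumPy rather than a join, one possible alternative (not part of the answer above) is to build boolean masks with np.isin and fill the result with mask assignment. This is only a sketch: the names in_df1, in_df2 and result are illustrative, 'HEY' stands in for the mocked function result as in the question, and rows matching neither dataset are assumed to stay None.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NAME": ["JOE", "CHRIS", "AARON"], "DATE": [10, 20, 30]})
df1 = pd.DataFrame({"NAME": ["JOE", "JASON", "GUS"], "DATE": [10, 20, 30]})
df2 = pd.DataFrame({"NAME": ["STEPHEN", "CHRIS", "AARON"], "DATE": [10, 20, 30]})

names = df["NAME"].to_numpy(dtype=object)

# One boolean per row of df: is this name present in df1 / df2?
in_df1 = np.isin(names, df1["NAME"].to_numpy(dtype=object))
in_df2 = np.isin(names, df2["NAME"].to_numpy(dtype=object))

# Start with None everywhere (assumption: no match means no value).
result = np.full(names.shape, None, dtype=object)
# Assign df2 matches first so that df1 matches take priority,
# mirroring the order of the if statements in the question.
result[in_df2] = "HEY"
result[in_df1] = names[in_df1]

df["NAME_"] = result
print(df)
```

If the real function cannot itself operate on whole arrays, the masks at least let you call it only on the rows that matched.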
