Vectorized way to compare each element in a numpy array

Question

I was wondering whether there is a way to compare each element of a numpy array with every other element, regardless of index position. I often work with arrays that come from pandas dataframes, and I'd like to use the underlying numpy array for the comparison. I know I can do a fast elementwise comparison like this:

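import numpy as np
import pandas as pd
dfarr1 = pd.DataFrame(np.arange(0, 1000))
dfarr2 = pd.DataFrame(np.arange(1000, 0, -1))
# Flatten the (1000, 1) boolean mask to 1-D before using it to select rows
dfarr1.loc[(dfarr1.values == dfarr2.values).ravel()]
# outputs the single matching row: index 500, value 500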

(The above is just a toy example, obviously.) What I'd like instead is the equivalent of two nested loops over all the elements, but as fast as possible:

for ir in df.itertuples():
    for ir2 in country_df.itertuples():
        # ir[0] and ir2[0] are the row index labels
        if df.loc[ir[0], 'city'] == country_df.loc[ir2[0], 'Capital']:
            df.loc[ir[0], 'country'] = country_df.loc[ir2[0], 'Country']

The thing is that my dataframes contain many thousands of rows, and the above is simply too slow. I'm also sure I'll do similar operations in the future on different, similarly long dataframes, so solving this once and for all would be good. I've parsed a few thousand files and extracted their geodata (df in the above), and I have a fairly large file of cities and their corresponding countries to use as a lookup (country_df). I want to check whether the cities in df match those in the lookup and, if so, add the corresponding country in a new column of df, at the same row index as the parsed geodata. Anyway, this is just one example of what I need to do at (ideally much) higher speed than the above. Many thanks!

Answer 1

You can try this, shown here with two small sample dataframes:

import pandas as pd

df1 = pd.DataFrame({'city': ['New York City', 'Los Angeles', 'Paris', 'Berlin', 'Beijing'],
                    'country': [None, None, None, None, None]})

df2 = pd.DataFrame({'city': ['New York City', 'Paris', 'Berlin', 'Beijing', 'Los Angeles', 'Rome'],
                    'country': ['USA', 'France', 'Germany', 'China', 'USA', 'Italy']})

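Now index both dataframes by city and call fillna on df1's country column, passing df2's country series as the fill values. pandas aligns the two series on the city index, so each missing country is filled from the row with the same city: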

df1['country'] = df1.set_index('city')['country'].fillna(df2.set_index('city')['country'])\
                    .reset_index(drop=True)

print(df1)

    city          country
0   New York City  USA
1   Los Angeles    USA
2   Paris          France
3   Berlin         Germany
4   Beijing        China
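
If the lookup uses different column names, as in the question (city in df, Capital and Country in country_df), a similar result can probably be had with Series.map. The following is only a sketch under assumptions: the sample rows are made up, and duplicate capitals are dropped (keeping the first) because mapping with a Series needs a unique index.

```python
import pandas as pd

# Made-up sample data; column names follow the question.
df = pd.DataFrame({'city': ['New York City', 'Los Angeles', 'Paris', 'Berlin', 'Atlantis']})
country_df = pd.DataFrame({
    'Capital': ['Paris', 'Berlin', 'Rome'],
    'Country': ['France', 'Germany', 'Italy'],
})

# Build a city -> country lookup Series with a unique index.
lookup = (country_df.drop_duplicates(subset='Capital')
                    .set_index('Capital')['Country'])

# map() looks up each city in the index; cities without a match become NaN.
df['country'] = df['city'].map(lookup)
print(df)
```

This replaces the nested loops with one lookup per row of df, so the cost grows roughly with len(df) + len(country_df) rather than with their product. Rows whose city is not in the lookup are left as NaN instead of being skipped.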

Answer 1: accepted, 2021-03-07 16:36:31
