Using str.contains to create a new column, set to null (NaN) where the condition fails

Question

I am trying to create a new column in my pandas dataframe that holds a value only where another column contains a certain string.

My dataframe looks something like this:

    raw                                     val1    val2  
0   Vendor Invoice Numbe Inv Date                        
1   Vendor: Company Name 1                  123     456   
2   13445 07708-20-2019 US                  432     676   
3   79935 19028808-15-2019 US               444     234   
4   Vendor: company Name 2                  234     234

I am trying to create a new column, vendor, that transforms the dataframe into:

    raw                                     val1    val2  vendor
0   Vendor Invoice Numbe Inv Date                         Vendor Invoice Numbe Inv Date
1   Vendor: Company Name 1                  123     456   Vendor: Company Name 1 
2   13445 07708-20-2019 US                  432     676   NaN
3   79935 19028808-15-2019 US               444     234   NaN
4   Vendor: company Name 2                  234     234   company Name 2  
5   Vendor: company Name 2                  928     528   company Name 2

However, whenever I try

df['vendor'] = df.loc[df['raw'].str.contains('Vendor', na=False), 'raw']

I get the error

ValueError: cannot reindex from a duplicate axis

I know that the company value is the same at index 4 and 5, but what am I doing wrong, and how do I add the new column to my dataframe?

Answer 1

The problem is that df.loc[df['raw'].str.contains('Vendor', na=False), 'raw'] has a different length than df, so pandas has to align it with df on the index, and that alignment fails when index labels are duplicated.
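A minimal sketch of that situation, assuming the duplicate labels come from concatenating two frames without resetting the index (the question does not say how df was built, so part1, part2, and fixed are hypothetical names). It shows that the filtered Series is shorter than df, and that the original assignment works once the labels are unique:

```python
import pandas as pd

# Hypothetical data: two chunks concatenated without resetting the index,
# so the labels 0 and 1 each appear twice.
part1 = pd.DataFrame({"raw": ["Vendor: Company Name 1", "13445 07708-20-2019 US"]})
part2 = pd.DataFrame({"raw": ["Vendor: company Name 2", "79935 19028808-15-2019 US"]})
df = pd.concat([part1, part2])

mask = df["raw"].str.contains("Vendor", na=False)
filtered = df.loc[mask, "raw"]

print(df.index.is_unique)       # the index labels are not unique
print(len(filtered), len(df))   # the filtered Series is shorter than df

# With unique labels, the original assignment aligns on the index and
# leaves NaN in the rows that did not match.
fixed = df.reset_index(drop=True)
fixed["vendor"] = fixed.loc[fixed["raw"].str.contains("Vendor", na=False), "raw"]
print(fixed)
```

Resetting the index is only appropriate if the original labels carry no meaning you need to keep.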

You can try np.where, which builds a NumPy array of the same length as df. Assigning an array to a new column is positional, so it doesn't need index alignment.

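import numpy as np

# na=False treats missing values in 'raw' as non-matches; np.nan is the spelling that works across NumPy versions
df['vendor'] = np.where(df['raw'].str.contains('Vendor', na=False), df['raw'], np.nan)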
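As a self-contained illustration, here is a small sketch of the np.where approach on data modeled on the question. The duplicated index labels and the None entry are assumptions added to exercise the edge cases; they are not taken from the original frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "raw": [
            "Vendor Invoice Numbe Inv Date",
            "Vendor: Company Name 1",
            "13445 07708-20-2019 US",
            None,  # assumed missing value, not in the question's data
            "Vendor: company Name 2",
        ]
    },
    index=[0, 1, 2, 0, 1],  # assumed duplicate labels
)

# Boolean mask; na=False turns the missing entry into False.
mask = df["raw"].str.contains("Vendor", na=False)

# np.where returns a plain array, so the assignment is by position.
df["vendor"] = np.where(mask, df["raw"], np.nan)
print(df)
```

Note that this keeps the full matching string, including the "Vendor" prefix and the header row.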

Answer 2

You can use .extract() with a positive lookbehind to capture the part of the string that follows Vendor: .

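# \s (single backslash) matches the whitespace after "Vendor:"; expand=False returns a Series
df['vendor'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)', expand=False)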
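A short sketch of the extract approach on data modeled on the question; the duplicated index labels are an assumption added to show that the call does not depend on a unique index. Rows without "Vendor:" followed by whitespace, including the header row, come out as NaN, and the prefix itself is dropped.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "raw": [
            "Vendor Invoice Numbe Inv Date",
            "Vendor: Company Name 1",
            "13445 07708-20-2019 US",
            "Vendor: company Name 2",
        ]
    },
    index=[0, 1, 2, 1],  # assumed duplicate labels
)

# The lookbehind requires "Vendor:" plus one whitespace character
# immediately before the captured text.
df["vendor"] = df["raw"].str.extract(r"(?<=Vendor:\s)(.*)", expand=False)
print(df)
```

Unlike the np.where version, this yields only the company name, which matches rows 4 and 5 of the desired output in the question.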

Answer 1
1 Accepted 2019-11-25 18:17:31

Answer 2
0 2019-11-25 18:35:53
