Create a new column using str.contains and where the condition fails, set it to null (NaN)
I am trying to create a new column in my pandas dataframe that holds a value only if another column contains a certain string.
My dataframe looks something like this:
   raw                            val1  val2
0  Vendor Invoice Numbe Inv Date
1  Vendor: Company Name 1         123   456
2  13445 07708-20-2019 US         432   676
3  79935 19028808-15-2019 US      444   234
4  Vendor: company Name 2         234   234
I am trying to create a new column, vendor, that transforms the dataframe into:
   raw                            val1  val2  vendor
0  Vendor Invoice Numbe Inv Date              Vendor Invoice Numbe Inv Date
1  Vendor: Company Name 1         123   456   Vendor: Company Name 1
2  13445 07708-20-2019 US         432   676   NaN
3  79935 19028808-15-2019 US      444   234   NaN
4  Vendor: company Name 2         234   234   company Name 2
5  Vendor: company Name 2         928   528   company Name 2
However, whenever I try
df['vendor'] = df.loc[df['raw'].str.contains('Vendor', na=False), 'raw']
I get the error
ValueError: cannot reindex from a duplicate axis
I know that indexes 4 and 5 hold the same company value, but what am I doing wrong, and how do I add the new column to my dataframe?
The problem is that df.loc[df['raw'].str.contains('Vendor', na=False), 'raw'] has a different length than df, so pandas has to align it to df by index during the assignment. That alignment fails when the selected rows carry duplicate index labels.
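The failure can be reproduced on a small frame. This is only a sketch: the index [0, 1, 2, 2] is an assumption, since the question does not show how its index ended up with duplicates, and the exact wording of the error message varies between pandas versions. Rebuilding a unique index with reset_index is shown as one possible way around it, not as the fix proposed in this answer.

```python
import pandas as pd

# Hypothetical frame: label 2 appears twice in the index (assumed for illustration).
df = pd.DataFrame(
    {
        "raw": [
            "Vendor: Company Name 1",
            "13445 07708-20-2019 US",
            "Vendor: company Name 2",
            "Vendor: company Name 2",
        ]
    },
    index=[0, 1, 2, 2],
)

mask = df["raw"].str.contains("Vendor", na=False)
subset = df.loc[mask, "raw"]  # shorter than df, index labels [0, 2, 2]

try:
    # pandas must reindex `subset` to df.index, which fails on duplicate labels
    df["vendor"] = subset
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# With a unique index the same assignment aligns cleanly and fills NaN elsewhere.
fixed = df.reset_index(drop=True)
fixed["vendor"] = fixed.loc[mask.to_numpy(), "raw"]
print(fixed)
```

Note that reset_index(drop=True) discards the old labels, so it only fits when those labels carry no meaning.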
You can try np.where, which builds an np.array of the same size as df and assigns it as the new column, so no index alignment is needed.
import numpy as np
df['vendor'] = np.where(df['raw'].str.contains('Vendor', na=False), df['raw'], np.nan)
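As a self-contained sketch of the np.where approach, the snippet below uses a small frame whose duplicated index (an assumption made for illustration) would break the aligned assignment. It also shows Series.where as an alternative that is not part of the original answer: it keeps the values where the mask is True and returns a Series with the same index as df.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows based on the question; the duplicated label 2 is assumed.
df = pd.DataFrame(
    {
        "raw": [
            "Vendor: Company Name 1",
            "13445 07708-20-2019 US",
            "Vendor: company Name 2",
            "Vendor: company Name 2",
        ]
    },
    index=[0, 1, 2, 2],
)

mask = df["raw"].str.contains("Vendor", na=False)

# np.where returns a plain array of len(df), so it is assigned by position.
df["vendor"] = np.where(mask, df["raw"], np.nan)

# Alternative (not from the answer): Series.where keeps values where mask is True.
alt = df["raw"].where(mask)

print(df)
```

Both variants avoid reindexing: the first because an array has no index, the second because the result carries exactly the same index as df.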
Alternatively, you can .extract() the part of the string that follows Vendor: using a positive lookbehind:
df['vendor'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)')
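A minimal sketch of the lookbehind extraction on a few of the rows from the question. Here expand=False is an added choice so that a Series, rather than a one-column DataFrame, is returned. Rows without the literal "Vendor:" followed by whitespace, including the header-like row 0, come back as NaN.

```python
import pandas as pd

# Sample rows taken from the question's frame.
df = pd.DataFrame(
    {
        "raw": [
            "Vendor Invoice Numbe Inv Date",
            "Vendor: Company Name 1",
            "13445 07708-20-2019 US",
            "Vendor: company Name 2",
        ]
    }
)

# (?<=Vendor:\s) asserts that "Vendor:" plus one whitespace character precedes
# the match; (.*) then captures the rest of the string.
df["vendor"] = df["raw"].str.extract(r"(?<=Vendor:\s)(.*)", expand=False)

print(df)
```

Unlike the str.contains approaches, this keeps only the company name and drops the "Vendor: " prefix.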