I am trying to create a new column in my pandas dataframe, but only with a value if another column contains a certain string.
My dataframe looks something like this:
raw val1 val2
0 Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456
2 13445 07708-20-2019 US 432 676
3 79935 19028808-15-2019 US 444 234
4 Vendor: company Name 2 234 234
I am trying to create a new column, vendor, that transforms the dataframe into:
raw val1 val2 vendor
0 Vendor Invoice Numbe Inv Date Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456 Vendor: Company Name 1
2 13445 07708-20-2019 US 432 676 NaN
3 79935 19028808-15-2019 US 444 234 NaN
4 Vendor: company Name 2 234 234 company Name 2
5 Vendor: company Name 2 928 528 company Name 2
However, whenever I try,
df['vendor'] = df.loc[df['raw'].str.contains('Vendor', na=False), 'raw']
I get the error
ValueError: cannot reindex from a duplicate axis
I know that at index 4 and 5 it's the same value for the company, but what am I doing wrong and how do I add the new column to my dataframe?
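The problem is that df.loc[df['raw'].str.contains('Vendor', na=False), 'raw'] has a different length than df, so pandas has to align it to df by index label, and that alignment fails when the selected rows carry duplicate index labels.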
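A minimal sketch that reproduces the error. The sample frame and its repeated index label are assumptions for illustration (the question does not show the real index), and the exact wording of the message differs between pandas versions.

```python
import pandas as pd

# Hypothetical data: the last two "Vendor" rows share index label 2.
df = pd.DataFrame(
    {"raw": ["Vendor: Company Name 1", "13445 07708-20-2019 US",
             "Vendor: company Name 2", "Vendor: company Name 2"]},
    index=[0, 1, 2, 2],
)

# Only the matching rows are kept, so this Series is shorter than df.
subset = df.loc[df["raw"].str.contains("Vendor", na=False), "raw"]

error_message = None
try:
    # Assigning a Series makes pandas align it to df.index by label,
    # which is not possible when the Series index has duplicates.
    df["vendor"] = subset
except ValueError as exc:
    error_message = str(exc)

print(len(subset), len(df))
print(error_message)
```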
You can try np.where, which builds the new column from an np.array of the same size as df, so it doesn't need index alignment.
df['vendor'] = np.where(df['raw'].str.contains('Vendor', na=False), df['raw'], np.nan)
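A self-contained sketch of the np.where approach. The sample frame, including its duplicate index label, is made up for illustration. Because np.where returns a plain array, the values are written by position and the index is never consulted.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a repeated index label (2 appears twice).
df = pd.DataFrame(
    {"raw": ["Vendor: Company Name 1", "13445 07708-20-2019 US",
             "Vendor: company Name 2", "Vendor: company Name 2"],
     "val1": [123, 432, 234, 928]},
    index=[0, 1, 2, 2],
)

# Boolean mask as a NumPy array; na=False treats missing text as "no match".
mask = df["raw"].str.contains("Vendor", na=False).to_numpy()

# Keep the raw text where the mask is True, NaN everywhere else.
df["vendor"] = np.where(mask, df["raw"], np.nan)

print(df)
```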
You can .extract() the part of the string that follows Vendor: using a positive lookbehind:
df['vendor'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)', expand=False)
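A small sketch of the lookbehind extraction on made-up rows. Note that inside a raw string the whitespace class is written \s with a single backslash, and that expand=False is used here so the result is a Series rather than a one-column DataFrame. Rows that do not contain "Vendor: " come back as missing values.

```python
import pandas as pd

# Hypothetical sample rows modelled on the question.
df = pd.DataFrame(
    {"raw": ["Vendor: Company Name 1", "13445 07708-20-2019 US",
             "Vendor: company Name 2"]}
)

# (?<=Vendor:\s) requires "Vendor: " immediately before the match
# without including it in the captured group.
df["vendor"] = df["raw"].str.extract(r"(?<=Vendor:\s)(.*)", expand=False)

print(df)
```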