How to split a DataFrame column into two parts and replace the column with the split values
How do I split a DataFrame column in two, taking into account that sometimes there is nothing to split and the value belongs to the second column?
Imagine I have:
COLUMN A
0 00000-UNITED STATES
1 01000-ALABAMA
2 01001-Autauga County, AL
3 01003-Baldwin County, AL
4 Barbour County, AL
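I want to split the values into two columns, but make sure that a value like the one in the last row, which is only a string, goes to the second column. If the value is an int, it should go to the first column. For example, with a string:
  COLUMN B  COLUMN C
0 00000     UNITED STATES
1 01000     ALABAMA
2 01001     Autauga County, AL
3 01003     Baldwin County, AL
4           Barbour County, AL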
I tried this:
df[['B','C']] = df.A.str.split(" - ", n = 1, expand=True)
It obviously returned this:
  COLUMN B            COLUMN C
0 00000               UNITED STATES
1 01000               ALABAMA
2 01001               Autauga County, AL
3 01003               Baldwin County, AL
4 Barbour County, AL
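The result above follows from how str.split with expand=True works: a row without the separator keeps its whole value in the first column and gets a missing value in the second. A minimal sketch of patching the split attempt, assuming the separator is a bare hyphen as in the sample data and that at least one row contains it (otherwise only one column comes back); the names parts and no_sep are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'A': ['01000-ALABAMA', 'Barbour County, AL']})

# Split on the first hyphen only; rows without a hyphen get a missing value in C.
parts = df['A'].str.split('-', n=1, expand=True)
parts.columns = ['B', 'C']

# Where nothing was split, move the value from B to C and blank out B.
no_sep = parts['C'].isna()
parts.loc[no_sep, 'C'] = parts.loc[no_sep, 'B']
parts.loc[no_sep, 'B'] = ''

print(parts)
```

This keeps the split-based approach and only repairs the rows that had nothing to split.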
Try using str.extract with a regular expression in which the second capture group (the value after the "-") is optional:
df[['B', 'C']] = df['A'].str.extract(r"(\d+$|\d+(?=\s*-))?(?:\s*-\s*)?(.+)?")
   A                         B       C
0  00000-UNITED STATES       00000   UNITED STATES
1  01000-ALABAMA             01000   ALABAMA
2  01001-Autauga County, AL  01001   Autauga County, AL
3  01003-Baldwin County, AL  01003   Baldwin County, AL
4  Barbour County, AL        NaN     Barbour County, AL
5  10234                     10234   NaN
6  32 Alabama                NaN     32 Alabama
7  432423 - state            432423  state
Full code:
import pandas as pd
df = pd.DataFrame({
    'A': ['00000-UNITED STATES', '01000-ALABAMA',
          '01001-Autauga County, AL', '01003-Baldwin County, AL',
          'Barbour County, AL', '10234', '32 Alabama', '432423 - state']
})
df[['B', 'C']] = df['A'].str.extract(r"(\d+$|\d+(?=\s*-))?(?:\s*-\s*)?(.+)?")
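To make the pattern easier to read, the same regular expression can be written in verbose form with named groups. This is only a sketch of the pattern above, not a different method; the constant name PATTERN and the shortened sample data are assumptions for illustration. With named groups, str.extract names the resulting columns after the groups.

```python
import re
import pandas as pd

# Same pattern as above, spread out with re.VERBOSE so each part can be commented.
PATTERN = r"""
    (?P<B>\d+$|\d+(?=\s*-))?   # optional code: digits ending the string, or digits followed by a hyphen
    (?:\s*-\s*)?               # optional hyphen separator with surrounding whitespace (not captured)
    (?P<C>.+)?                 # optional name: whatever remains
"""

df = pd.DataFrame({'A': ['01000-ALABAMA', 'Barbour County, AL', '10234']})

# Named groups become the column names B and C in the result.
parts = df['A'].str.extract(PATTERN, flags=re.VERBOSE)
print(parts)
```

Because every part of the pattern is optional, a row with only a name fills C, and a row with only digits fills B.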
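You can create two functions that extract the required elements from COLUMN A and assign them to COLUMN B and COLUMN C:
def get_col_b(item):
    # Text before the first hyphen, or '' when there is no hyphen.
    if '-' in item:
        return item.split('-', 1)[0]
    else:
        return ''

def get_col_c(item):
    # Text after the first hyphen (later hyphens are kept),
    # or the whole value when there is no hyphen.
    if '-' in item:
        return item.split('-', 1)[1]
    else:
        return item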
Create the columns, then drop COLUMN A:
df['COLUMN B'] = df['COLUMN A'].apply(get_col_b)
df['COLUMN C'] = df['COLUMN A'].apply(get_col_c)
cols = ['COLUMN B', 'COLUMN C']
df = df[cols]
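The two helpers above can also be folded into one that returns both parts at once. A minimal sketch, assuming the same 'COLUMN A' data as in the question; split_code_name is a hypothetical name, and str.partition is swapped in for split so that only the first hyphen is used:

```python
import pandas as pd

def split_code_name(item):
    """Hypothetical helper: return (code, name); code is '' when there is no hyphen."""
    left, sep, right = item.partition('-')
    if sep:
        return left.strip(), right.strip()
    return '', item.strip()

df = pd.DataFrame({'COLUMN A': [
    '00000-UNITED STATES', '01000-ALABAMA',
    '01001-Autauga County, AL', '01003-Baldwin County, AL',
    'Barbour County, AL',
]})

# Build the two new columns in one pass over COLUMN A.
pairs = [split_code_name(value) for value in df['COLUMN A']]
result = pd.DataFrame(pairs, columns=['COLUMN B', 'COLUMN C'], index=df.index)
print(result)
```

Like the two-function version, this sends a digits-only value with no hyphen to COLUMN C, so it fits data where unsplit rows are always names.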