Pandas split name column into first and last name if contains one space
Let's say I have a pandas DataFrame containing names like so:
import pandas as pd
name_df = pd.DataFrame({'name':['Jack Fine','Kim Q. Danger','Jane Smith', 'Juan de la Cruz']})
name
0 Jack Fine
1 Kim Q. Danger
2 Jane Smith
3 Juan de la Cruz
and I want to split the name column into first_name and last_name if there is exactly one space in the name. Otherwise I want the full name to be shoved into first_name.
So the final DataFrame should look like:
        first_name last_name
0             Jack      Fine
1    Kim Q. Danger
2             Jane     Smith
3  Juan de la Cruz
I've tried to accomplish this by first applying the following function to return names that can be split into first and last name:
import re

def validate_single_space_name(name: str) -> str:
    pattern = re.compile(r'^.*( ){1}.*$')
    match_obj = re.match(pattern, name)
    if match_obj:
        return name
    else:
        return None
However, applying this function to my original name_df leads to an empty DataFrame, not one populated by the names that can be split and None values.
Help getting my current approach to work, or solutions involving a different approach, would be appreciated!
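A side note on the regular expression above: `^.*( ){1}.*$` matches any string that contains at least one space, so 'Kim Q. Danger' would pass the check as well. Below is a minimal sketch of a stricter check, assuming "one space" means exactly one literal space character. The helper name `has_exactly_one_space` is made up for illustration and is not part of the question.

```python
import pandas as pd

name_df = pd.DataFrame(
    {'name': ['Jack Fine', 'Kim Q. Danger', 'Jane Smith', 'Juan de la Cruz']}
)

def has_exactly_one_space(name: str) -> bool:
    # Hypothetical helper: True only when the name holds exactly one space.
    return name.count(' ') == 1

# Apply element-wise to the column (a Series), not to the whole DataFrame.
single_space = name_df['name'].apply(has_exactly_one_space)

# Keep the splittable names; every other row becomes a missing value.
splittable = name_df['name'].where(single_space)
```

Returning a boolean instead of the name or None keeps the result usable directly as a mask.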
You can use str.split to split the strings, then test the number of splits using str.len and use this as a boolean mask to assign just those rows with the last component of the split:
In [33]:
df.loc[df['name'].str.split().str.len() == 2, 'last name'] = df['name'].str.split().str[-1]
df
Out[33]:
              name last name
0        Jack Fine      Fine
1    Kim Q. Danger       NaN
2       Jane Smith     Smith
3  Juan de la Cruz       NaN
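The same idea can be sketched as a self-contained script that splits only once and keeps the mask in a variable. The names `parts` and `mask` are just illustrative. Note that str.split() with no arguments splits on any run of whitespace, so this counts tokens rather than literal space characters.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Jack Fine', 'Kim Q. Danger', 'Jane Smith', 'Juan de la Cruz']}
)

parts = df['name'].str.split()   # Series of token lists
mask = parts.str.len() == 2      # True where the name has exactly two tokens

# Assign the last token only on the masked rows; other rows stay NaN.
df.loc[mask, 'last name'] = parts[mask].str[-1]
```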
EDIT
You can call split with the param expand=True; this will only populate the rows where the name consists of exactly 2 parts:
In [16]:
name_df[['first_name','last_name']] = name_df['name'].loc[name_df['name'].str.split().str.len() == 2].str.split(expand=True)
name_df
Out[16]:
              name first_name last_name
0        Jack Fine       Jack      Fine
1    Kim Q. Danger        NaN       NaN
2       Jane Smith       Jane     Smith
3  Juan de la Cruz        NaN       NaN
You can then replace the missing first names using fillna:
In [17]:
name_df['first_name'] = name_df['first_name'].fillna(name_df['name'])
name_df
Out[17]:
              name       first_name last_name
0        Jack Fine             Jack      Fine
1    Kim Q. Danger    Kim Q. Danger       NaN
2       Jane Smith             Jane     Smith
3  Juan de la Cruz  Juan de la Cruz       NaN
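For reference, the split and the fallback can also be combined in one pass with Series.where. This is only a sketch of an alternative, not the method used above; the names `parts`, `is_two` and `result` are made up for the example.

```python
import pandas as pd

name_df = pd.DataFrame(
    {'name': ['Jack Fine', 'Kim Q. Danger', 'Jane Smith', 'Juan de la Cruz']}
)

parts = name_df['name'].str.split()   # Series of token lists
is_two = parts.str.len() == 2         # True where the name has exactly two tokens

# Keep the full name unless it has exactly two tokens; then use the first token.
name_df['first_name'] = name_df['name'].where(~is_two, parts.str[0])

# Use the last token only for two-token names; other rows stay missing.
name_df['last_name'] = parts.str[-1].where(is_two)

result = name_df[['first_name', 'last_name']]
```

Series.where keeps values where the condition is True and substitutes the second argument (or a missing value) elsewhere, which is why the first call uses the negated mask.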
I was having some issues with IndexError: list index out of range because the names could be test, kk and other weird user input. So I ended up with something like this:
items['fullNameSplitLength'] = items['fullName'].str.split().str.len()
items['firstName'] = items['lastName'] = ''
items.loc[
    items['fullNameSplitLength'] >= 1,
    'firstName'
] = items.loc[items['fullNameSplitLength'] >= 1]['fullName'].str.split().str[0]
items.loc[
    items['fullNameSplitLength'] >= 2,
    'lastName'
] = items.loc[items['fullNameSplitLength'] >= 2]['fullName'].str.split().str[-1]
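A more compact variant of the same logic might look like the sketch below. It assumes a made-up `items` frame with a few awkward inputs (single words and an empty string) and splits only once; `tokens` and `n` are illustrative names. Indexing through the .str accessor yields a missing value, rather than raising, when a token list is too short.

```python
import pandas as pd

# Hypothetical sample data, including single-word and empty names.
items = pd.DataFrame(
    {'fullName': ['Jack Fine', 'test', 'kk', '', 'Juan de la Cruz']}
)

tokens = items['fullName'].str.split()   # Series of token lists
n = tokens.str.len()                     # number of tokens per row

# First token when there is at least one, otherwise an empty string.
items['firstName'] = tokens.str[0].where(n >= 1, '')

# Last token when there are at least two, otherwise an empty string.
items['lastName'] = tokens.str[-1].where(n >= 2, '')
```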