
python pandas: separate a string column in two by whitespace

I have a python pandas dataframe df with the following column "title":

title
This is the first title XY2547
This is the second title WWW48921
This is the third title  A2438999
This is another title 123 
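
To reproduce the frame, something like the following should work. The trailing space on the last row is copied from the listing above and may just be a paste artifact, so treat it as an assumption.

```python
import pandas as pd

# Sample data copied from the listing above.
df = pd.DataFrame({
    "title": [
        "This is the first title XY2547",
        "This is the second title WWW48921",
        "This is the third title  A2438999",   # two spaces before the code
        "This is another title 123 ",          # trailing space (assumed)
    ]
})
print(df)
```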

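I need to separate this column in two: the actual title and the irregular code at the end. Is there a way to split it on the last word after whitespace? Please note that the last title has no code, and 123 is part of the title.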

Desired final df

title                             |  cleaned title            | code
This is the first title XY2547       This is the first title    XY2547
This is the second title WWW48921    This is the second title   WWW48921
This is the third title  A2438999    This is the third title    A2438999
This is another title 123            This is another title 123

I was thinking of something like

df['code'] = df.title.str.extract(r'_\s(\w)', expand=False)

This does not work.
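
For reference, a small check of why that pattern returns nothing: it requires a literal underscore before the whitespace, and `(\w)` captures only a single character. The second pattern below is only an illustrative alternative that anchors on the end of the string. It still treats 123 as a code, so it does not solve the problem by itself.

```python
import pandas as pd

titles = pd.Series([
    "This is the first title XY2547",
    "This is another title 123",
])

# Original attempt: needs a literal "_" before the whitespace, so no row matches.
attempt = titles.str.extract(r'_\s(\w)', expand=False)

# Illustrative alternative: capture the whole last word at the end of the string.
last_word = titles.str.extract(r'\s(\w+)$', expand=False)

print(attempt)
print(last_word)
```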

Thanks

Try this:

In [62]: df
Out[62]:
                               title
0     This is the first title XY2547
1  This is the second title WWW48921
2  This is the third title  A2438999
3         This is another title 123

In [63]: df[['cleaned_title', 'code']] = \
    ...:     df.title.str.extract(r'(.*?)\s+([A-Z]{1,}\d{3,})?$', expand=True)

In [64]: df
Out[64]:
                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2  This is the third title  A2438999    This is the third title  A2438999
3         This is another title 123   This is another title 123       NaN
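
One caveat worth checking on your own data: because `\s+` requires at least one whitespace character, a title that has no code and no trailing whitespace will not match at all, and both columns come back as NaN. The sketch below compares that pattern with a variant using `\s*`. The sample rows, with the trailing space removed, are an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"title": [
    "This is the first title XY2547",
    "This is the third title  A2438999",
    "This is another title 123",   # no trailing space here, by assumption
]})

# Pattern from the session above: whitespace before the optional code is mandatory.
strict = df.title.str.extract(r'(.*?)\s+([A-Z]{1,}\d{3,})?$', expand=True)

# Variant: whitespace is optional, so a code-less title still matches.
relaxed = df.title.str.extract(r'(.*?)\s*([A-Z]{1,}\d{3,})?$', expand=True)

print(strict)
print(relaxed)
```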

Solution #1

str.rsplit can be used here. It splits at most n times, starting from the right side of the string.

We can then join the result to df

df.join(
    df.title.str.rsplit(n=1, expand=True).rename(
        columns={0: 'cleaned title', 1: 'code'}
    )
)

                               title             cleaned title      code
0     This is the first title XY2547   This is the first title    XY2547
1  This is the second title WWW48921  This is the second title  WWW48921
2  This is the third title  A2438999   This is the third title  A2438999
3         This is another title 123      This is another title       123
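
If the last word should only count as a code when it looks like one, the split can be filtered afterwards. This is a rough sketch, assuming a code is one or more uppercase letters followed by at least three digits. That rule is a guess, not something stated in the question.

```python
import pandas as pd

df = pd.DataFrame({"title": [
    "This is the first title XY2547",
    "This is another title 123",
]})

# Split once from the right, as in the snippet above.
parts = df.title.str.rsplit(n=1, expand=True)
parts.columns = ["cleaned title", "code"]

# Hypothetical rule: uppercase letters followed by 3+ digits, nothing else.
is_code = parts["code"].str.match(r'[A-Z]+\d{3,}$', na=False)

# Where the last word is not a code, restore the full title and blank the code.
parts["cleaned title"] = parts["cleaned title"].where(is_code, df.title.str.strip())
parts["code"] = parts["code"].where(is_code)

result = df.join(parts)
print(result)
```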

Solution #2

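To avoid interpreting 123 as a code, you have to apply some additional logic that you did not provide. @MaxU was kind enough to embed his logic in a regex.

My regex solution looks like this.
Plan

  • Use '?P<name>' to name the resulting columns
  • Match only capital letters and digits with '[A-Z0-9]'
  • Make sure there are 4 or more of them with '{4,}'
  • Match from the beginning '^' to the end '$'
  • Make sure '.*' is not greedy by using '.*?'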

regex = r'^(?P<cleaned_title>.*?)\s*(?P<code>[A-Z0-9]{4,})?$'
df.join(df.title.str.extract(regex, expand=True))

                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2  This is the third title  A2438999    This is the third title  A2438999
3          This is another title 123  This is another title 123       NaN
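
The named groups can also be checked outside pandas with the standard `re` module. This is a minimal sketch of the same pattern. The last test string is made up to show one limitation: a title ending in an all-caps word of four or more letters gets that word read as a code.

```python
import re

# Same pattern as above, written as a raw string.
pattern = re.compile(r'^(?P<cleaned_title>.*?)\s*(?P<code>[A-Z0-9]{4,})?$')

with_code = pattern.match("This is the third title  A2438999").groupdict()
without_code = pattern.match("This is another title 123").groupdict()

# Hypothetical title: the trailing all-caps word is picked up as a code.
caveat = pattern.match("Report from NASA").groupdict()

print(with_code)
print(without_code)
print(caveat)
```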

