How to split long strings in pandas columns by punctuation

I have a df that looks like this:

words                                              col_a   col_b  
I guess, because I have thought over that. Um,       1       0 
That? yeah.                                          1       1
I don't always think you're up to something.         0       1                                                       

I want to split df.words into separate rows wherever a punctuation character (. , ? : ;) is present. However, I want to preserve the col_a and col_b values from the original row for each new row. For example, the above df should look like this:

words                                              col_a   col_b  
I guess,                                             1       0
because I have thought over that.                    1       0
Um,                                                  1       0 
That?                                                1       1
yeah.                                                1       1
I don't always think you're up to something.         0       1

One way is to use str.findall with the pattern (.*?[.,?!:;]) to match each of these punctuation signs together with the characters that precede it (non-greedy), and then explode the resulting lists:

(df.assign(words=df.words.str.findall(r'(.*?[.,?!:;])'))
   .explode('words')
   .reset_index(drop=True))

                                          words  col_a  col_b
0                                      I guess,      1      0
1             because I have thought over that.      1      0
2                                           Um,      1      0
3                                         That?      1      1
4                                         yeah.      1      1
5  I don't always think you're up to something.      0      1
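
As a self-contained variation on the approach above, here is a minimal sketch that rebuilds the example df and wraps the findall/explode steps in a helper. The function name split_on_punctuation, its parameters, and the MARKS constant are hypothetical and not part of the original answer. The sketch assumes two things the answer does not state: that surrounding whitespace should be stripped from each fragment (with the original pattern, fragments after the first keep their leading space), and that trailing text with no closing punctuation should be kept rather than dropped.

```python
import re
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah.",
        "I don't always think you're up to something.",
    ],
    "col_a": [1, 1, 0],
    "col_b": [0, 1, 1],
})

MARKS = ".,?!:;"  # assumption: same character set as the answer's pattern


def split_on_punctuation(frame, column="words", marks=MARKS):
    """Hypothetical helper: one row per punctuation-delimited fragment."""
    cls = re.escape(marks)
    # A run of non-punctuation, then any punctuation that closes it.
    # Unlike (.*?[...]), this also matches a final fragment that has
    # no closing punctuation mark.
    pattern = rf"[^{cls}]+[{cls}]*"
    out = (frame.assign(**{column: frame[column].str.findall(pattern)})
                .explode(column)          # one row per list element
                .reset_index(drop=True))
    # Fragments after the first start with the space that followed
    # the previous punctuation mark, so strip it.
    out[column] = out[column].str.strip()
    # Drop rows that ended up missing or empty (for example, whitespace
    # after the last punctuation mark).
    keep = out[column].notna() & out[column].ne("")
    return out[keep].reset_index(drop=True)


result = split_on_punctuation(df)
print(result)
```

Because explode repeats the other columns for every list element, col_a and col_b carry over to each new row without extra work. Note that a pattern like this also splits inside tokens such as decimal numbers or abbreviations, so it may need adjusting for real text.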
