How to split long strings in pandas columns by punctuation
I have a df that looks like this:
words col_a col_b
I guess, because I have thought over that. Um, 1 0
That? yeah. 1 1
I don't always think you're up to something. 0 1
I want to split df.words into separate rows wherever a punctuation character (.,?!:;) is present. However, I want to preserve the col_a and col_b values from the original row for each new row. For example, the above df should look like this:
words col_a col_b
I guess, 1 0
because I have thought over that. 1 0
Um, 1 0
That? 1 1
yeah. 1 1
I don't always think you're up to something. 0 1
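For anyone who wants to reproduce the problem, the sample frame can be built along these lines. This is only a sketch: the integer dtype of col_a and col_b is assumed from the printed values.

```python
import pandas as pd

# Sample data copied from the question; integer dtypes are an assumption.
df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah.",
        "I don't always think you're up to something.",
    ],
    "col_a": [1, 1, 0],
    "col_b": [0, 1, 1],
})
print(df)
```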
One way is to use str.findall with the pattern (.*?[.,?!:;]) to match any of these punctuation signs together with the characters that precede it (non-greedy), and then explode the resulting lists:
(df.assign(words=df.words.str.findall(r'(.*?[.,?!:;])'))
.explode('words')
.reset_index(drop=True))
words col_a col_b
0 I guess, 1 0
1 because I have thought over that. 1 0
2 Um, 1 0
3 That? 1 1
4 yeah. 1 1
5 I don't always think you're up to something. 0 1
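Two caveats may be worth keeping in mind. The findall pattern only returns pieces that end in one of the listed punctuation characters, so any trailing text without punctuation is dropped, and each piece after the first keeps its leading space. The sketch below is not part of the original answer. It shows a possible alternative that splits on the whitespace following a punctuation character, using str.split with a lookbehind. The second row is a hypothetical example added to illustrate the trailing-text case.

```python
import pandas as pd

df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah and so on",  # hypothetical row: the tail has no punctuation
    ],
    "col_a": [1, 1],
    "col_b": [0, 1],
})

# findall keeps only the pieces that end in punctuation,
# so "yeah and so on" does not appear in the second list.
found = df["words"].str.findall(r"(.*?[.,?!:;])")

# Alternative: split on whitespace that is preceded by a punctuation character.
# The lookbehind keeps the punctuation attached to the piece before it.
out = (
    df.assign(words=df["words"].str.split(r"(?<=[.,?!:;])\s+"))
      .explode("words")
      .reset_index(drop=True)
)
print(out)
```

If the findall approach is preferred, calling .str.strip() on the words column after explode should remove the leading spaces.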