How do I split a column of strings in a DataFrame into multiple rows of sub-texts?

Question

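I have a pandas DataFrame with two columns. One column contains spam email strings that exceed the maximum sequence length of the transformer model I plan to use on them; the other contains the label for each string. I want to split each long string into multiple sub-texts, one per row, while keeping each sub-text paired with its original label.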

Input DataFrame:

Text                                              Label
"This is a very long spam email"                  1
"This is a very long normal email"                0

Expected output:

Maximum Sequence Length = 4

Text                                              Label
"This is a very"                                  1
"long spam email"                                 1
"This is a very"                                  0
"long normal email"                               0

How can I do this?

Answer 1

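You can use the .split() method to turn the string into a list of words, then use slicing ([ ]) with the .join() method to turn the first four elements of the list back into a string. Here is my code; if you need to handle longer strings, you can add a for loop: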

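def convert(string):
    # Split on spaces into a list of words.
    nlist = string.split(' ')
    # Take the first four words, and everything after them.
    nlist1 = nlist[:4]
    nlist2 = nlist[4:]
    # Join each part back into a string.
    nstring1 = " ".join(nlist1)
    nstring2 = " ".join(nlist2)
    return nstring1, nstring2

x = "This is a very long spam email"
print(convert(x))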
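
The for loop mentioned above could look roughly like the following. This is only a sketch: the name split_into_chunks and the max_words parameter are made up for illustration, and it assumes the length limit is counted in whitespace-separated words.

```python
def split_into_chunks(text, max_words=4):
    """Split `text` into consecutive pieces of at most `max_words` words each."""
    words = text.split()
    chunks = []
    # Step through the word list in strides of `max_words`.
    for start in range(0, len(words), max_words):
        chunks.append(" ".join(words[start:start + max_words]))
    return chunks

x = "This is a very long spam email"
print(split_into_chunks(x))  # ['This is a very', 'long spam email']
```

Unlike convert, this returns a list with as many pieces as the string needs, and an empty list for an empty string. Keep in mind that transformer tokenizers usually count subword tokens rather than words, so a word-based limit is only an approximation.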

Answer 2

Data:

>>> import pandas as pd
>>> df = pd.DataFrame({"Text" : ["This is a very very very very long spam spam span email email", "This is a a a a very long long long normal email"],
...               "Label" : [1,0]})
>>> print(df.to_string())
                                                            Text  Label
0  This is a very very very very long spam spam span email email      1
1               This is a a a a very long long long normal email      0

Solution:

>>> import functools, operator
# Break the Text column into sublists; each sublist contains at most 4 words.
>>> t = df.apply(lambda x: x.Text.split(), axis=1).apply(lambda x: [x[i * 4:(i + 1) * 4] for i in range((len(x) + 4 - 1) // 4)])
>>> df['t'] = t
# Repeat each label once per sublist of its text.
>>> l = df.apply(lambda x: [x.Label] * len(x.t), axis=1)
# Flatten the nested lists and build a new DataFrame from them.
>>> df = pd.DataFrame({"Text" : functools.reduce(operator.iconcat, df.t.to_list(), []),
...               "Label" : functools.reduce(operator.iconcat, l.to_list(), [])})
>>> df['Text'] = df['Text'].apply(' '.join)
>>> df
    Text                    Label
0   This is a very          1
1   very very very long     1
2   spam spam span email    1
3   email                   1
4   This is a a             0
5   a a very long           0
6   long long normal email  0
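
As an alternative to flattening the lists by hand, the same result can probably be reached with DataFrame.explode (available since pandas 0.25). This is a sketch, not part of the answer above: chunk_words and MAX_WORDS are names made up for illustration, and the limit is again assumed to be counted in whitespace-separated words.

```python
import pandas as pd

MAX_WORDS = 4  # assumed maximum sequence length, counted in words


def chunk_words(text, size=MAX_WORDS):
    """Return a list of sub-texts, each holding at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


df = pd.DataFrame({
    "Text": ["This is a very long spam email", "This is a very long normal email"],
    "Label": [1, 0],
})

# Replace each text with its list of chunks, then give every chunk its own row.
# explode() repeats the other columns (here Label) on each new row.
result = (
    df.assign(Text=df["Text"].apply(chunk_words))
      .explode("Text")
      .reset_index(drop=True)
)
print(result)
```

With the question's input this yields four rows, with each sub-text keeping the label of the string it came from. Note that a row whose text is empty produces an empty list, which explode turns into a missing value, so such rows may need to be dropped afterwards.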

Answer 1
0 2021-10-30 09:03:25

Answer 2
0 2021-10-30 11:48:36
