
How to split a column of strings in a dataframe into multiple rows of subtexts?

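I have a pandas dataframe with two columns. One column contains spam-email strings that exceed the maximum sequence length of the transformer model I plan to use on them; the other contains the label for each string. I want to split each long string into multiple subtexts on separate rows while keeping each subtext paired with its label.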

Input dataframe:

Text                                              Label
"This is a very long spam email"                  1
"This is a very long normal email"                0

Desired output:

Maximum Sequence Length = 4

Text                                              Label
"This is a very"                                  1
"long spam email"                                 1
"This is a very"                                  0
"long normal email"                               0

How can I do this?

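You can use the .split() method to turn the string into a list of words, then use list slicing ([ ]) and the .join() method to turn the first four elements back into a string. Here is my code; if you need to handle longer strings, you can add a for loop: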

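def convert(string):
    # Split the string into words on single spaces.
    nlist = string.split(' ')
    # The first four words form the first piece; the rest form the second.
    nlist1 = nlist[:4]
    nlist2 = nlist[4:]
    nstring1 = " ".join(nlist1)
    nstring2 = " ".join(nlist2)
    return nstring1, nstring2

x = "This is a very long spam email"
print(convert(x))  # ('This is a very', 'long spam email')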
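
The for loop mentioned above could look something like the sketch below. The name chunk_words and the max_words parameter are made up for illustration and are not part of the answer's code. It assumes "sequence length" is counted in whitespace-separated words, as in the question's example, rather than in tokenizer tokens.

```python
def chunk_words(text, max_words=4):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    pieces = []
    # Step through the word list max_words at a time and rejoin each slice.
    for start in range(0, len(words), max_words):
        pieces.append(" ".join(words[start:start + max_words]))
    return pieces

x = "This is a very long spam email"
print(chunk_words(x))  # ['This is a very', 'long spam email']
```

Unlike convert(), this returns a list with as many pieces as the string needs, so the last piece may be shorter than max_words.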

Data:

>>> import pandas as pd
>>> df = pd.DataFrame({"Text" : ["This is a very very very very long spam spam span email email", "This is a a a a very long long long normal email"],
...               "Label" : [1,0]})
>>> print(df.to_string())
                                                            Text  Label
0  This is a very very very very long spam spam span email email      1
1               This is a a a a very long long long normal email      0

Solution:

>>> import functools, operator
# Break the Text column into sublists, each containing at most 4 words.
>>> t = df.apply(lambda x: x.Text.split(), axis=1).apply(lambda x: [x[i * 4:(i + 1) * 4] for i in range((len(x) + 4 - 1) // 4)])
>>> df['t'] = t
# Repeat each label once per sublist so labels stay aligned with their chunks.
>>> l = df.apply(lambda x: [x.Label] * len(x.t), axis=1)
# Flatten the lists of lists and build a new dataframe from them.
>>> df = pd.DataFrame({"Text" : functools.reduce(operator.iconcat, df.t.to_list(), []),
...               "Label" : functools.reduce(operator.iconcat, l.to_list(), [])})
>>> df['Text'] = df['Text'].apply(' '.join)
>>> df
    Text                    Label
0   This is a very          1
1   very very very long     1
2   spam spam span email    1
3   email                   1
4   This is a a             0
5   a a very long           0
6   long long normal email  0
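
As a possible alternative, the same reshaping can probably be done more compactly with DataFrame.explode, which repeats the other columns for every element of a list-valued column. This is only a sketch, not the approach used above: the function name split_long_rows and its parameters are hypothetical, and it again counts length in whitespace-separated words.

```python
import pandas as pd

def split_long_rows(df, max_words=4, text_col="Text"):
    """Return a copy of df where each text is split into chunks of at most
    max_words words, one chunk per row, with the other columns repeated."""
    def chunk(text):
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]

    out = df.copy()
    # Replace each string with its list of chunks ...
    out[text_col] = out[text_col].apply(chunk)
    # ... then give every chunk its own row; Label is carried along.
    # Note: a row whose text is empty would become NaN after explode.
    return out.explode(text_col).reset_index(drop=True)

# Usage with the dataframe from the question.
question_df = pd.DataFrame({
    "Text": ["This is a very long spam email", "This is a very long normal email"],
    "Label": [1, 0],
})
result = split_long_rows(question_df, max_words=4)
print(result)
```

Because explode keeps the remaining columns aligned automatically, there is no need to build and flatten a separate list of labels.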
