
Pythonic way to create dataset for multilabel text classification

I have a text dataset that looks like this.

import pandas as pd
df = pd.DataFrame({'Sentence': ['Hello World',
                                'The quick brown fox jumps over the lazy dog.',
                                'Just some text to make third sentence!'
                               ],
                   'label': ['greetings',
                             'dog,fox',
                             'some_class,someother_class'
                            ]})


I want to transform this data into something like this (image: the data frame as it should look after the transformation).

Is there a pythonic way to make this transformation for multilabel classification?

You can split the label column on commas, use pandas.DataFrame.explode to give each label its own row, and then cross-tabulate the labels against the Sentence column with pandas.crosstab.
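To see what the split-and-explode step does before the cross-tabulation, here is a minimal sketch on the first two rows of the question's data. The name long_df is only illustrative and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": ["Hello World",
                 "The quick brown fox jumps over the lazy dog."],
    "label": ["greetings", "dog,fox"],
})

# Split each comma-separated string into a list of labels,
# then emit one row per list element (the original index is repeated).
long_df = df.assign(label=df["label"].str.split(",")).explode("label")

print(long_df)
```

The second sentence now appears twice, once with dog and once with fox, which is the long format that a cross-tabulation can count.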

Try this:

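def cross_labels(df):
    # Count how often each label occurs with each sentence (0 or 1 here).
    return pd.crosstab(df["Sentence"], df["label"])

out = (
    df.assign(label=df["label"].str.split(","))  # "dog,fox" -> ["dog", "fox"]
      .explode("label")                          # one row per (sentence, label) pair
      .pipe(cross_labels)                        # sentences as rows, labels as columns
      .rename_axis(None, axis=1)                 # drop the "label" name from the column axis
      .reset_index()                             # turn Sentence back into a regular column
)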

# Output:

print(out)

                                       Sentence  dog  fox  greetings  some_class  someother_class
0                                   Hello World    0    0          1           0                0
1        Just some text to make third sentence!    0    0          0           1                1
2  The quick brown fox jumps over the lazy dog.    1    1          0           0                0
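Note that crosstab sorts the sentences, so the rows above are no longer in their original order. As a possible alternative (not part of the original answer), pandas.Series.str.get_dummies can split on a separator and build the indicator columns in one step while keeping the original row order. A minimal sketch, where indicators and out_alt are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": ["Hello World",
                 "The quick brown fox jumps over the lazy dog.",
                 "Just some text to make third sentence!"],
    "label": ["greetings",
              "dog,fox",
              "some_class,someother_class"],
})

# One indicator column per distinct label found after splitting on ",".
indicators = df["label"].str.get_dummies(sep=",")

# Attach the indicators to the sentences; both share the original index,
# so the rows stay in their original order.
out_alt = df[["Sentence"]].join(indicators)

print(out_alt)
```

This variant also keeps duplicate sentences as separate rows, whereas crosstab would merge them into one row and add up their counts. It assumes the labels contain no stray whitespace around the commas.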
