How to train/test/split data based on labels?

Question

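How can I split data into training and test datasets based on labels? The labels are 1 and 0, and I want to use all the 1s as the training dataset and all the 0s as the test dataset. The csv file looks like this: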

1   Pixar classic is one of the best kids' movies of all time.
1   Apesar de representar um imenso avanço tecnológico, a força do filme reside no carisma de seus personagens e no charme de sua história.
1   When Woody perks up in the opening scene, it's not only the toy cowboy who comes alive - we're watching the rebirth of an art form.
0   The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1   Introduced not one but two indelible characters to the pop culture pantheon: cowboy rag-doll Woody (Tom Hanks) and plastic space ranger Buzz Lightyear (Tim Allen). [Blu-ray]
1   it is easy to see how virtually everything that is good in animation right now has some small seed in Toy Story
0   All the effects in the world can't disguise the thin plot.
1   Though some of the animation seems dated compared to later Pixar efforts and not nearly as detailed, what's here is done impeccably well.

Answer 1

Normally you would not want to do this, but the following solution works. I tried it on a very small dataframe, and it seems to do the job.

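import pandas as pd
from sklearn.model_selection import train_test_split

# Small example frame with two labels, 'S' and 'P'
Df = pd.DataFrame()
Df['label'] = ['S', 'S', 'S', 'P', 'P', 'S', 'P', 'S']
Df['value'] = [1, 2, 3, 4, 5, 6, 7, 8]
Df

# Filter the rows by label
X = Df[Df.label == 'S']
Y = Df[Df.label == 'P']

# train_test_split returns (train part, test part) of whatever it is given.
# Despite the names, xtrain/ytrain are the two parts of the 'S' rows
# and xtest/ytest are the two parts of the 'P' rows.
xtrain, ytrain = train_test_split(X, test_size=0.3, random_state=25, shuffle=True)
xtest, ytest = train_test_split(Y, test_size=0.3, random_state=25, shuffle=True)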

I got the following results:

xtrain

    label   value
5   S       6
2   S       3
7   S       8

xtest

    label   value
6   P       7
3   P       4

ytest

    label   value
4   P       5

ytrain

    label   value
0   S       1
1   S       2
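
For contrast, since splitting purely by label is usually not what you want, here is a minimal sketch of the more common setup, where both labels appear in both sets. It reuses the toy frame from this answer; the names df, train and test and the test_size of 0.25 are only illustrative choices, not something taken from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as above (illustrative only)
df = pd.DataFrame({
    "label": ["S", "S", "S", "P", "P", "S", "P", "S"],
    "value": [1, 2, 3, 4, 5, 6, 7, 8],
})

# stratify keeps the label proportions roughly equal in both parts,
# so each part contains both 'S' and 'P' rows
train, test = train_test_split(
    df, test_size=0.25, random_state=25, stratify=df["label"]
)

print(train["label"].value_counts())
print(test["label"].value_counts())
```

With only eight rows the proportions are necessarily coarse, but the idea carries over to a real dataset.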

Answer 2

Try this:

mask = df['label']==1
df_train = df[mask]
df_test = df[~mask]

You just need to filter the dataframe.
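
To connect the mask to the file from the question, here is a rough sketch. It assumes the file is tab-separated with no header row, which is a guess based on how the sample is displayed; the column names label and text are made up for the example, and a short inline string stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Assumption: tab-separated, no header. Replace io.StringIO(sample)
# with the path to the real file.
sample = (
    "1\tPixar classic is one of the best kids' movies of all time.\n"
    "0\tThe humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.\n"
    "1\tit is easy to see how virtually everything that is good in animation right now has some small seed in Toy Story\n"
    "0\tAll the effects in the world can't disguise the thin plot.\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=["label", "text"])

# Boolean mask: True where the label is 1
mask = df["label"] == 1
df_train = df[mask].reset_index(drop=True)   # all rows labelled 1
df_test = df[~mask].reset_index(drop=True)   # all remaining rows (labelled 0)

print(len(df_train), len(df_test))
```

If the real file uses a different delimiter, only the sep argument should need to change.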

Answer 3

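import pandas as pd

# The column is named 'label' so that it matches the output and the filters below
d = {'label': [1, 1, 1, 1, 0, 0, 0, 0], 'text': ["a", "b", "c", "d", "e", "f", "g", "h"]}
df = pd.DataFrame(data=d)
df.head()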

    label   text
0   1       a
1   1       b
2   1       c
3   1       d
4   0       e

You can use the code below to filter rows by value. It selects the rows where label equals 1; the same filter with 0 gives the test set.

traindf = df[df["label"] == 1]
traindf

    label   text
0   1       a
1   1       b
2   1       c
3   1       d

testdf = df[df["label"] == 0]
testdf

    label   text
4   0       e
5   0       f
6   0       g
7   0       h
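
As an alternative to writing one filter per label, the same split can be sketched with groupby. This is just another way to get the same two frames, shown on the toy data from this answer; the parts dictionary is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
    "text": ["a", "b", "c", "d", "e", "f", "g", "h"],
})

# One sub-frame per distinct label value, keyed by that value
parts = {label: group for label, group in df.groupby("label")}

traindf = parts[1]  # rows where label == 1
testdf = parts[0]   # rows where label == 0

print(traindf["text"].tolist())
print(testdf["text"].tolist())
```

This mainly pays off when there are more than two label values; for a plain 1/0 split the direct filters above are just as clear.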
