Recoding categorical labels in Python Pandas

I am struggling to recode some categorical labels. Here is a minimal example of what I am doing.

import pandas as pd
testDict = {'Col1' : pd.Categorical(["a", "b", "c", "d", "e"]),
          'Col2' : pd.Categorical(["1", "2", "3", "4", "5"])}

testDF = pd.DataFrame.from_dict(testDict)
testDF
testDF['Col1'].value_counts()
def letter_recode(Col1):
    # Merge "a"/"b" into "ab" and "c"/"d" into "cd"; leave other labels unchanged
    if Col1 in ("a", "b"):
        return "ab"
    elif Col1 in ("c", "d"):
        return "cd"
    else:
        return Col1

testDF['Col3'] = testDF['Col1'].apply(letter_recode)

testDF['Col3'].value_counts()
testDF

I want to change this df:

   Col1 Col2
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

into this:

  Col1 Col2 Col3
0   a   1   ab
1   b   2   ab
2   c   3   cd
3   d   4   cd
4   e   5   e

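The code above works, but when I try it on my actual DataFrame, nothing changes. Also, when I take a small slice of the DataFrame and run the code, I get the following warning, and I do not understand the documentation it points to.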

df5 = df.loc[0:4,:]
df5
    age workclass   fnlwgt  education   education-num   marital-status  occupation  relationship    race    sex capital-gain    capital-loss    hours-per-week  native-country  salary  workclassR
0   50  Self-emp-not-inc    83311   Bachelors   13  Married-civ-spouse  Exec-managerial Husband White   Male    0   0   13  United-States   <=50K   Self-emp-not-inc
1   38  Private 215646  HS-grad 9   Divorced    Handlers-cleaners   Not-in-family   White   Male    0   0   40  United-States   <=50K   Private
2   53  Private 234721  11th    7   Married-civ-spouse  Handlers-cleaners   Husband Black   Male    0   0   40  United-States   <=50K   Private
3   28  Private 338409  Bachelors   13  Married-civ-spouse  Prof-specialty  Wife    Black   Female  0   0   40  Cuba    <=50K   Private
4   37  Private 284582  Masters 14  Married-civ-spouse  Exec-managerial Wife    White   Female  0   0   40  United-States   <=50K   Private

def rename_workclass(wc):
    # Collapse related workclass labels into broader groups
    if wc in ("Never-worked", "Without-pay"):
        return "Unemployed"
    elif wc in ("State-gov", "Local-gov"):
        return "Gov"
    elif wc in ("Self-emp-inc", "Self-emp-not-inc"):
        return "Self-emp"
    else:
        return wc


df5['workclassR'] = df5['workclass'].apply(rename_workclass)

C:\Users\karol\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':

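Thanks a lot for the help. My problem was that the values have leading spaces, and I was comparing them against strings without spaces. Also, the warning above can be silenced by declaring that the sliced DataFrame is not a copy: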

df5 = df.iloc[0:4, :]  # select the first four rows by position
df5.is_copy = False
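
A minimal sketch of both fixes together, assuming a small made-up frame whose `workclass` values carry a leading space (the sample values and the `mapper` dictionary are illustrative, not taken from the real data). Since `is_copy` has been deprecated in newer pandas releases, this sketch takes an explicit `.copy()` of the slice instead, and strips whitespace with `.str.strip()` before recoding:

```python
import pandas as pd

# Hypothetical stand-in for the real data: every value has a leading space.
df = pd.DataFrame({"workclass": [" Self-emp-not-inc", " Private", " State-gov",
                                 " Never-worked", " Local-gov"]})

# An explicit copy makes the slice independent of df, so assigning a new
# column to it does not raise SettingWithCopyWarning.
df5 = df.iloc[0:5, :].copy()

# Illustrative grouping of labels (an assumption for this example).
mapper = {"Never-worked": "Unemployed", "Without-pay": "Unemployed",
          "State-gov": "Gov", "Local-gov": "Gov",
          "Self-emp-inc": "Self-emp", "Self-emp-not-inc": "Self-emp"}

# Remove surrounding whitespace first so the labels match the mapper keys,
# then fall back to the stripped label where the mapper has no entry.
wc = df5["workclass"].str.strip()
df5["workclassR"] = wc.map(mapper).fillna(wc)

print(df5)
```

Stripping once up front keeps the comparison strings clean; the alternative of writing `" Private"` with the space in every comparison is easy to get wrong.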

Try using pd.Series.map(). A toy example:

import pandas as pd

s = pd.Series(["Private", "Public", "?"])  # sample data for the example
s = s.map({"Private": "Private-changed",
           "Public": "Public_changed",
           "?": "What is this"})
s

This gives you:

0    Private-changed
1     Public_changed
2       What is this

You can use pd.Series.map with a dictionary, then use fillna with the original series to keep the values that have no entry in the dictionary:

import pandas as pd

df = pd.DataFrame({'Col1' : pd.Categorical(["a", "b", "c", "d", "e"]),
                   'Col2' : pd.Categorical(["1", "2", "3", "4", "5"])})

mapper = {'a': 'ab', 'b': 'ab', 'c': 'cd', 'd': 'cd'}

df['Col3'] = df['Col1'].map(mapper).fillna(df['Col1'])

print(df['Col3'].value_counts())

cd    2
ab    2
e     1
Name: Col3, dtype: int64
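
If the recoded column should itself be categorical, one possible follow-up (a sketch, not part of the original answer) is to do the mapping on plain strings and convert the result with `astype("category")`. The intermediate name `col1` is introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Col1": pd.Categorical(["a", "b", "c", "d", "e"])})
mapper = {"a": "ab", "b": "ab", "c": "cd", "d": "cd"}

# Work on plain strings so unmapped labels can be filled back in freely,
# then convert the recoded column to a categorical dtype.
col1 = df["Col1"].astype(str)
df["Col3"] = col1.map(mapper).fillna(col1).astype("category")

print(df["Col3"].cat.categories)
```

The new categories are inferred from the recoded values, so the merged labels replace the original ones in `Col3` while `Col1` is left unchanged.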
