
How to replace string values in a pandas DataFrame with integers?

I have a pandas DataFrame that contains several string values. I want to replace them with integer values in order to calculate similarities. For example:

stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]: 
    CNPJ_Store_Code      region  total_facings
1    93209765046613   Geo RS/SC       1.471690
16   93209765046290   Geo RS/SC       1.385636
19   93209765044084  Geo PR/SPI       0.217054
21   93209765044831   Geo RS/SC       0.804633
23   93209765045218  Geo PR/SPI       0.708165

I want to replace region == 'Geo RS/SC' with 1, region == 'Geo PR/SPI' with 2, and so on.

Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be. Any ideas? I have tried DictVectorizer, with no success.

I'm sure there is a smart way to do this, but I just can't find it.

Is anyone familiar with a solution?

You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:

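region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI': 2}  # .... add the remaining regions here
# Note: the lookup raises KeyError for any region missing from the dictionary.
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])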
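
Since the question asks to avoid writing the dictionary by hand, here is a minimal sketch of building it from the values actually present in the column. The toy DataFrame and the region_code column name are assumptions for illustration, not part of the original answer; Series.unique() and Series.map() are standard pandas calls.

```python
import pandas as pd

# Toy data modeled on the question's sample; values are illustrative only.
stores = pd.DataFrame({
    'region': ['Geo RS/SC', 'Geo RS/SC', 'Geo PR/SPI', 'Geo RS/SC', 'Geo PR/SPI'],
})

# Number the regions 1, 2, ... in order of first appearance.
region_dictionary = {
    name: code
    for code, name in enumerate(stores['region'].unique(), start=1)
}

# Series.map looks each value up in the dictionary;
# values missing from the dictionary would become NaN.
stores['region_code'] = stores['region'].map(region_dictionary)
print(stores)
```

Writing to a separate column keeps the original region names available for checking the mapping afterwards.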

It looks to me like what you really want is pandas categoricals:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

I think you just need to change the dtype of your text column to "category" and you are done.

stores['region'] = stores["region"].astype('category')
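
Converting to the category dtype keeps the region names as the visible values; the integers live in the column's codes. A small sketch of reading them out through the .cat.codes accessor follows. The toy DataFrame and the region_code column name are assumptions for illustration.

```python
import pandas as pd

# Toy data modeled on the question's sample; values are illustrative only.
stores = pd.DataFrame({
    'region': ['Geo RS/SC', 'Geo RS/SC', 'Geo PR/SPI', 'Geo RS/SC', 'Geo PR/SPI'],
})

stores['region'] = stores['region'].astype('category')

# Categories inferred from strings are sorted, so codes start at 0
# and follow alphabetical order rather than order of appearance.
stores['region_code'] = stores['region'].cat.codes
print(stores['region'].cat.categories)
print(stores)
```

Add 1 to the codes if the numbering should start at 1 as in the question; missing values get the code -1.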

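You can do:

import pandas as pd

df = pd.read_csv(filename, index_col=0)  # Assuming it's a CSV file; filename is the path to your data.

def region_to_numeric(a):
    # Map each known region name to its integer code.
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2
    # Any other region falls through and returns None.


df['region_num'] = df['region'].apply(region_to_numeric)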
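
The function above only covers regions that are listed by hand. As a possible alternative when the regions are not known in advance, pd.factorize assigns an integer to each distinct value automatically. This is a sketch on assumed toy data, not the original answerer's method.

```python
import pandas as pd

# Toy data modeled on the question's sample; values are illustrative only.
df = pd.DataFrame({
    'region': ['Geo RS/SC', 'Geo RS/SC', 'Geo PR/SPI', 'Geo RS/SC', 'Geo PR/SPI'],
})

# factorize returns zero-based codes in order of first appearance,
# plus the distinct values so the codes can be translated back.
codes, uniques = pd.factorize(df['region'])

# Shift by 1 so numbering starts at 1, as in the question.
df['region_num'] = codes + 1
print(df)
print(list(uniques))
```

The uniques index gives the reverse lookup: uniques[n - 1] is the region name for code n.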
