Replace strings in every column with numbers

This question is an extension of this question. Consider the pandas DataFrame shown in the table below.

   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst   set
0           a  volvo       p      swe      1        0        1   23    set   set
1           b  volvo    None      swe      0        0        1   45    set   set
2           c    bmw       p       us      0        0        1   56   test  test
3           d    bmw       p       us      0        1        1   43   test  test
4           e    bmw       d  germany      1        0        1   34    set   set
5           f   audi       d  germany      1        0        1   59    set   set
6           g  volvo       d      swe      1        0        0   65   test   set
7           h   audi       d      swe      1        0        0   78   test   set
8           i  volvo       d       us      1        1        1   32    set   set

To convert a column with string entries, one can define a mapping and pass it to the DataFrame's replace() method.

For example:

mapping = {'set': 1, 'test': 2}
df.replace({'set': mapping, 'tesst': mapping})

This would lead to the following DataFrame (table):

   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0           a  volvo       p      swe      1        0        1   23      1    1
1           b  volvo    None      swe      0        0        1   45      1    1
2           c    bmw       p       us      0        0        1   56      2    2
3           d    bmw       p       us      0        1        1   43      2    2
4           e    bmw       d  germany      1        0        1   34      1    1
5           f   audi       d  germany      1        0        1   59      1    1
6           g  volvo       d      swe      1        0        0   65      2    1
7           h   audi       d      swe      1        0        0   78      2    1
8           i  volvo       d       us      1        1        1   32      1    1

As seen above, the strings in the last two columns are replaced with the numbers that represent them.

The question is then: is there a faster, less hands-on way to replace all the strings with numbers? Can one automatically create the mapping (and output it somewhere for human reference)?

Something that makes the DataFrame end up like this:

   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0           1      1       1        1      1        0        1   23      1    1
1           2      1       2        1      0        0        1   45      1    1
2           3      2       1        2      0        0        1   56      2    2
3           4      2       1        2      0        1        1   43      2    2
4           5      2       3        3      1        0        1   34      1    1
5           6      3       3        3      1        0        1   59      1    1
6           7      1       3        1      1        0        0   65      2    1
7           8      3       3        1      1        0        0   78      2    1
8           9      1       3        2      1        1        1   32      1    1

It should also output:

[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]

Note that the output list of maps (dicts) should not be hard-coded but produced by the code.

You can adapt the code given in this answer, https://stackoverflow.com/a/39989896/15320403 (inside the post you linked), to generate a mapping for each column of your choice and then apply replace as you suggested:

all_brands = df.brand.unique()
brand_dic = dict(zip(all_brands, range(len(all_brands))))
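Extended to several columns, the same idea could look like the sketch below. It is only an illustration under a few assumptions: every non-numeric column is treated as a string column, codes start at 1 to match the desired output in the question, and it applies each mapping with Series.map rather than replace (equivalent here because every value has an entry in its mapping). The helper name build_mappings and the shortened sample frame are hypothetical.

```python
import pandas as pd

# Shortened sample modelled on the question's data (assumption).
df = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "audi"],
    "country": ["swe", "swe", "us", "germany"],
    "age": [23, 45, 56, 59],
})

def build_mappings(frame):
    """Hypothetical helper: one {value: code} dict per non-numeric column."""
    mappings = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # leave numeric columns untouched
        values = frame[col].dropna().unique()  # order of first appearance
        mappings[col] = {value: code for code, value in enumerate(values, start=1)}
    return mappings

mappings = build_mappings(df)

# Apply each mapping to its own column; the original df is left unchanged.
encoded = df.copy()
for col, mapping in mappings.items():
    encoded[col] = df[col].map(mapping)

print(mappings)
```

The mappings dict doubles as the human-readable reference the question asks for. Missing values are skipped when the mapping is built, so they stay missing after map.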

You will need to first change the type of the column to Categorical and then create a new column, or overwrite the existing one, with the codes:

df['brand'] = pd.Categorical(df['brand'])
df['brand_codes'] = df['brand'].cat.codes

If you need the mapping:

dict(enumerate(df['brand'].cat.categories))  # works only after the column has been converted to categorical
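A minimal sketch of the same approach applied to more than one column follows. The sample frame and the "_codes" suffix are assumptions for illustration. Two details are worth knowing: by default the categories are sorted, so codes follow alphabetical order rather than order of appearance, and missing values receive the code -1.

```python
import pandas as pd

# Shortened sample modelled on the question's data (assumption).
df = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "audi"],
    "engine": ["p", None, "p", "d"],
})

mappings = {}
for col in ["brand", "engine"]:
    cat = pd.Categorical(df[col])
    df[col + "_codes"] = cat.codes                 # 0-based codes, -1 for missing
    mappings[col] = dict(enumerate(cat.categories))  # code -> original string

print(df)
print(mappings)
```

Here the mapping goes from code to string, which is convenient for decoding; invert it if a string-to-code lookup is needed.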

Based on the other answers, I've written this function to solve the problem:

import pandas as pd

def convertStringColumnsToNum(data):
    maps = []

    for col in data.columns:
        # skip columns that already consist of numbers
        if data[col].dtype == 'int64':  # can be extended to more dtypes
            continue
        # inspired by Shivam Roy's answer
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes  # overwrites the column in the caller's DataFrame
        maps.append(tmp.categories)  # position in this Index == numeric code

    return maps

This function returns the maps used to replace the strings with numeric codes. Each code is the index at which the string sits in the corresponding list of categories. The function works, yet it triggers a SettingWithCopyWarning.

If it ain't broke, don't fix it, right? ;)

*But if anyone has a way to adapt this function so that the warning is no longer shown, feel free to comment. Yet it works *shrugs*
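One possible adaptation is sketched below. It assumes the warning appears because the function is handed a slice of another DataFrame (the post does not show how it was called), and it sidesteps the issue by encoding an explicit copy and returning it instead of mutating the argument. The name convert_string_columns is hypothetical, and pd.factorize is swapped in for pd.Categorical so that codes follow order of first appearance, as in the question's desired output.

```python
import pandas as pd

def convert_string_columns(data):
    """Hypothetical variant: returns (encoded copy, mappings), input untouched."""
    encoded = data.copy()  # explicit copy, so assignments never target a slice
    maps = {}
    for col in encoded.columns:
        if pd.api.types.is_numeric_dtype(encoded[col]):
            continue  # leave numeric columns as they are
        codes, uniques = pd.factorize(encoded[col])
        encoded[col] = codes + 1  # 1-based codes; missing values become 0
        maps[col] = {value: code for code, value in enumerate(uniques, start=1)}
    return encoded, maps

# Shortened sample modelled on the question's data (assumption).
df = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "audi"],
    "engine": ["p", None, "p", "d"],
    "age": [23, 45, 56, 59],
})

# Passing a filtered slice is the kind of call assumed to trigger the warning.
encoded, maps = convert_string_columns(df[df["age"] > 30])
print(encoded)
print(maps)
```

Returning the encoded frame costs one extra copy, but the caller's data stays intact and the mappings are keyed by column name, which makes them easier to read back later.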
