Replace strings in every column with numbers
This question is an extension of this question. Consider the pandas DataFrame visualized in the table below.
| | respondent | brand | engine | country | aware | aware_2 | aware_3 | age | tesst | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | volvo | p | swe | 1 | 0 | 1 | 23 | set | set |
| 1 | b | volvo | None | swe | 0 | 0 | 1 | 45 | set | set |
| 2 | c | bmw | p | us | 0 | 0 | 1 | 56 | test | test |
| 3 | d | bmw | p | us | 0 | 1 | 1 | 43 | test | test |
| 4 | e | bmw | d | germany | 1 | 0 | 1 | 34 | set | set |
| 5 | f | audi | d | germany | 1 | 0 | 1 | 59 | set | set |
| 6 | g | volvo | d | swe | 1 | 0 | 0 | 65 | test | set |
| 7 | h | audi | d | swe | 1 | 0 | 0 | 78 | test | set |
| 8 | i | volvo | d | us | 1 | 1 | 1 | 32 | set | set |
To convert a column with string entries, one can define a mapping and apply it with DataFrame.replace().
For example:
mapping = {'set': 1, 'test': 2}
df = df.replace({'set': mapping, 'tesst': mapping})
This would lead to the following DataFrame (table):
| | respondent | brand | engine | country | aware | aware_2 | aware_3 | age | tesst | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | volvo | p | swe | 1 | 0 | 1 | 23 | 1 | 1 |
| 1 | b | volvo | None | swe | 0 | 0 | 1 | 45 | 1 | 1 |
| 2 | c | bmw | p | us | 0 | 0 | 1 | 56 | 2 | 2 |
| 3 | d | bmw | p | us | 0 | 1 | 1 | 43 | 2 | 2 |
| 4 | e | bmw | d | germany | 1 | 0 | 1 | 34 | 1 | 1 |
| 5 | f | audi | d | germany | 1 | 0 | 1 | 59 | 1 | 1 |
| 6 | g | volvo | d | swe | 1 | 0 | 0 | 65 | 2 | 1 |
| 7 | h | audi | d | swe | 1 | 0 | 0 | 78 | 2 | 1 |
| 8 | i | volvo | d | us | 1 | 1 | 1 | 32 | 1 | 1 |
As seen above, the strings in the last two columns are replaced with numbers representing these strings.
The question is then: is there a faster, less hands-on way to replace all the strings with numbers? Can one automatically create a mapping (and output it somewhere for human reference)?
Something that makes the DataFrame end up like:
| | respondent | brand | engine | country | aware | aware_2 | aware_3 | age | tesst | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 23 | 1 | 1 |
| 1 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 45 | 1 | 1 |
| 2 | 3 | 2 | 1 | 2 | 0 | 0 | 1 | 56 | 2 | 2 |
| 3 | 4 | 2 | 1 | 2 | 0 | 1 | 1 | 43 | 2 | 2 |
| 4 | 5 | 2 | 3 | 3 | 1 | 0 | 1 | 34 | 1 | 1 |
| 5 | 6 | 3 | 3 | 3 | 1 | 0 | 1 | 59 | 1 | 1 |
| 6 | 7 | 1 | 3 | 1 | 1 | 0 | 0 | 65 | 2 | 1 |
| 7 | 8 | 3 | 3 | 1 | 1 | 0 | 0 | 78 | 2 | 1 |
| 8 | 9 | 1 | 3 | 2 | 1 | 1 | 1 | 32 | 1 | 1 |
Also output:
[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]
Note that the output list of maps (dicts) should not be hard-coded but instead produced by the code.
You can adapt the code given in this answer, https://stackoverflow.com/a/39989896/15320403 (inside the post you linked), to generate a mapping for each column of your choice and apply replace as you suggested:
all_brands = df.brand.unique()
brand_dic = dict(zip(all_brands, range(len(all_brands))))
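Extending the two lines above to several columns could look like the sketch below. The helper name `build_mappings`, its `start` parameter, and the three-column sample frame are made up for illustration; codes start at 1 to match the numbering in the question, and `Series.map` is used here in place of `replace` since both perform a plain dictionary lookup.

```python
import pandas as pd

# Small sample taken from the question's table (columns without missing values).
df = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "bmw", "bmw", "audi", "volvo", "audi", "volvo"],
    "country": ["swe", "swe", "us", "us", "germany", "germany", "swe", "swe", "us"],
    "age": [23, 45, 56, 43, 34, 59, 65, 78, 32],
})


def build_mappings(frame, columns, start=1):
    """Return {column: {value: code}}, codes assigned in order of first appearance."""
    mappings = {}
    for col in columns:
        values = frame[col].unique()  # unique() keeps order of first appearance
        mappings[col] = {value: code for code, value in enumerate(values, start=start)}
    return mappings


mappings = build_mappings(df, ["brand", "country"])

# Apply each mapping to its own column on a copy, leaving df as it was.
encoded = df.copy()
for col, mapping in mappings.items():
    encoded[col] = encoded[col].map(mapping)

print(mappings)
print(encoded)
```

Because the mappings are returned as ordinary dicts, they can be printed or saved for human reference, as the question asks.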
You will need to first change the type of the columns to Categorical and then create a new column or overwrite the existing column with codes:
df['brand'] = pd.Categorical(df['brand'])
df['brand_codes'] = df['brand'].cat.codes
If you need the mapping:
dict(enumerate(df['brand'].cat.categories))  # This works only after you've converted the column to categorical
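As a small illustration of the categorical approach, the sketch below encodes the `engine` column from the question. Two details are worth knowing: categories are sorted, so codes follow alphabetical order instead of order of appearance, and missing values receive the code -1. The variable names `code_to_label` and `label_to_code` are illustrative only.

```python
import pandas as pd

# The engine column from the question, including its missing value.
df = pd.DataFrame({"engine": ["p", None, "p", "p", "d", "d", "d", "d", "d"]})

cat = pd.Categorical(df["engine"])
df["engine_codes"] = cat.codes  # integer code per row; missing values become -1

# Mapping in both directions, built from the categories instead of hard-coded.
code_to_label = dict(enumerate(cat.categories))
label_to_code = {label: code for code, label in code_to_label.items()}

print(df)
print(code_to_label)
```

If codes should start at 1 or follow order of appearance, the codes need to be shifted or the categories passed explicitly.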
Based on the other answers, I've written this function to solve the problem:
import pandas as pd

def convertStringColumnsToNum(data):
    """Replace string columns with integer codes in place; return the categories per converted column."""
    maps = []
    for col in data.columns:
        # don't change columns that already consist of numbers
        if data[col].dtype == 'int64':  # can be extended to more dtypes
            continue
        # inspired by Shivam Roy's answer
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes
        # the position of a string in tmp.categories is its code
        maps.append(tmp.categories)
    return maps
This function returns the maps used to replace strings with numeric codes. Each code is the index at which a string resides inside its map. This function works, yet it comes with the SettingWithCopyWarning.
If it ain't broke, don't fix it, right? ;)
*but if anyone has a way to adapt this function so that the warning is no longer shown, feel free to comment on it. Yet it works *shrugs*
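One possible adaptation, assuming the warning appears because the function is called on a slice of another DataFrame, is to work on an explicit copy and return it together with the maps. The sketch below is not the author's function: the name `encode_string_columns` and its `start` parameter are hypothetical, `pd.factorize` is swapped in so that codes follow order of first appearance, and missing values are turned into the string "None" to match the output requested in the question.

```python
import pandas as pd


def encode_string_columns(data, start=1):
    """Return (encoded copy, list of {label: code} dicts); `data` is left untouched."""
    out = data.copy()  # explicit copy, so the assignments below never target a slice
    maps = []
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric columns are kept as they are
        # Assumption: missing values should be encoded like the string "None".
        labels = out[col].fillna("None")
        codes, uniques = pd.factorize(labels)  # codes start at 0, in order of appearance
        out[col] = codes + start
        maps.append({label: code for code, label in enumerate(uniques, start=start)})
    return out, maps


df = pd.DataFrame({
    "brand": ["volvo", "volvo", "bmw", "bmw", "bmw", "audi", "volvo", "audi", "volvo"],
    "engine": ["p", None, "p", "p", "d", "d", "d", "d", "d"],
    "country": ["swe", "swe", "us", "us", "germany", "germany", "swe", "swe", "us"],
    "age": [23, 45, 56, 43, 34, 59, 65, 78, 32],
})

encoded, maps = encode_string_columns(df)
print(maps)
print(encoded)
```

Returning a new DataFrame instead of modifying the argument also keeps the original strings available for cross-checking against the maps.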