
What is the fastest way to map values in a DataFrame/Series according to a dict?

I have a data set with 5,000,000 rows x 3 columns.

Basically, it looks like:

    location       os  clicked
0      China      ios      1
1        USA  android      0
2      Japan      ios      0
3      China  android      1

So, I turned to pandas.DataFrame for some awesome and fast support.

Now I want to replace the values in one Series (column) of the DataFrame according to a dict.

NOTE: the dict I use as the reference looks like:

{ 'China': 1,
  'USA':   2,
  'Japan': 3,
  # ....
}

The dict has one entry per distinct value because I built it with Pandas.DataFrame.Column_Label.drop_duplicates().

Finally, I got:
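One plausible way to build such a dict from drop_duplicates() is sketched below. The sample rows, the variable names, and the numbering scheme (starting at 1, in order of first appearance) are assumptions inferred from the example above, not the original code.

```python
import pandas as pd

# Sample rows copied from the question; the real data has 5,000,000 rows.
df = pd.DataFrame({
    "location": ["China", "USA", "Japan", "China"],
    "os": ["ios", "android", "ios", "android"],
    "clicked": [1, 0, 0, 1],
})

# Distinct values, kept in order of first appearance.
unique_locations = df["location"].drop_duplicates()

# Assumed scheme: number the distinct values starting from 1.
location_to_id = {name: i for i, name in enumerate(unique_locations, start=1)}
print(location_to_id)
```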

    location     os  clicked
0         1      ios      1
1         2  android      0
2         3      ios      0
3         1  android      1

The full mapping took 446 s.

Is there a faster way to do this?

I think the replace() function wastes a lot of time on pointless searching. So am I heading in the right direction?
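For comparison, a commonly suggested alternative to replace() for a plain dict lookup is Series.map, which looks each value up in the dict directly. The sketch below uses the sample rows and a hypothetical location_to_id dict; it is not timed here, and how much faster it runs on 5,000,000 rows would need to be measured on the real data.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "location": ["China", "USA", "Japan", "China"],
    "os": ["ios", "android", "ios", "android"],
    "clicked": [1, 0, 0, 1],
})

# Hypothetical reference dict, matching the one shown above.
location_to_id = {"China": 1, "USA": 2, "Japan": 3}

# map() does one dict lookup per value.
# Values that are missing from the dict become NaN rather than staying unchanged.
df["location"] = df["location"].map(location_to_id)
print(df)
```

Note the difference in behaviour: replace() leaves unmatched values as they are, while map() turns them into NaN, so map() is only a drop-in substitute when the dict covers every value in the column.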

I can answer my own question now.

The point of doing this is to handle categorical data, which comes up again and again in classification tasks and the like. The usual first step is one-hot encoding, which converts categorical data into numerical vectors that the sklearn package or statsmodels can accept.

To do so, simply read the csv file into a pandas.DataFrame by using: data = pd.read_csv(dir, encoding='utf-8')

then:

data_binary = pd.get_dummies(data, prefix=['os', 'locate'], columns=['os', 'location'])

and you are all good to go.
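To make the result concrete, here is a small sketch that applies the same get_dummies call to the four sample rows from the question, built in memory instead of read from a file (an assumption made so the example is self-contained).

```python
import pandas as pd

# Sample rows copied from the question, standing in for the csv file.
data = pd.DataFrame({
    "location": ["China", "USA", "Japan", "China"],
    "os": ["ios", "android", "ios", "android"],
    "clicked": [1, 0, 0, 1],
})

# One indicator column per distinct value of 'os' and 'location'.
# The prefixes are the ones used in the answer above.
data_binary = pd.get_dummies(data, prefix=["os", "locate"], columns=["os", "location"])
print(list(data_binary.columns))
```

The untouched clicked column comes first, followed by os_android, os_ios, locate_China, locate_Japan and locate_USA. This replaces the integer mapping altogether, so no dict lookup is needed.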
