
What is the fastest way to map values in a DataFrame/Series according to a dict?

I have a data set with 5,000,000 rows x 3 columns.

Basically, it looks like:

    location       os  clicked
0      China      ios      1
1        USA  android      0
2      Japan      ios      0
3      China  android      1

So I turned to pandas.DataFrame for its fast vectorized operations.

Now I want to replace the values in one column of the DataFrame according to a dict.

NOTE: the dict I use as a reference looks like:

    {'China': 1,
     'USA': 2,
     'Japan': 3,
     ....: ..
    }

I built it by numbering the unique values returned by calling drop_duplicates() on the column.
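The dict construction described above can be sketched like this; the miniature DataFrame is hypothetical stand-in data, not the real 5,000,000-row set:

```python
import pandas as pd

# Hypothetical miniature version of the data set
df = pd.DataFrame({
    "location": ["China", "USA", "Japan", "China"],
    "os": ["ios", "android", "ios", "android"],
    "clicked": [1, 0, 0, 1],
})

# Number the unique values of the column, starting from 1
# as in the example dict above
uniques = df["location"].drop_duplicates()
mapping = {label: i for i, label in enumerate(uniques, start=1)}
# mapping == {'China': 1, 'USA': 2, 'Japan': 3}
```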

Finally, I got:

    location     os  clicked
0         1      ios      1
1         2  android      0
2         3      ios      0
3         1  android      1

The full mapping with replace() took 446 s.

Is there a faster way to do this?

I suspect the replace() function wastes a lot of time on pointless searching. Am I on the right track?
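For the literal question asked (fast dict-based value mapping), one commonly suggested alternative to replace() is Series.map, which does a single hash lookup per row. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the 5M-row set
df = pd.DataFrame({
    "location": ["China", "USA", "Japan", "China"],
    "os": ["ios", "android", "ios", "android"],
    "clicked": [1, 0, 0, 1],
})
mapping = {"China": 1, "USA": 2, "Japan": 3}

# Series.map looks each value up in the dict directly,
# avoiding replace()'s more general search machinery
df["location"] = df["location"].map(mapping)
# df["location"].tolist() == [1, 2, 3, 1]
```

Note that map() leaves NaN for values missing from the dict, while replace() keeps them unchanged, so make sure the dict covers every unique value.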

I can answer my own question now.

The point of doing this is to handle categorical data, which comes up over and over again in classification tasks. Usually we want to use one-hot encoding to convert categorical data into a numerical vector acceptable to the sklearn package or statsmodels.

To do so, simply read the csv file as a pandas.DataFrame using: data = pd.read_csv(dir, encoding='utf-8')

then:

data_binary = pd.get_dummies(data, prefix=['os', 'locate'], columns=['os', 'location'])

and you are good to go.
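On hypothetical toy data, the get_dummies call above expands each categorical column into one 0/1 indicator column per category, with names taken from the matching prefix:

```python
import pandas as pd

# Hypothetical toy data with the same columns as the question
data = pd.DataFrame({
    "location": ["China", "USA", "Japan", "China"],
    "os": ["ios", "android", "ios", "android"],
    "clicked": [1, 0, 0, 1],
})

# Each prefix pairs with the column in the same position,
# so 'os' -> os_*, 'location' -> locate_*
data_binary = pd.get_dummies(data, prefix=['os', 'locate'],
                             columns=['os', 'location'])
# Resulting columns include clicked, os_android, os_ios,
# locate_China, locate_Japan, locate_USA
```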
