Replace Cell Values in Pandas with matching ID from another dataframe

I have a dataframe containing a list of domains (vertices/nodes in my case), which I'm storing with the pandas library:

                 domain
0            airbnb.com
1          facebook.com
2                st.org
3              index.co
4        crunchbase.com
5               avc.com
6        techcrunch.com
7            google.com

I have another dataframe which contains the connections between these domains (aka edges):

           source_domain    destination_domain
0             airbnb.com            google.com
1           facebook.com            google.com
2                 st.org          facebook.com
3                 st.org            airbnb.com
4                 st.org        crunchbase.com
5               index.co        techcrunch.com
6         crunchbase.com        techcrunch.com
7         crunchbase.com            airbnb.com
8                avc.com        techcrunch.com
9         techcrunch.com                st.org
10        techcrunch.com            google.com
11        techcrunch.com          facebook.com

Since this dataset will get much larger, I read that I can get faster performance if I represent the "edges" dataframe with integers only instead of strings.

So, I'm wondering if there is a fast way to replace each cell in the edges dataframe with the corresponding id from the domains (aka vertices) dataframe? So row 1 in the edges dataframe might end up looking like:

###### Before: ##################### 
1           facebook.com google.com   
###### After:  #####################   
1           1            7

How can I go about doing this? Thank you in advance.

This is a good use case for Categorical data: http://pandas.pydata.org/pandas-docs/stable/categorical.html

In short, Categorical Series will internally represent each item as a number, but display it as a string. This is useful when you have a lot of repeated strings.

It's easier and less error-prone to use Categorical Series vs converting everything to integers manually.
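As a minimal sketch of what "internally a number, displayed as a string" means, the snippet below converts a small made-up Series (the sample values are only for illustration) to a categorical and looks at its categories and integer codes:

```python
import pandas as pd

# Hypothetical sample data with a repeated string
domains = pd.Series(["google.com", "st.org", "google.com", "airbnb.com"])
cat = domains.astype("category")

# The distinct values; for strings pandas sorts them by default
categories = list(cat.cat.categories)
# The integer stored for each row (position of the value in `categories`)
codes = list(cat.cat.codes)

print(categories)  # ['airbnb.com', 'google.com', 'st.org']
print(codes)       # [1, 2, 1, 0]
```

Note that the codes here follow the sorted category order, not the order of any other dataframe, so they only match the vertex IDs if the categories are set explicitly.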

I tried to implement the other answer: convert the columns to Categorical and use cat.codes to get the integers:

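import pandas as pd

# categories must be unique; if df1['domain'] can contain duplicates, use:
# cats = df1['domain'].unique()
cats = df1['domain']
# newer pandas versions no longer accept astype('category', categories=...),
# so the categories are passed through a CategoricalDtype instead
cat_type = pd.CategoricalDtype(categories=cats)
df2['source_domain'] = df2['source_domain'].astype(cat_type)
df2['destination_domain'] = df2['destination_domain'].astype(cat_type)
# the codes are the positions of each value in cats
df2['source_code'] = df2['source_domain'].cat.codes
df2['dest_code'] = df2['destination_domain'].cat.codes
print (df2)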
     source_domain destination_domain  source_code  dest_code
0       airbnb.com         google.com            0          7
1     facebook.com         google.com            1          7
2           st.org       facebook.com            2          1
3           st.org         airbnb.com            2          0
4           st.org     crunchbase.com            2          4
5         index.co     techcrunch.com            3          6
6   crunchbase.com     techcrunch.com            4          6
7   crunchbase.com         airbnb.com            4          0
8          avc.com     techcrunch.com            5          6
9   techcrunch.com             st.org            6          2
10  techcrunch.com         google.com            6          7
11  techcrunch.com       facebook.com            6          1

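# or overwrite the original columns directly with the codes
# (newer pandas versions need a CategoricalDtype here)
cat_type = pd.CategoricalDtype(categories=cats)
df2['source_domain'] = df2['source_domain'].astype(cat_type).cat.codes
df2['destination_domain'] = df2['destination_domain'].astype(cat_type).cat.codes
print (df2)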
    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1

If you want to replace by a dict, use map:

d = dict(zip(df1.domain.values, df1.index.values))
df2['source_code'] = df2['source_domain'].map(d)
df2['dest_code'] = df2['destination_domain'].map(d)
print (df2)
     source_domain destination_domain  source_code  dest_code
0       airbnb.com         google.com            0          7
1     facebook.com         google.com            1          7
2           st.org       facebook.com            2          1
3           st.org         airbnb.com            2          0
4           st.org     crunchbase.com            2          4
5         index.co     techcrunch.com            3          6
6   crunchbase.com     techcrunch.com            4          6
7   crunchbase.com         airbnb.com            4          0
8          avc.com     techcrunch.com            5          6
9   techcrunch.com             st.org            6          2
10  techcrunch.com         google.com            6          7
11  techcrunch.com       facebook.com            6          1
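One caveat with map, as far as I know: a value that has no key in the dictionary comes back as NaN, which also turns the column into floats. The sketch below uses small made-up frames (unknown.com is a hypothetical domain that is missing from df1) to show a check before casting to integers:

```python
import pandas as pd

# Hypothetical miniature versions of df1 (vertices) and df2 (edges)
df1 = pd.DataFrame({"domain": ["airbnb.com", "google.com"]})
df2 = pd.DataFrame({
    "source_domain": ["airbnb.com", "unknown.com"],
    "destination_domain": ["google.com", "airbnb.com"],
})

d = dict(zip(df1["domain"], df1.index))
source_code = df2["source_domain"].map(d)
dest_code = df2["destination_domain"].map(d)

# Rows whose source domain was not found in df1
unmatched = df2.loc[source_code.isna(), "source_domain"].tolist()
print(unmatched)

# Only cast to int when every value was matched
if not dest_code.isna().any():
    dest_code = dest_code.astype(int)
```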

The simplest way to do this is to generate a dictionary from the vertices dataframe and use it with replace. This works only IF we can be sure that the vertices dataframe is the definitive set of vertices that will show up in the edges.

The index of the vertices dataframe already holds the integer id for each domain, so we can zip the two together:

m = dict(zip(vertices.domain, vertices.index))
edges.replace(m)

    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1
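The "IF" above can be checked before replacing. A minimal sketch, using small made-up frames in place of the real vertices and edges (unknown.com is a hypothetical domain that is absent from vertices):

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
vertices = pd.DataFrame({"domain": ["airbnb.com", "google.com"]})
edges = pd.DataFrame({
    "source_domain": ["airbnb.com", "unknown.com"],
    "destination_domain": ["google.com", "airbnb.com"],
})

# Every domain mentioned anywhere in the edges
edge_domains = set(edges["source_domain"]) | set(edges["destination_domain"])
# Domains that appear in the edges but have no vertex id
missing = edge_domains - set(vertices["domain"])
print(missing)
```

If missing is empty, every cell in edges has a key in the dictionary m.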

You can also use stack / map / unstack

m = dict(zip(vertices.domain, vertices.index))
edges.stack().map(m).unstack()

    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1

editorial

I wanted to comment on @JohnZwinck's answer in addition to providing information of my own.

First, categorical would provide faster performance. However, I'm unclear on how to ensure that two columns have coordinated categories. What I mean by coordinated is this: each column gets a set of integers assigned to its categories behind the scenes, and we have no way (not that I know of) to know or enforce that these integers are the same in both columns. If we made it one big column and then converted that column to a categorical, that would work. However, I believe it would turn back to object once we split it into two columns again.
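One possible way to get coordinated categories, sketched here under the assumption that a reasonably recent pandas is available: build a single pd.CategoricalDtype from the vertices and cast both columns to it. The small frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
vertices = pd.DataFrame(
    {"domain": ["airbnb.com", "facebook.com", "st.org", "google.com"]}
)
edges = pd.DataFrame({
    "source_domain": ["airbnb.com", "st.org"],
    "destination_domain": ["google.com", "facebook.com"],
})

# One dtype shared by both columns, with categories in vertex order
shared = pd.CategoricalDtype(categories=vertices["domain"].tolist())
coded = edges.astype(shared)

# Both columns stay categorical and use the same string-to-code mapping
codes = coded.apply(lambda col: col.cat.codes)
print(coded.dtypes)
print(codes)
```

Because the categories are listed in the same order as the vertices index, the codes line up with the vertex ids in this sketch.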
