
Is there a way to find out each occurrence of a column value in another column from a different dataset?

I have two datasets, dataset1 and dataset2, which share a column called SAX that holds string values.

dataset1=
         SAX
0    glngsyu
1    zicobgm
2    eerptow
3    cqbsynt
4    zvmqben
..       ...
475  rfikekw
476  bnbzvqx
477  rsuhgax
478  ckhloio
479  lbzujtw

480 rows × 2 columns

and

dataset2 =
    SAX     timestamp
0   hssrlcu 16015
1   ktyuymp 16016
2   xncqmfr 16017
3   aanlmna 16018
4   urvahvo 16019
... ... ...
263455  jeivqzo 279470
263456  bzasxgw 279471
263457  jspqnqv 279472
263458  sxwfchj 279473
263459  gxqnhfr 279474

263460 rows × 2 columns

I need to find and print the timestamps for every value in the SAX column of dataset1 that also exists in the SAX column of dataset2. Is there a function or method for accomplishing this?

Thanks.

Let's create an arbitrary dataset to showcase how it works:

import pandas as pd
import numpy as np

def sax_generator(num):
    return [''.join(chr(x) for x in np.random.randint(97, 97+26, size=4)) for _ in range(num)]

df1 = pd.DataFrame(sax_generator(10), columns=['sax'])
df2 = pd.DataFrame({'sax': sax_generator(10), 'timestamp': range(10)})

Let's peek into the data:

df1 = 
|    | sax   |
|---:|:------|
|  0 | cvtj  |
|  1 | fmjy  |
|  2 | rjpi  |
|  3 | gwtv  |
|  4 | qhov  |
|  5 | uriu  |
|  6 | kpku  |
|  7 | xkop  |
|  8 | kzoe  |
|  9 | nydj  |

df2 =
|    | sax   |   timestamp |
|---:|:------|------------:|
|  0 | kzoe  |           0 |
|  1 | npyo  |           1 |
|  2 | uriu  |           2 |
|  3 | hodu  |           3 |
|  4 | rdko  |           4 |
|  5 | pspn  |           5 |
|  6 | qnut  |           6 |
|  7 | gtyz  |           7 |
|  8 | gfzs  |           8 |
|  9 | gcel  |           9 |

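Now make sure df2 contains some values from df1, which we can check for later:

# Use .loc instead of chained indexing (df2['sax'][2] = ...), which may
# write to a temporary copy and trigger SettingWithCopyWarning.
df2.loc[2, 'sax'] = df1.loc[5, 'sax']
df2.loc[0, 'sax'] = df1.loc[8, 'sax']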

Then use:

df2.loc[df1.sax.apply(lambda x: df2.sax.str.contains(x)).any(), 'timestamp']

to get:

|    |   timestamp |
|---:|------------:|
|  0 |           0 |
|  2 |           2 |
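
Note that str.contains does substring (regex) matching, so a short value such as "ab" would also match "xabz". If only exact matches are wanted, which is how I read the question, a simpler option is Series.isin. Below is a minimal sketch on small made-up frames (the values are illustrative stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical stand-ins for dataset1 and dataset2 from the question.
df1 = pd.DataFrame({'sax': ['cvtj', 'uriu', 'kzoe']})
df2 = pd.DataFrame({'sax': ['kzoe', 'npyo', 'uriu', 'hodu'],
                    'timestamp': [0, 1, 2, 3]})

# True for every row of df2 whose sax value appears somewhere in df1.sax.
mask = df2['sax'].isin(df1['sax'])

# Keep only the timestamps of the matching rows.
matches = df2.loc[mask, 'timestamp']
print(matches.tolist())  # [0, 2]
```

isin does one membership test per row of df2 instead of one str.contains pass per row of df1, so it should scale much better to a few hundred thousand rows.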

With np.where you can get the matching indices back as well:

np.where(df1.sax.apply(lambda x: df2.sax.str.contains(x)) == True)
# -> (array([5, 8]), array([2, 0]))

Here we can see that df1 has matching indices [5, 8] and df2 has [2, 0], which is exactly what we enforced with the lines above. If we look at the return value of df1.sax.apply(lambda x: df2.sax.str.contains(x)) (booleans, shown here as 0/1), with one row per df1 entry and one column per df2 entry, the nonzero cells sit at exactly those indices:

|    |   0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |
|---:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
|  0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  1 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  2 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  3 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  4 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  5 |   0 |   0 |   1 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  6 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  7 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  8 |   1 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  9 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
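
The same kind of match matrix can be built without apply by comparing the two columns with NumPy broadcasting. This is only a sketch under my own assumptions: it tests exact equality rather than substring containment, and the small frames are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df1 and df2.
df1 = pd.DataFrame({'sax': ['cvtj', 'uriu', 'kzoe']})
df2 = pd.DataFrame({'sax': ['kzoe', 'npyo', 'uriu', 'hodu'],
                    'timestamp': [0, 1, 2, 3]})

a = df1['sax'].to_numpy(dtype=object)  # shape (3,)
b = df2['sax'].to_numpy(dtype=object)  # shape (4,)

# Broadcast to a (3, 4) boolean matrix: equal[i, j] is True when a[i] == b[j].
equal = a[:, None] == b[None, :]

# Row indices refer to df1, column indices to df2.
rows, cols = np.nonzero(equal)
print(rows.tolist(), cols.tolist())  # [1, 2] [2, 0]

# Timestamps of the matching df2 rows, in df1 order.
print(df2['timestamp'].to_numpy()[cols].tolist())  # [2, 0]
```

Keep in mind that the full matrix has len(df1) × len(df2) cells, so for 480 × 263460 rows it holds roughly 126 million booleans.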

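Step 1: Convert dataset 2 to a dict that maps each timestamp to its SAX value:

import pandas as pd

# df is dataset2. Keys are timestamps, values are SAX strings.
# (to_dict('list') would give {column: [values]}, which the loop in Step 2 cannot compare against a string.)
a_dictionary = df.set_index('timestamp')['SAX'].to_dict()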

Step 2: Use a comparison in a for loop to extract the timestamps.

lookup_value = "abcdef"  # This can be an item from a list of values to look up.

# Collect every key (timestamp) whose value equals the lookup value.
all_keys = []
for key, value in a_dictionary.items():
    if value == lookup_value:
        all_keys.append(key)

print(all_keys)
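
The loop above scans the whole dictionary once per lookup value. If many values need to be looked up (for example every SAX string in dataset1), one possible variation is to invert the mapping once, from SAX value to a list of timestamps, and then do direct lookups. A rough sketch follows; the frames and the names timestamps_by_sax and found are made up for illustration:

```python
from collections import defaultdict

import pandas as pd

# Hypothetical stand-ins for dataset1 and dataset2.
dataset1 = pd.DataFrame({'SAX': ['cvtj', 'uriu', 'kzoe']})
dataset2 = pd.DataFrame({'SAX': ['kzoe', 'npyo', 'uriu', 'kzoe'],
                         'timestamp': [16015, 16016, 16017, 16018]})

# Build the SAX -> [timestamps] mapping in a single pass over dataset2.
timestamps_by_sax = defaultdict(list)
for sax, ts in zip(dataset2['SAX'], dataset2['timestamp']):
    timestamps_by_sax[sax].append(ts)

# Look up every dataset1 value; skip the ones that never occur in dataset2.
found = {sax: timestamps_by_sax[sax]
         for sax in dataset1['SAX']
         if sax in timestamps_by_sax}

for sax, stamps in found.items():
    print(sax, stamps)
```

Storing a list per SAX value keeps every occurrence, so a value that appears several times in dataset2 yields all of its timestamps.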

Step3: ENJOY!
