
How to match array-style strings in pandas using regular expressions?

I have a CSV file with only one column, which looks like this (df1):

Col_A
Name
Address
[B00-OUI_001]
Something else
etc.

and another one that looks like this:

df2

Col_B
[B00-OUI_000_V]
[B00-OUI_002_V]
[B00-OUI_003_V] 
[B00-OUI_001_V]
[B00-OUI_005_V]
[B00-OUI_006_V]
[B00-OUI_007_V]

I am trying to find the entries of df2 that match df1. For example, B00-OUI_001 appears in both DataFrames, but in df2 it carries a trailing _V. Since everything is in string format I turned to regular expressions, but I have been failing to get an exact match. Can someone help me with this?

You can strip the surrounding [] in both columns and filter with Series.str.startswith, which accepts a tuple:

tups = tuple(df1['Col_A'].str.strip('[]').unique())

df2 = df2[df2['Col_B'].str.strip('[]').str.startswith(tups)]
print (df2)
             Col_B
3  [B00-OUI_001_V]
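Note that startswith is prefix matching, so B00-OUI_001 would also select a hypothetical row like B00-OUI_0010_V. If you need a strictly exact match, one option (a sketch with the question's sample data, assuming pandas ≥ 1.4 for Series.str.removesuffix) is to strip the trailing _V and compare with Series.isin:

```python
import pandas as pd

# Sample frames reconstructed from the question (values are illustrative).
df1 = pd.DataFrame({'Col_A': ['Name', 'Address', '[B00-OUI_001]', 'Something else']})
df2 = pd.DataFrame({'Col_B': ['[B00-OUI_000_V]', '[B00-OUI_002_V]', '[B00-OUI_003_V]',
                              '[B00-OUI_001_V]', '[B00-OUI_005_V]', '[B00-OUI_006_V]',
                              '[B00-OUI_007_V]']})

# Strip the brackets and the trailing _V, then compare for exact equality
# with Series.isin instead of prefix matching.
keys = df1['Col_A'].str.strip('[]')
core = df2['Col_B'].str.strip('[]').str.removesuffix('_V')
print(df2[core.isin(keys)])
```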

Another idea is to join the unique values with | (regex OR) and use Series.str.contains:

v = '|'.join(df1['Col_A'].str.strip('[]').unique())

df2 = df2[df2['Col_B'].str.strip('[]').str.contains(v)]
print (df2)
             Col_B
3  [B00-OUI_001_V]
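One caveat with the contains approach: the joined string is interpreted as a regular expression, so if Col_A values can ever contain regex metacharacters, escape them with re.escape before joining. A small sketch with hypothetical values (the '+' is made up to show the problem):

```python
import re
import pandas as pd

# Hypothetical values containing a regex metacharacter ('+') to show why
# escaping matters before joining with '|' for str.contains.
df1 = pd.DataFrame({'Col_A': ['[B00-OUI_001+]']})
df2 = pd.DataFrame({'Col_B': ['[B00-OUI_001+_V]', '[B00-OUI_002_V]']})

# re.escape each unique value so metacharacters are matched literally.
pattern = '|'.join(re.escape(v) for v in df1['Col_A'].str.strip('[]').unique())
print(df2[df2['Col_B'].str.strip('[]').str.contains(pattern)])
```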

If it's only the trailing "_V" that breaks the exact match, why not get rid of it in a helper column? An exact join will always be faster than any kind of regex mapping.

What I mean:

df2["Col_B_edt"] = df2["Col_B"].str.replace("_V]", "]", regex=False)

df3 = pd.merge(df1, df2, left_on="Col_A", right_on="Col_B_edt").drop("Col_B_edt", axis=1)

Output:

           Col_A            Col_B
0  [B00-OUI_001]  [B00-OUI_001_V]
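For reference, a self-contained version of this merge approach (sample data reconstructed from the question; regex=False makes the "_V]" replacement literal):

```python
import pandas as pd

# Sample frames reconstructed from the question.
df1 = pd.DataFrame({'Col_A': ['Name', 'Address', '[B00-OUI_001]', 'Something else']})
df2 = pd.DataFrame({'Col_B': ['[B00-OUI_000_V]', '[B00-OUI_001_V]']})

# Normalize Col_B by dropping the trailing _V, then do a plain inner
# equi-join, which avoids row-wise regex matching entirely.
df2['Col_B_edt'] = df2['Col_B'].str.replace('_V]', ']', regex=False)
df3 = (pd.merge(df1, df2, left_on='Col_A', right_on='Col_B_edt')
         .drop('Col_B_edt', axis=1))
print(df3)
```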
