
How to match array-style strings in pandas using regular expressions?

I have a CSV file with only one column, which looks like this (df1):

Col_A
Name
Address
[B00-OUI_001]
Something else
etc.

and another one that looks like this:

df2

Col_B
[B00-OUI_000_V]
[B00-OUI_002_V]
[B00-OUI_003_V] 
[B00-OUI_001_V]
[B00-OUI_005_V]
[B00-OUI_006_V]
[B00-OUI_007_V]

I am trying to find the entries of df2 that match df1. For example, B00-OUI_001 appears in both DataFrames, but in df2 it carries a trailing _V. Since everything is in string format I turned to regular expressions, but I have been failing to get an exact match. Can someone help me with this?

You can strip the surrounding [] in both columns and filter with Series.str.startswith, which accepts a tuple:

tups = tuple(df1['Col_A'].str.strip('[]').unique())

df2 = df2[df2['Col_B'].str.strip('[]').str.startswith(tups)]
print (df2)
             Col_B
3  [B00-OUI_001_V]
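Note that startswith is prefix matching, so B00-OUI_001 would also select a hypothetical row like B00-OUI_0010_V. If you need a strictly exact match, one option (a sketch with the question's sample data, assuming pandas ≥ 1.4 for Series.str.removesuffix) is to strip the trailing _V and compare with Series.isin:

```python
import pandas as pd

# Sample frames reconstructed from the question (values are illustrative).
df1 = pd.DataFrame({'Col_A': ['Name', 'Address', '[B00-OUI_001]', 'Something else']})
df2 = pd.DataFrame({'Col_B': ['[B00-OUI_000_V]', '[B00-OUI_002_V]', '[B00-OUI_003_V]',
                              '[B00-OUI_001_V]', '[B00-OUI_005_V]', '[B00-OUI_006_V]',
                              '[B00-OUI_007_V]']})

# Strip the brackets and the trailing _V, then compare for exact equality
# with Series.isin instead of prefix matching.
keys = df1['Col_A'].str.strip('[]')
core = df2['Col_B'].str.strip('[]').str.removesuffix('_V')
print(df2[core.isin(keys)])
```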

Another idea is to join the unique values with | (regex OR) and use Series.str.contains:

v = '|'.join(df1['Col_A'].str.strip('[]').unique())

df2 = df2[df2['Col_B'].str.strip('[]').str.contains(v)]
print (df2)
             Col_B
3  [B00-OUI_001_V]
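One caveat with the contains approach: the joined string is interpreted as a regular expression, so if Col_A values can ever contain regex metacharacters, escape them with re.escape before joining. A small sketch with hypothetical values (the '+' is made up to show the problem):

```python
import re
import pandas as pd

# Hypothetical values containing a regex metacharacter ('+') to show why
# escaping matters before joining with '|' for str.contains.
df1 = pd.DataFrame({'Col_A': ['[B00-OUI_001+]']})
df2 = pd.DataFrame({'Col_B': ['[B00-OUI_001+_V]', '[B00-OUI_002_V]']})

# re.escape each unique value so metacharacters are matched literally.
pattern = '|'.join(re.escape(v) for v in df1['Col_A'].str.strip('[]').unique())
print(df2[df2['Col_B'].str.strip('[]').str.contains(pattern)])
```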

If it's only the trailing "_V" that breaks the exact match, why not get rid of it in a helper column? An exact join will always be faster than any kind of regex mapping.

What I mean:

df2["Col_B_edt"] = df2["Col_B"].str.replace("_V]", "]", regex=False)

df3 = pd.merge(df1, df2, left_on="Col_A", right_on="Col_B_edt").drop("Col_B_edt", axis=1)

Output:

           Col_A            Col_B
0  [B00-OUI_001]  [B00-OUI_001_V]
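For reference, a self-contained version of this merge approach (sample data reconstructed from the question; regex=False makes the "_V]" replacement literal):

```python
import pandas as pd

# Sample frames reconstructed from the question.
df1 = pd.DataFrame({'Col_A': ['Name', 'Address', '[B00-OUI_001]', 'Something else']})
df2 = pd.DataFrame({'Col_B': ['[B00-OUI_000_V]', '[B00-OUI_001_V]']})

# Normalize Col_B by dropping the trailing _V, then do a plain inner
# equi-join, which avoids row-wise regex matching entirely.
df2['Col_B_edt'] = df2['Col_B'].str.replace('_V]', ']', regex=False)
df3 = (pd.merge(df1, df2, left_on='Col_A', right_on='Col_B_edt')
         .drop('Col_B_edt', axis=1))
print(df3)
```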
