I have a CSV file that contains only one column, which looks like this (df1):
Col_A
Name
Address
[B00-OUI_001]
Something else
etc.
and another file that has something like this (df2):
Col_B
[B00-OUI_000_V]
[B00-OUI_002_V]
[B00-OUI_003_V]
[B00-OUI_001_V]
[B00-OUI_005_V]
[B00-OUI_006_V]
[B00-OUI_007_V]
I am trying to find the entries from df2 that match entries in df1. For example, B00-OUI_001 is in both DataFrames, but in df2 it carries an extra _V suffix. Since everything is in string format, I turned to regular expressions, but I have been failing to get an exact match. Can someone help me with this?
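For reference, the setup can be reproduced with two small DataFrames (a minimal sketch; the values are taken from the samples above, the construction itself is hypothetical):

```python
import pandas as pd

# df1: mixed text plus one bracketed code.
df1 = pd.DataFrame({"Col_A": ["Name", "Address", "[B00-OUI_001]", "Something else"]})

# df2: bracketed codes, each with a trailing "_V" before the closing bracket.
df2 = pd.DataFrame({"Col_B": [f"[B00-OUI_{i:03d}_V]" for i in [0, 2, 3, 1, 5, 6, 7]]})
```

The goal is to keep the df2 rows whose code appears in df1 once the "_V" suffix is ignored.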
You can remove the surrounding [] in both columns and filter with Series.str.startswith, which accepts a tuple:
tups = tuple(df1['Col_A'].str.strip('[]').unique())
df2 = df2[df2['Col_B'].str.strip('[]').str.startswith(tups)]
print(df2)
Col_B
3  [B00-OUI_001_V]
Another idea is to join the unique values with | for a regex OR and use Series.str.contains:
v = '|'.join(df1['Col_A'].str.strip('[]').unique())
df2 = df2[df2['Col_B'].str.strip('[]').str.contains(v)]
print(df2)
Col_B
3  [B00-OUI_001_V]
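One caveat with the `|`-join approach: the joined string is interpreted as a regular expression, so if the codes ever contain regex metacharacters it is safer to escape each value first. A defensive sketch, assuming the same column names and small hypothetical frames:

```python
import re
import pandas as pd

df1 = pd.DataFrame({"Col_A": ["Name", "[B00-OUI_001]"]})
df2 = pd.DataFrame({"Col_B": ["[B00-OUI_000_V]", "[B00-OUI_001_V]"]})

# re.escape makes every value match literally inside the alternation.
pattern = "|".join(re.escape(v) for v in df1["Col_A"].str.strip("[]").unique())
matched = df2[df2["Col_B"].str.strip("[]").str.contains(pattern)]
```

Here only the `[B00-OUI_001_V]` row survives, because `B00-OUI_001_V` contains the literal `B00-OUI_001`.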
If it's only the "_V" suffix that disrupts the exact match, why not get rid of it and create a helper join column? An exact join will always be faster than any kind of regex matching.
What I mean:
df2["Col_B_edt"] = df2["Col_B"].str.replace("_V]", "]", regex=False)
df3 = pd.merge(df1, df2, left_on="Col_A", right_on="Col_B_edt").drop("Col_B_edt", axis=1)
Output:
Col_A Col_B
0 [B00-OUI_001] [B00-OUI_001_V]