[英]python - "merge based on a partial match" - Improving performance of function
我有以下脚本 - 旨在创建“基于部分匹配的合并”功能,因为据我所知,这对于普通的.merge()
功能是不可能的。
下面的工作/返回所需的结果,但不幸的是,它非常慢,以至于在我需要它的地方几乎无法使用。
一直在查看其他包含类似问题的 Stack Overflow 帖子,但还没有找到更快的解决方案。
任何关于如何实现这一点的想法将不胜感激!
import pandas as pd
df1 = pd.DataFrame([ 'https://wwww.example.com/hi', 'https://wwww.example.com/tri', 'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi' ]
,columns = ['pages']
)
df2 = pd.DataFrame(['hi','bi','geo']
,columns = ['ngrams']
)
def join_on_partial_match(full_values=None, matching_criteria=None):
# Changing columns name with index number
full_values.columns.values[0] = "full"
matching_criteria.columns.values[0] = "ngram_match"
# Creating matching column so all rows match on join
full_values['join'] = 1
matching_criteria['join'] = 1
dfFull = full_values.merge(matching_criteria, on='join').drop('join', axis=1)
# Dropping the 'join' column we created to join the 2 tables
matching_criteria = matching_criteria.drop('join', axis=1)
# identifying matching and returning bool values based on whether match exists
dfFull['match'] = dfFull.apply(lambda x: x.full.find(x.ngram_match), axis=1).ge(0)
# filtering dataset to only 'True' rows
final = dfFull[dfFull['match'] == True]
final = final.drop('match', axis=1)
return final
join = join_on_partial_match(full_values=df1,matching_criteria=df2)
print(join)
>> full ngram_match
0 https://wwww.example.com/hi hi
7 https://wwww.example.com/bi bi
9 https://wwww.example.com/hihibi hi
10 https://wwww.example.com/hihibi bi
对于任何有兴趣的人 - 最终想出了两种方法来做到这一点。
def partial_match_join_all_matches_returned(full_values=None, matching_criteria=None):
"""The partial_match_join_first_match_returned() function takes two series objects and returns a dataframe with all matching values (duplicating the full value).
Args:
full_values = None: This is the series that contains the full values for matching pair.
partial_values = None: This is the series that contains the partial values for matching pair.
Returns:
A dataframe with 2 columns - 'full' and 'match'.
"""
start_join1 = time.time()
matching_criteria = matching_criteria.to_frame("match")
full_values = full_values.to_frame("full")
full_values = full_values.drop_duplicates()
output=[]
for n in matching_criteria['match']:
mask = full_values['full'].str.contains(n, case=False, na=False)
df = full_values[mask]
df_copy = df.copy()
df_copy['match'] = n
# df = df.loc[n, 'match']
output.append(df_copy)
final = pd.concat(output)
end_join1 = (time.time() - start_join1)
end_join1 = str(round(end_join1, 2))
len_join1 = len(final)
return final
def partial_match_join_first_match_returned(full_values=None, matching_criteria=None):
"""The partial_match_join_first_match_returned() function takes two series objects and returns a dataframe with the first matching value.
Args:
full_values = None: This is the series that contains the full values for matching pair.
partial_values = None: This is the series that contains the partial values for matching pair.
Returns:
A dataframe with 2 columns - 'full' and 'match'.
"""
start_singlejoin = time.time()
matching_criteria = matching_criteria.to_frame("match")
full_values = full_values.to_frame("full").drop_duplicates()
output=[]
for n in matching_criteria['match']:
mask = full_values['full'].str.contains(n, case=False, na=False)
df = full_values[mask]
df_copy = df.copy()
df_copy['match'] = n
# leaves us with only the 1st of each URL
df_copy.drop_duplicates(subset=['full'])
output.append(df_copy)
final = pd.concat(output)
end_singlejoin = (time.time() - start_singlejoin)
end_singlejoin = str(round(end_singlejoin, 2))
len_singlejoin = len(final)
return final
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.