Fuzzy URL matching in Python

I'd like to find a tool that does a good job of fuzzy matching URLs that are the same except for extra parameters. For instance, for my use case, these two URLs are the same:

atest = ('http://www.npr.org/templates/story/story.php?storyId=4231170', 'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp')

At first blush, fuzz.partial_ratio and fuzz.token_set_ratio from fuzzywuzzy get the job done with a threshold of 100:

from fuzzywuzzy import fuzz

ratio = fuzz.ratio(atest[0], atest[1])
partialratio = fuzz.partial_ratio(atest[0], atest[1])
sortratio = fuzz.token_sort_ratio(atest[0], atest[1])
setratio = fuzz.token_set_ratio(atest[0], atest[1])
print('ratio: %s' % (ratio))
print('partialratio: %s' % (partialratio))
print('sortratio: %s' % (sortratio))
print('setratio: %s' % (setratio))
>>>ratio: 83
>>>partialratio: 100
>>>sortratio: 83
>>>setratio: 100

But this approach fails in other cases, where it also returns 100 for URLs that are not the same, for example:

atest = ('yahoo.com', 'http://finance.yahoo.com/news/earnings-preview-monsanto-report-2q-174000816.html')

The URLs in my data and the parameters added to them vary a great deal. I'm interested to know whether anyone has a better approach using URL parsing or similar.

If all you want is to check that all query parameters in the first URL are present in the second URL, you can do it more simply with a set difference:

import urllib.parse as urlparse

base_url = 'http://www.npr.org/templates/story/story.php?storyId=4231170'
check_url = 'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp'

base_url_parameters = set(urlparse.parse_qs(urlparse.urlparse(base_url).query).keys())
check_url_parameters = set(urlparse.parse_qs(urlparse.urlparse(check_url).query).keys())

print(base_url_parameters - check_url_parameters)

This prints an empty set, but if you change the base URL to something like

base_url = 'http://www.npr.org/templates/story/story.php?storyId=4231170&test=1'

it prints {'test'}, which means that the base URL contains parameters that are missing from the second URL.
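
Note that the set difference above only compares parameter names, so it would not reject the 'yahoo.com' pair from the question, where neither URL has a query string. A minimal sketch of one way to extend the idea is shown here: it also compares the host and path, and checks parameter values as well as names. The function name same_url_ignoring_extra_params is hypothetical, and the matching rules (scheme ignored, host compared case-insensitively, path compared exactly) are assumptions that may need adjusting for real data.

```python
from urllib.parse import urlsplit, parse_qs


def same_url_ignoring_extra_params(base_url, check_url):
    """Return True if check_url is base_url plus, at most, extra query parameters.

    Hypothetical helper: the scheme is ignored, which is an assumption.
    """
    base = urlsplit(base_url)
    check = urlsplit(check_url)

    # Host (case-insensitive) and path must match exactly.
    if base.netloc.lower() != check.netloc.lower() or base.path != check.path:
        return False

    base_params = parse_qs(base.query)
    check_params = parse_qs(check.query)

    # Every parameter of the base URL must appear in the other URL
    # with the same list of values; extra parameters are allowed.
    return all(check_params.get(key) == values
               for key, values in base_params.items())


npr = 'http://www.npr.org/templates/story/story.php?storyId=4231170'
print(same_url_ignoring_extra_params(npr, npr + '&sc=fb&cc=fp'))  # True
print(same_url_ignoring_extra_params(
    'yahoo.com',
    'http://finance.yahoo.com/news/earnings-preview-monsanto-report-2q-174000816.html',
))  # False
```

A bare string such as 'yahoo.com' has no scheme, so urlsplit treats it as a path with an empty host, which is why the second pair is rejected. If such inputs should be treated as hostnames, they would need to be normalized first.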
