Fuzzy-match List of People
I am trying to determine whether a movie is the same across two pages, and I would like to compare the actors as one of the criteria. However, actors are often listed differently on different pages. For example:
On this page, https://play.google.com/store/movies/details?id=cSdcb2KOH74 , the actors are listed as "Mikhail Galustyan, Danny Trejo, Guillermo Díaz, Oleg Taktarov, Kym Whitley, Christopher Robin Miller, Robert Bear, Vladimir Yaglych, Josh McLerran".
On this page, http://www.imdb.com/title/tt2167970/ , the actors are listed as "Ivan Stebunov, Ingrid Olerinskaya, Vladimir Yaglych".
Previously, I was doing a very rough match on:
if actors_from_site_1[0] == actors_from_site_2[0]:
But, as you can see from the above case, this isn't a good technique: the first-listed actor differs between the two pages even though Vladimir Yaglych appears on both. What would be a better way to check whether the actors listed for one film match those listed for the other?
You could check whether the intersection of the two sets of actors is non-empty:
if len(set(actors_from_site_1).intersection(set(actors_from_site_2))):
Or you could do something like:
if any(actor in actors_from_site_1 for actor in actors_from_site_2):
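Here is a rough sketch of the set-intersection idea using the two cast lists from the question. The function name actor_overlap and the idea of turning the intersection into a ratio are my own assumptions, not part of the answer above; you would still need to pick a threshold that suits your data.

```python
def actor_overlap(actors_1, actors_2):
    """Return the shared names and the share of the smaller cast they cover.

    Hypothetical helper: the ratio is one possible score, not a standard one.
    """
    set_1, set_2 = set(actors_1), set(actors_2)
    if not set_1 or not set_2:
        return set(), 0.0
    common = set_1 & set_2
    return common, len(common) / min(len(set_1), len(set_2))


actors_from_site_1 = [
    "Mikhail Galustyan", "Danny Trejo", "Guillermo Díaz", "Oleg Taktarov",
    "Kym Whitley", "Christopher Robin Miller", "Robert Bear",
    "Vladimir Yaglych", "Josh McLerran",
]
actors_from_site_2 = ["Ivan Stebunov", "Ingrid Olerinskaya", "Vladimir Yaglych"]

common, ratio = actor_overlap(actors_from_site_1, actors_from_site_2)
print(common, ratio)
```

Dividing by the smaller cast keeps a short list (like the three names on the second page) from being penalised just because the other page lists more people.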
If the lists are comma-separated strings of actor names, split them on the commas, strip the surrounding whitespace, lowercase the names, and take the intersection:
actors_from_site_1 = {name.strip() for name in actors_from_site_1.lower().split(',')}
actors_from_site_2 = {name.strip() for name in actors_from_site_2.lower().split(',')}
common_actors = actors_from_site_1 & actors_from_site_2
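If one site writes "Guillermo Díaz" and another writes "Guillermo Diaz", lowercasing alone will not make them equal. A possible extension, assuming you are happy to drop accents, is to normalise each name before building the sets. The helper names normalize_name and actor_set are made up for this sketch.

```python
import unicodedata


def normalize_name(name):
    """Casefold, drop combining accents, and collapse whitespace (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def actor_set(comma_separated):
    """Split a comma-separated cast string into a set of normalised names."""
    return {
        normalize_name(part)
        for part in comma_separated.split(",")
        if part.strip()
    }


site_1 = "Danny Trejo, Guillermo Díaz, Vladimir Yaglych"
site_2 = "GUILLERMO DIAZ,Ivan Stebunov"
print(actor_set(site_1) & actor_set(site_2))
```

Note that stripping accents can merge names that are genuinely different, so treat this as a trade-off rather than a safe default.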
Try:
similaractors = []
for actor in actors_from_site_1:
    if actor in actors_from_site_2:
        similaractors.append(actor)
Then you have similaractors as a list of all the actors they share. Call len(similaractors) to get the number of shared actors, and then you can print(similaractors) and do everything else you might do with a list.
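None of the approaches above catch names that are spelled slightly differently on the two pages (for example, different transliterations). If that matters, one option is the standard library's difflib. This is only a sketch: the function name fuzzy_common_actors and the 0.85 cutoff are assumptions you would want to tune on real data.

```python
import difflib


def fuzzy_common_actors(actors_1, actors_2, cutoff=0.85):
    """Pair each name in actors_1 with its closest match in actors_2.

    Only pairs whose similarity ratio is at least `cutoff` are returned.
    Hypothetical helper; the cutoff value is a guess, not a recommendation.
    """
    lowered_2 = [name.strip().lower() for name in actors_2]
    pairs = []
    for name in actors_1:
        key = name.strip().lower()
        # get_close_matches returns up to n candidates, best first.
        matches = difflib.get_close_matches(key, lowered_2, n=1, cutoff=cutoff)
        if matches:
            pairs.append((key, matches[0]))
    return pairs


pairs = fuzzy_common_actors(
    ["Danny Trejo", "Vladimir Yaglych"],
    ["Ivan Stebunov", "Vladimir Yaglich"],  # made-up alternate spelling
)
print(pairs)
```

A lower cutoff catches more spelling variants but also raises the chance of pairing two different people, so it is worth checking a sample of matches by hand.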