
Find Substring Matches Between Two Files

I have a list of movie titles and a list of names.

Movies:

  • Independence Day
  • Who Framed Roger Rabbit
  • Rosemary's Baby
  • Ghostbusters
  • There's Something About Mary

Names:

  • Roger
  • Kyle
  • Mary
  • Sam

I want to make a new list of all the movies that match a name from the names list.

  • Who Framed Roger Rabbit (matched "roger")
  • Rosemary's Baby (matched "mary")
  • There's Something About Mary (matched "mary")

I've tried to do this in Python, but for some reason it isn't working. The resulting file is empty.

with open("movies.csv", "r") as movieList:
    movies = movieList.readlines()

with open("names.txt", "r") as namesToCheck:
    names = namesToCheck.readlines()

with open("matches.csv", "w") as matches:
    matches.truncate(0)

    for i in range(len(movies)):
        for j in range(len(names)):
            if names[j].lower() in movies[i].lower():
                matches.write(movies[i])
                break

    matches.close();

What am I missing here?

The reason that you aren't getting any results is likely that readlines() returns each line with its newline character, \n, still attached to the end. Your program is therefore checking whether "roger\n" is in a line of the movies file rather than just "roger", which can only succeed when the name sits at the very end of that line.
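A quick way to see this is the minimal sketch below. It uses io.StringIO as an in-memory stand-in for names.txt, and the movie row with an extra column is a hypothetical example of what a line in movies.csv might look like, not the asker's actual data.

```python
from io import StringIO

# In-memory stand-in for names.txt (contents assumed from the question).
names_file = StringIO("Roger\nKyle\nMary\nSam\n")
names = names_file.readlines()

# readlines() keeps the line terminator on every entry.
print(names)

# Hypothetical CSV row: the title is followed by another column.
movie_line = "Who Framed Roger Rabbit,1988\n"

# The name still carries "\n", so it is not found in the middle of the row.
with_newline = names[0].lower() in movie_line.lower()

# Stripping the newline first makes the substring test behave as intended.
without_newline = names[0].strip().lower() in movie_line.lower()

print(with_newline, without_newline)
```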

To fix this, you could simply add a [:-1] to your if statement to only check the name and not the newline:

if names[j].lower()[:-1] in movies[i].lower():
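One caveat with slicing: [:-1] removes the last character whatever it is, so if the final line of the file has no trailing newline, the last name loses a real letter. The sketch below, with made-up sample lines, compares it with rstrip("\n"), which only removes a newline when one is present.

```python
# The last entry has no trailing newline, as can happen at the end of a file.
lines = ["Mary\n", "Sam"]

# Slicing drops the final character unconditionally.
sliced = [line[:-1] for line in lines]

# rstrip("\n") removes trailing newlines only.
stripped = [line.rstrip("\n") for line in lines]

print(sliced)
print(stripped)
```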

You could also change the way you read the names file, using read().splitlines() to get rid of the newline characters, like this:

names = namesToCheck.read().splitlines()
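Putting the pieces together, a corrected version might look like the sketch below. The find_matches helper is a hypothetical name introduced here for illustration, and io.StringIO stands in for the real movies.csv and names.txt files. It also skips blank lines in the names list, since an empty string is a substring of every title and would match everything.

```python
from io import StringIO


def find_matches(movie_lines, names):
    """Return (movie, matched_name) pairs for case-insensitive substring hits."""
    # Normalise the names once and drop blank entries.
    lowered = [n.strip().lower() for n in names if n.strip()]
    results = []
    for movie in movie_lines:
        for name in lowered:
            if name in movie.lower():
                results.append((movie, name))
                break  # one match per movie is enough
    return results


# In-memory stand-ins for the two input files (sample data from the question).
movies = StringIO(
    "Independence Day\n"
    "Who Framed Roger Rabbit\n"
    "Rosemary's Baby\n"
    "Ghostbusters\n"
    "There's Something About Mary\n"
).read().splitlines()
names = StringIO("Roger\nKyle\nMary\nSam\n").read().splitlines()

matches = find_matches(movies, names)
for movie, name in matches:
    print(f'{movie} (matched "{name}")')
```

With real files, the same function could be fed the result of open(...).read().splitlines(), and each matched movie written to matches.csv followed by a newline.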

This works:

from io import StringIO

Movies = """Independence Day
Who Framed Roger Rabbit
Rosemary's Baby
Ghostbusters
There's Something About Mary
"""

Names = """Roger
Kyle
Mary
Sam"""

# Read both lists from in-memory files, stripping newlines and lowercasing.
with StringIO(Movies) as movie_file:
    movies = [n.strip().lower() for n in movie_file.readlines()]
with StringIO(Names) as name_file:
    names = [n.strip().lower() for n in name_file.readlines()]

# Print every (name, film) pair where the name occurs inside the title.
for name in names:
    for film in movies:
        if name in film:
            print("{:20s} {:40s}".format(name, film))

Output:

roger                who framed roger rabbit
mary                 rosemary's baby
mary                 there's something about mary

