
Python compute cosine similarity on two directories of files

I have two directories of files. One contains human-transcribed files and the other contains IBM Watson transcribed files. Both directories have the same number of files, and both were transcribed from the same telephony recordings.

I'm computing cosine similarity between the matching files using spaCy's .similarity, and I want to print or store the result along with the names of the compared files. I have tried both a function and plain for loops, but I cannot find a way to iterate over both directories, compare the two files with a matching index, and print the result.

Here's my current code:

# iterate through files in both directories
for human_file, api_file in os.listdir(human_directory), os.listdir(api_directory):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(human_file).read())
    api_model = nlp_small(open(api_file).read())
    
    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

I've gotten it to work with iterating through just one directory and checked that it has the expected output by printing the file name, but it doesn't work when using both directories. I've also tried something like this:

# define directories
human_directory = os.listdir("./00_data/Human Transcripts")
api_directory = os.listdir("./00_data/Watson Scripts")

# function for cosine similarity of files in two directories using small model
def nlp_small(human_directory, api_directory):
    for i in (0, (len(human_directory) - 1)):
        print(human_directory[i], api_directory[i])

nlp_small(human_directory, api_directory)

Which returns:

human_10.txt watson_10.csv
human_9.txt watson_9.csv

But that's only two of the files, not all 17 of them.

Any pointers on iterating through a matching index on both directories would be much appreciated.

Two minor errors are preventing you from looping through both directories. In the second example, the for loop only visits index 0 and index len(human_directory) - 1, because (0, len(human_directory) - 1) is a two-element tuple rather than a range. Instead, use for i in range(len(human_directory)): so that the loop covers every index in both lists.
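To make the difference concrete, here is a minimal sketch. The two short lists are hypothetical stand-ins for the os.listdir() results, not the asker's real 17 files.

```python
# Hypothetical stand-ins for the two os.listdir() results.
human_files = ["human_1.txt", "human_2.txt", "human_3.txt"]
api_files = ["watson_1.csv", "watson_2.csv", "watson_3.csv"]

# A tuple literal holds exactly two values, so a loop over it runs twice.
tuple_indices = [i for i in (0, len(human_files) - 1)]

# range() yields every index from 0 up to len(human_files) - 1.
range_indices = [i for i in range(len(human_files))]

# Looping over range() visits every matching pair.
for i in range(len(human_files)):
    print(human_files[i], api_files[i])
```

With the tuple, only the first and last entries are visited, which matches the two-line output in the question.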

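In the first example, you most likely get a "too many values to unpack" error, because the loop iterates over a tuple of two lists and tries to unpack each whole list into human_file and api_file. To loop through two iterables in parallel, use zip(), so it should look like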

for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
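Two further details may matter once the loop runs: os.listdir() returns bare file names in arbitrary order, so the names need to be joined with their directory before opening, and both listings should be ordered the same way before zipping. Below is a rough sketch under those assumptions. The helpers numeric_id, pair_files and print_similarities are hypothetical names, and the pairing assumes each file name contains one number identifying the recording, as in human_10.txt and watson_10.csv.

```python
import os
import re


def numeric_id(name):
    """Return the first run of digits in a file name as an int (hypothetical helper)."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1


def pair_files(human_dir, api_dir):
    """Pair files from both directories by their numeric id, returning full paths."""
    human_names = sorted(os.listdir(human_dir), key=numeric_id)
    api_names = sorted(os.listdir(api_dir), key=numeric_id)
    # zip() stops at the shorter list, so unequal directories are silently truncated.
    return [
        (os.path.join(human_dir, h), os.path.join(api_dir, a))
        for h, a in zip(human_names, api_names)
    ]


def print_similarities(nlp, human_dir, api_dir):
    """Print the similarity of each file pair; nlp is a loaded spaCy pipeline."""
    for human_path, api_path in pair_files(human_dir, api_dir):
        with open(human_path, encoding="utf-8") as f:
            human_doc = nlp(f.read())
        with open(api_path, encoding="utf-8") as f:
            api_doc = nlp(f.read())
        print(os.path.basename(human_path), os.path.basename(api_path),
              human_doc.similarity(api_doc))
```

Assuming nlp_small is the spaCy pipeline already loaded in the question, the call would be print_similarities(nlp_small, "./00_data/Human Transcripts", "./00_data/Watson Scripts"). That only works if the name nlp_small has not been reused for a function, as it is in the second snippet.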

