Semantic Similarity between Sentences in a Text

I have used material from here and a previous forum page to write a program that automatically calculates the semantic similarity between consecutive sentences across a whole text. Here it is:

The first part of the code is copied from the first link. I removed everything after line 245 and put the following in its place:

with open ("File_Name", "r") as sentence_file:
    while x and y:
        x = sentence_file.readline()
        y = sentence_file.readline()
        similarity(x, y, true)           
#boolean set to false or true 
        x = y
        y = sentence_file.readline() 

My text file is formatted like this:

Red alcoholic drink. Fresh orange juice. An English dictionary. The Yellow Wallpaper.

In the end I want to display every pair of consecutive sentences with its similarity next to it, like this:

["Red alcoholic drink.", "Fresh orange juice.", 0.611],

["Fresh orange juice.", "An English dictionary.", 0.0]

["An English dictionary.", "The Yellow Wallpaper.",  0.5]

if norm(vec_1) > 0 and if norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1)* np.linalg.norm(vec_2))
 elif norm(vec_1) < 0 and if norm(vec_2) < 0:
    ???Move On???

This should work. There are a few things to note in the comments. Basically, you can loop through the lines in the file and store the results as you go. One way to process two lines at a time is to set up an "infinite loop" and check the last line we've read to see if we've hit the end (readline() returns an empty string, not None, at the end of a file).

# You'll probably need the file extension (.txt or whatever) in open() as well
with open("File_Name.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []

    # Read the first sentence; .rstrip('\n') removes the trailing newline
    x = sentence_file.readline().rstrip('\n')

    # Loop until we hit the end of the file
    while True:
        # Read the next line
        y = sentence_file.readline()

        # readline() returns '' at the end of the file, so we're done
        if not y:
            # Break out of the infinite loop
            break

        y = y.rstrip('\n')

        try:
            # Calculate your similarity value
            similarity_value = similarity(x, y, True)

            # Add the two lines and similarity value to the results list
            results.append([x, y, similarity_value])
        except Exception:
            print("Error when parsing lines:\n{}\n{}\n".format(x, y))

        # Reuse the current line as the first sentence of the next pair,
        # so the pairs are consecutive: (1, 2), (2, 3), (3, 4), ...
        x = y

# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)
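
The loop above assumes one sentence per line. If the whole text sits on a single line, as in the sample from the question, the sentences have to be split first. Here is a minimal sketch of that, assuming a naive rule (split after '.', '!' or '?' followed by whitespace) that will trip on abbreviations. The helper names split_sentences and consecutive_pairs are made up for illustration and are not part of the code from the link.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (a naive rule)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings, which appear when the input is empty
    return [p for p in parts if p]


def consecutive_pairs(sentences):
    """Pair each sentence with the one that follows it."""
    return list(zip(sentences, sentences[1:]))


text = ("Red alcoholic drink. Fresh orange juice. "
        "An English dictionary. The Yellow Wallpaper.")
sentences = split_sentences(text)
pairs = consecutive_pairs(sentences)

for first, second in pairs:
    # Replace this print with a call to similarity(first, second, True)
    print([first, second])
```

Four sentences give three pairs, and each pair can then be passed to similarity() in the same way as in the file loop.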

Edit: Regarding the issues you're getting from similarity(): if you simply want to ignore the line pairs that cause these errors (without looking at the source in depth I really have no idea what's going on), you can wrap the call to similarity() in a try/except block.
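
If the errors turn out to come from a zero-length vector, which is what the norm check in the question seems to be guarding against, the guard can live inside the cosine calculation itself. A norm is never negative, so a "< 0" branch would never run; the case to catch is a norm of exactly 0. The following is only a sketch under that assumption. It is written without NumPy so it runs standalone, and cosine_similarity is a hypothetical stand-in, not the function from the linked code.

```python
import math


def cosine_similarity(vec_1, vec_2):
    """Return the cosine similarity, or None if either vector has zero length."""
    norm_1 = math.sqrt(sum(a * a for a in vec_1))
    norm_2 = math.sqrt(sum(b * b for b in vec_2))

    # A zero norm would cause a division by zero, so signal "no value" instead
    if norm_1 == 0 or norm_2 == 0:
        return None

    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    return dot / (norm_1 * norm_2)
```

The caller can then skip any pair for which the result is None, rather than relying on an exception to move on.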

