
Algorithm for comparing phrases and sentences for related meaning

I'm new to machine learning and would be very appreciative if you could point me in the right direction: what simple tool(s) are suited to an algorithm that compares any two phrases consisting of different words but carrying the same meaning? A random example:

Phrase A:
"Solving mac computers operating system issues"

Phrase B:
"Fixing apple OS X errors"

The task is to analyze a massive quantity of phrases and sentences consisting of different words and reveal those that have the same, or close to the same, meaning.

I'd like to know whether this is actually possible and, if so, with what tools or programming languages, and how it works.

Is there an algorithm that uses a synonym dictionary for such a purpose?

How did Google solve such a task, if they ever had the need? I know they parse and analyze tonnes of data, but how would they approach this?

Thank you!!

You can solve this problem by determining "semantic similarity" (the similarity of the meaning) between passages. Currently, the best way to do so is to leverage Deep Learning algorithms.

Specifically, I've used the following library extensively: https://github.com/UKPLab/sentence-transformers

This library features models like BERT & XLNet that have been "fine-tuned" (adapted) to the task of projecting passages into a 768-dimensional vector space that represents the meaning of the input.
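
As a quick sanity check (a minimal sketch; it assumes the same 'bert-base-nli-mean-tokens' model used below, and the exact return type of encode can vary slightly between library versions), you can verify the output dimensionality directly:

from sentence_transformers import SentenceTransformer

# Load the pretrained embedding model (weights are downloaded on first use)
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# encode() maps each input sentence to one embedding vector
vectors = embedder.encode(["Fixing apple OS X errors"])
print(vectors[0].shape)  # (768,) -- one 768-dimensional vector per sentence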

The idea is that the closer two resulting output vectors are to each other (by cosine distance, Manhattan distance, etc.), the closer their input passages are in meaning.

Here's a small code snippet that demonstrates how you can use this library:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a BERT model fine-tuned on NLI data to produce sentence embeddings
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

def manhattan_distance(x, y):
    # Sum of absolute element-wise differences between two vectors
    return np.sum(np.abs(x - y))

anchor_phrase = "Solving mac computers operating system issues"
candidate_phrases = [
    "Fixing apple OS X errors",
    "Troubleshooting iPhone problems"
]

# Encode the anchor and all candidates in a single batch
embeddings = embedder.encode([anchor_phrase] + candidate_phrases)
anchor_embedding = embeddings[0]

# Pair each candidate phrase with its distance to the anchor
candidates = [(phrase, manhattan_distance(anchor_embedding, embedding))
              for phrase, embedding in zip(candidate_phrases, embeddings[1:])]
print(candidates)

This should print [('Fixing apple OS X errors', 275.67545), ('Troubleshooting iPhone problems', 313.4759)]. The lower the distance (the second item in each tuple), the more semantically similar the sentence is to the anchor.
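
Since cosine distance was mentioned above as an alternative metric, here is a minimal sketch of the same comparison using cosine similarity; it reuses anchor_embedding, candidate_phrases, and embeddings from the snippet above:

import numpy as np

def cosine_similarity(x, y):
    # Dot product of the vectors divided by the product of their norms;
    # ranges from -1 to 1, with values near 1 meaning "very similar"
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Reuses anchor_embedding, candidate_phrases, and embeddings from above
scores = [(phrase, cosine_similarity(anchor_embedding, embedding))
          for phrase, embedding in zip(candidate_phrases, embeddings[1:])]

# Note the direction is reversed: higher similarity = closer in meaning
print(sorted(scores, key=lambda s: s[1], reverse=True))

Unlike Manhattan distance, cosine similarity ignores the magnitude of the vectors and compares only their direction, which is the more common convention for comparing sentence embeddings.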
