
Comparing two strings for similarity in PHP

I'm trying to find the best way to compare a string against a list of similar strings and pick the closest match.

I have an array of straight movie names. I also have an array of movie names with additional text.

Example:

My straight movie name array contains strings like so:

"Super Troopers", 
"Everest", 
"Star Wars: Episode I The Phantom Menace"

The strings in my other array look similar to the following:

"Super Troopers (2001) 720P-AC3-x264", 
"Everest - 2015.1080p.DTS mkv", 
"Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD"

What I'm currently doing is looping through my first array and comparing each movie name with the second array using strpos(). If I find an exact match, great. If not, I need some other function to work out which two strings are most similar. I have tried similar_text() and levenshtein() with mixed results.

In my examples above, strpos() would have matched both Everest and Super Troopers just fine, but the Star Wars string needs additional checks. Hyphens, colons, "I" versus "1", and the extra information that follows the movie name all seem to give me sporadic results with similar_text() and levenshtein().

I'm thinking of first cutting down the strings that carry the additional information: take the strlen() of the movie name plus 5 or so extra characters for good measure, keep only that many characters from the start of the longer string, and then run similar_text() or levenshtein() on the result. The one thing these strings all have in common is that the movie name is at the start. Would that make the similarity functions more accurate?
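The truncation idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: prefixSimilarity() and bestMatch() are hypothetical helper names, the slack of 5 characters comes from the question, and substr()/strlen() count bytes, so multibyte titles would need the mb_* equivalents.

```php
<?php
// Hypothetical helper: compare a clean title with only the leading part
// of a release string (title length plus a few characters of slack).
function prefixSimilarity(string $title, string $release, int $slack = 5): float
{
    // Cut the release string down to roughly the length of the title.
    $prefix = substr($release, 0, strlen($title) + $slack);

    // similar_text() writes a similarity percentage into its third argument.
    similar_text(strtolower($title), strtolower($prefix), $percent);

    return $percent;
}

// Hypothetical helper: return the release string with the highest score.
function bestMatch(string $title, array $releases): ?string
{
    $best = null;
    $bestScore = -1.0;
    foreach ($releases as $release) {
        $score = prefixSimilarity($title, $release);
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $release;
        }
    }
    return $best;
}

$releases = [
    'Super Troopers (2001) 720P-AC3-x264',
    'Everest - 2015.1080p.DTS mkv',
    'Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD',
];

echo bestMatch('Star Wars: Episode I The Phantom Menace', $releases), PHP_EOL;
```

Truncating first keeps the trailing codec and resolution text from dragging the score down, which is the main thing that hurts similar_text() and levenshtein() on the full strings.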

Or maybe some function that breaks up each word and checks to see how many match in the other string. Does such a function exist?
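As far as I know PHP has no single built-in function for this, but a word-overlap score is short to write by hand. The sketch below is one possible version, not a definitive one: words() and wordOverlap() are hypothetical names, and treating every run of non-alphanumeric characters as a separator is an assumption that suits the examples above.

```php
<?php
// Hypothetical helper: lowercase a string and split it into unique
// alphanumeric words, dropping hyphens, colons and other punctuation.
function words(string $text): array
{
    $parts = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($parts));
}

// Fraction of the title's words that also occur in the candidate string.
function wordOverlap(string $title, string $candidate): float
{
    $titleWords = words($title);
    if ($titleWords === []) {
        return 0.0;
    }
    $shared = array_intersect($titleWords, words($candidate));
    return count($shared) / count($titleWords);
}

$title = 'Star Wars: Episode I The Phantom Menace';
$candidate = 'Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD';

printf("%.2f\n", wordOverlap($title, $candidate));
```

For the Star Wars pair, six of the seven title words are found in the candidate; only "I" versus "1" fails to match. Mapping Roman numerals to digits before comparing would close that gap, but that is a further assumption about the data.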

I'll mess around with it more, but if anyone has any input on how they might tackle the problem, I'd love to know.

Thanks.

I have an idea for an interesting solution that uses a database. Every time you add a new movie to your collection, you split the movie name into words. For instance:

"Star Wars: Episode I The Phantom Menace"

would be separated into:

"Star", "Wars:", "Episode", "I", "The", "Phantom", "Menace"

From there, you would have the following tables in your database:

CREATE TABLE movie_search (
movie_keyword varchar(255) NOT NULL,
movie_id INT NOT NULL,
-- composite key: the same keyword can belong to more than one movie
PRIMARY KEY (movie_keyword, movie_id)
);

CREATE TABLE movies (
movie_id INT NOT NULL AUTO_INCREMENT,
movie_name varchar(255) NOT NULL,
PRIMARY KEY (movie_id)
);

Example of the movie_search table:

movie_keyword | movie_id
star -------- 1
wars -------- 1
spider ------ 2
man --------- 2

Example of the movies table:

movie_id | movie_name
1 -------- star wars
2 -------- spider man

Every time someone searches for a movie on your website, you break their phrase into words using explode(" ", $searched_name);. You then look up every word in the movie_keyword column of the movie_search table. Each time the same movie_id comes back, you increase the count of keyword matches for that movie. With some good PHP behind the search, your result should be a multidimensional array with 3 elements in each row:

array (
  [0] => array (
    [movie_id] => 1,
    [movie_name] => star wars,
    [count] => 2),
  [1] => array (...),
    ....
)

where the movie with the most keyword matches (highest count) appears at the top of your array. You can get that ordering by sorting on the match count in descending order in your SQL, and you can decide how many results to output by adding "LIMIT 10" to the query.
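Here is a rough sketch of the counting step. To keep it runnable without a database, plain PHP arrays stand in for the movies and movie_search tables; keywords() and searchMovies() are hypothetical names, and stripping punctuation and lowercasing the words is an assumption made so that "Wars:" matches the keyword "wars".

```php
<?php
// Hypothetical helper: split a name into unique lowercase keywords.
function keywords(string $name): array
{
    $parts = preg_split('/[^a-z0-9]+/', strtolower($name), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($parts));
}

// Stand-in for the movies table: movie_id => movie_name.
$movies = [1 => 'star wars', 2 => 'spider man'];

// Stand-in for the movie_search table: keyword => list of movie_ids.
$movieSearch = [];
foreach ($movies as $id => $name) {
    foreach (keywords($name) as $keyword) {
        $movieSearch[$keyword][] = $id;
    }
}

// Count keyword hits per movie and return the best rows first.
function searchMovies(string $searchedName, array $movieSearch, array $movies, int $limit = 10): array
{
    $counts = [];
    foreach (keywords($searchedName) as $keyword) {
        foreach ($movieSearch[$keyword] ?? [] as $id) {
            $counts[$id] = ($counts[$id] ?? 0) + 1;
        }
    }
    arsort($counts); // highest count first, movie_id keys preserved

    $rows = [];
    foreach (array_slice($counts, 0, $limit, true) as $id => $count) {
        $rows[] = ['movie_id' => $id, 'movie_name' => $movies[$id], 'count' => $count];
    }
    return $rows;
}

print_r(searchMovies('Star Wars - Episode 1', $movieSearch, $movies));
```

With real tables, the same counting could be done in one query by grouping the matching movie_search rows by movie_id, which moves the sorting and the limit into the database.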

HOPE THAT HELPS! :)
