Pattern matching in MySQL between the columns of two tables

Question

One quick question first: are preg_match in PHP and LIKE in a MySQL query the same thing?

Main Question:

Consider the following two tables, table1 and table2:

Table 1                                  Table 2

+-------+-------------------------+      +-------+------------------------------+
| ID    | Model                   |      | ID    | Model                        |
+-------+-------------------------+      +-------+------------------------------+
| 1     | iPad 2 WiFi 16GB        |      | 1     | iPad2 WiFi 16GB              |
| 2     | iPhone 4S 16GB          |      | 2     | iPhone4S 16GB                |
| 3     | iPod Touch(4th Gen)8GB  |      | 3     |iPod Touch 4th Generation 8GB |
+-------+-------------------------+      +-------+------------------------------+

What I want to do is compare these two tables. As you can see, iPad 2 WiFi 16GB and iPad2 WiFi 16GB, or iPod Touch(4th Gen)8GB and iPod Touch 4th Generation 8GB, refer to the same product, but they do not show up if I put WHERE Table1.model = Table2.model in my query, because they are not an exact match. I want to compare these rows in a MySQL query, using LIKE or any other way, so that rows from the two tables that are alike get matched. Kindly let me know how to write such a SQL query.

I tried the following SQL query, but it did not return all the rows; in particular, it did not return the kinds of rows mentioned in the example above.

SELECT table1.model as model1, table2.model as model2
FROM table1,table2 WHERE table1.model REGEXP table2.model

Answer 1

Two questions - are the descriptions standard (descriptions don't change) or are they entered by a user? If they're standard, add a column that is an integer and do comparison on this column.

If it's entered by the user, your work is more complicated, because you're looking for something closer to a fuzzy search. I used a bi-gram search algorithm to rank similarity between two strings, but this can't be done directly in MySQL.

In lieu of a fuzzy search, you could use LIKE, but its efficiency is limited: it falls back to table scans if you end up putting the '%' at the beginning of the search term. It also implies you can get a match on the substring portion you choose, meaning you'd need to know the substring ahead of time.

I'd be happy to elaborate more once I know what you're trying to do.

EDIT1: Ok, given your elaboration, you will need to do a fuzzy-style search as I mentioned. I use a bi-gram method, which involves taking each entry made by the user and splitting it into chunks of 2 or 3 characters. I then store each of these chunks in another table, with each entry keyed back to the actual description.

Example:

Description1: "A fast run forward"
Description2: "A short run forward"

If you break each into 2 char chunks - 'A ', ' f', 'fa', 'as','st'.....

Then you can compare the number of 2 char chunks that match both strings and get a "score" which will connote accuracy or similarity between the two.
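As a rough illustration of that scoring idea, here is a minimal PHP sketch that counts the distinct 2-character chunks shared by the two example descriptions. The function names bigram_chunks and shared_bigram_count are made up for this example, and counting distinct chunks is only one possible way to define the score.

```php
<?php
// Hypothetical helper: lowercase the text and cut it into overlapping 2-character chunks.
function bigram_chunks(string $text): array
{
    $s = strtolower($text);
    $chunks = [];
    for ($i = 0; $i < strlen($s) - 1; $i++) {
        $chunks[] = substr($s, $i, 2);
    }
    return $chunks;
}

// Hypothetical score: the number of distinct chunks that occur in both strings.
function shared_bigram_count(string $a, string $b): int
{
    $setA = array_unique(bigram_chunks($a));
    $setB = array_unique(bigram_chunks($b));
    return count(array_intersect($setA, $setB));
}

echo shared_bigram_count('A fast run forward', 'A short run forward'), "\n";
```

A higher count suggests the two strings are more alike; two strings with no chunk in common score 0.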

Given I don't know what development language you're using, I'll leave the implementation out, but this is something that will need to be done outside MySQL.

Or the lazy alternative would be to use a cloud search service like the one Amazon has, which will provide search based on terms you give it...not sure if they allow you to continuously add new descriptions to consider though, and depending on your application, it can be a bit costly (IMHO).

R

For another SO post on the bigram implementation - see this SO bigram / fuzzy search

--- Update per questioner elaboration---

First, I'm assuming you read the theory in the links I provided. Second, I'll try to keep it as DB-agnostic as possible, since it doesn't need MySQL (though I use it, and it works more than fine).

Ok, so the bigram method works ok for making and comparing in-memory arrays only if the set of possible matches is relatively small; otherwise it quickly degrades to table-scan performance, like a MySQL table without indexes. So, you're going to use the database's strengths to do the indexing for you.

What you need is one table to hold the user-entered "terms" or text that you're looking to compare. The simplest form is a table with two columns: one is a unique auto-increment integer, which will be indexed and which we'll call hd_id below; the second is a varchar(255) if the strings are pretty short, or TEXT if they can get long - you can name this whatever you want.

Then, you'll need to make another table that has at least THREE columns - one for the reference column back to the other table's auto-incremented column (we'll call this hd_id below), the second would be a varchar() of say 5 chars at most (this will hold your bigram chunks) which we'll call "bigram" below, and the third an auto-incrementing column called b_id below. This table will hold all the bigrams for each user's entry and tie back to the overall entry. You'll want to index the varchar column by itself (or first in order in a compound index).

Now, every time a user enters a term you want to search, you need to enter the term in the first table, then dissect the term into bigrams and enter each chunk into the second table, using the reference back to the overall term in the first table to complete the relationship. This way, you're doing the dissection in PHP, but letting MySQL or whatever database you use do the index optimization for you. It may help in the bigram phase to store the number of bigrams made in table 1 for the calculation phase. Below is some code in PHP to give you an idea of how to create the bigrams:

// Split the string into overlapping $len-character chunks, one chunk per array slot.
function get_bigrams($theString, $len)
{
    $s = strtolower($theString);
    $v = array();
    // Stop $len-1 characters short of the end so the last chunks are not shorter than $len.
    $slength = strlen($s) - ($len - 1);

    for ($m = 0; $m < $slength; $m++)
    {
        $v[] = substr($s, $m, $len);
    }
    return $v;
}

Don't worry about spaces in the strings - they're actually really helpful if you think about fuzzy search.
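To make the storage step described earlier more concrete, here is a small sketch of how one term could be turned into rows for the bigram table. The function name bigram_rows_for_term is hypothetical, the hd_id value is assumed to come from the insert into the first table, and the b_id column is left to the database's auto-increment.

```php
<?php
// Hypothetical helper: build the rows that would go into the bigram table
// for one term already stored in the first table under $hdId.
function bigram_rows_for_term(int $hdId, string $term, int $len = 2): array
{
    $s = strtolower($term);
    $rows = [];
    $last = strlen($s) - $len;   // last offset that still yields a full-length chunk
    for ($i = 0; $i <= $last; $i++) {
        $rows[] = ['hd_id' => $hdId, 'bigram' => substr($s, $i, $len)];
    }
    return $rows;
}

// Example: the term with hd_id 1 is "iPad 2".
$rows = bigram_rows_for_term(1, 'iPad 2');
foreach ($rows as $row) {
    printf("%d\t[%s]\n", $row['hd_id'], $row['bigram']);
}
```

The value of count($rows) is the bigram total that could be stored alongside the term in the first table for later scoring.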

So you get the bigrams, enter them in a table, linked to the overall text in table 1 via an indexed column...now what?

Now whenever you search for a term such as "My favorite term to search for" - you can use the php function to turn it into an array of bigrams. You then use this to create the IN (..) part of a SQL statement on your bigram table(2). Below is an example:

select count(b_id) as matches, a.hd_id, description from table2 a
inner join table1 b on (a.hd_id=b.hd_id)
where bigram in (" . $sqlstr . ")
group by a.hd_id, description order by matches desc limit X

I've left the $sqlstr as a PHP string reference - you could construct this yourself as a comma separated list from the bigram function using implode or whatever on the array returned from get_bigrams or parameterize if you like too.
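One way the parameterized variant might look is sketched below. The table and column names follow the description in this answer; the function name build_bigram_query and the $limit parameter are assumptions made for the example, and the commented PDO lines assume you already have a connection in $pdo.

```php
<?php
// Hypothetical helper: build the SQL text plus the values to bind for the IN (...) lookup.
function build_bigram_query(array $bigrams, int $limit): array
{
    // Duplicates add nothing inside IN (...), so drop them and reindex.
    $bigrams = array_values(array_unique($bigrams));
    if ($bigrams === []) {
        throw new InvalidArgumentException('At least one bigram is required.');
    }
    // One "?" placeholder per bigram, e.g. "?,?,?".
    $placeholders = implode(',', array_fill(0, count($bigrams), '?'));
    $sql = "select count(b_id) as matches, a.hd_id, description from table2 a\n"
         . "inner join table1 b on (a.hd_id = b.hd_id)\n"
         . "where bigram in ($placeholders)\n"
         . "group by a.hd_id, description order by matches desc limit " . $limit;
    return [$sql, $bigrams];
}

[$sql, $params] = build_bigram_query(['my', 'y ', ' f'], 10);
echo $sql, "\n";
// With an existing PDO connection (assumed):
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

Binding the chunks as parameters avoids having to quote and escape each one by hand when building the list.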

If done correctly, the query above returns the most closely matched fuzzy search terms depending on the length of the bigram you chose. The length you choose has a relative efficacy based on your expected length of the overall search strings.

Lastly - the query above just gives a fuzzy match rank. You can play around with it and enhance it by comparing not just matches, but matches vs. overall bigram count, which will help de-bias long search strings compared to short strings. I've stopped here because at this juncture it becomes much more application specific.
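As one possible reading of "matches vs. overall bigram count", the sketch below re-ranks result rows by the share of each stored term's bigrams that matched. The sample rows are invented for illustration, and bigram_count is an assumed extra column holding the number of bigrams stored for each term; the function name rank_by_match_ratio is also hypothetical.

```php
<?php
// Hypothetical helper: add a ratio (matches / stored bigram count) and sort by it, best first.
function rank_by_match_ratio(array $rows): array
{
    foreach ($rows as &$row) {
        $row['ratio'] = $row['bigram_count'] > 0
            ? $row['matches'] / $row['bigram_count']
            : 0.0;
    }
    unset($row); // break the reference left over from the loop
    usort($rows, fn($x, $y) => $y['ratio'] <=> $x['ratio']);
    return $rows;
}

// Invented sample rows, shaped like the ranking query's output plus bigram_count.
$rows = [
    ['hd_id' => 1, 'matches' => 12, 'bigram_count' => 40],
    ['hd_id' => 2, 'matches' => 10, 'bigram_count' => 11],
];

foreach (rank_by_match_ratio($rows) as $row) {
    printf("%d\t%.2f\n", $row['hd_id'], $row['ratio']);
}
```

Here the long term with 12 raw matches drops below the short term with 10, because only a small share of its bigrams matched.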

Hope this helps!

R

solution1
1 2012-11-16 13:13:02
