
MySQL regex to find different strings inside a text

I have a site where it is possible for the users to create sub-sites with their own content. It is something like wix.com. It is possible to create links in the content and some users are abusing this functionality to link to malware sites.

The users' content is stored in a MySQL database, in a table called pages, in the column content.

I would like to find every page whose content has strings that begin with "http" but do not contain one of my two domains (let's say they are mysite.com and another.com). This would help because almost every page contains links to these two sites, but very few contain links to other sites.

For example: I would like to catch http://badsite.com but I would not want to catch http://subdomain.mysite.com/page1 or http://name.another.com/?page=products

Also, I would like to catch http://badsite.com even if the text also contains a link to one of my domains (for example http://sub.mysite.com/). For this reason, the query below would not work:

select * 
from pages
where content like '%http%'
  and content not like '%mysite.com%'
  and content not like '%another.com%'

Example of text that I would like to catch:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, <a href="http://sub.mysite.com/">sed</a> do eiusmod <a href="http://badsite.com">tempor</a> incididunt ut labore et dolore magna aliqua.

Example of text that I would not like to catch:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, <a href="http://sub.mysite.com/">sed</a> do eiusmod <a href="http://prefix.another.com/page2">tempor</a> incididunt ut labore et dolore magna aliqua.

In short, I'd like to find all pages that link to any domain other than mysite.com or another.com.

I think that I will have to use regex for this, but I don't know how to do it.

See the regular expressions section of the MySQL manual: https://dev.mysql.com/doc/refman/5.7/en/regexp.html

To combine conditions, use parentheses with the logical operators. They let you express whatever you need, for example:

(cond1 OR cond2) AND NOT cond3 AND cond4 ...
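
Note that the regex engine in MySQL 5.7 has no lookahead, so "http not followed by one of my domains" is awkward to express in a single REGEXP. One alternative is to fetch the rows and do the check in application code. Below is a minimal sketch in Python, assuming the rows come back as (id, content) pairs. ALLOWED_DOMAINS, foreign_hosts and flag_pages are hypothetical names used for illustration only, and the URL pattern is a simplification, not a full URL parser.

```python
import re

# Hypothetical allow-list; replace with your own domains.
ALLOWED_DOMAINS = ("mysite.com", "another.com")

# Capture the host part of every http:// or https:// URL.
# The host ends at the first slash, quote, whitespace, or similar delimiter.
URL_HOST = re.compile(r"https?://([^/\s\"'<>?#:]+)", re.IGNORECASE)


def is_allowed(host):
    """True if host is an allowed domain or a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def foreign_hosts(content):
    """Return the hosts linked from content that are not on the allow-list."""
    return [h for h in URL_HOST.findall(content) if not is_allowed(h)]


def flag_pages(rows):
    """Given (id, content) pairs, return (id, hosts) for pages with foreign links."""
    flagged = []
    for page_id, content in rows:
        hosts = foreign_hosts(content)
        if hosts:
            flagged.append((page_id, hosts))
    return flagged
```

You could narrow the rows first with where content like '%http%' and pass the result to flag_pages. Because the check runs on each link's host rather than on the whole text, a page is flagged even if it also links to one of your domains, and a host such as mysite.com.evil.net is flagged too.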
