
MySql regex to find different strings inside a text

I have a site where users can create sub-sites with their own content, something like wix.com. The content can include links, and some users are abusing this functionality to link to malware sites.

The users' content is stored in a MySQL database, in the content column of a table called pages.

I would like to find every row whose content has strings that begin with "http" but do not contain one of my two domains (let's say they are mysite.com and another.com). This would help because almost every page contains links to these two sites, but very few contain links to other sites.

For example: I would like to catch http://badsite.com, but I would not want to catch http://subdomain.mysite.com/page1 or http://name.another.com/?page=products

Also, I would like to catch http://badsite.com even if the text also contains a link to one of my domains (for example http://sub.mysite.com/). For this reason, the query below would not work:

select * 
from pages
where content like '%http%'
  and content not like '%mysite.com%'
  and content not like '%another.com%'

Example of text that I would like to catch:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, <a href="http://sub.mysite.com/">sed</a> do eiusmod <a href="http://badsite.com">tempor</a> incididunt ut labore et dolore magna aliqua.

Example of text that I would not like to catch:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, <a href="http://sub.mysite.com/">sed</a> do eiusmod <a href="http://prefix.another.com/page2">tempor</a> incididunt ut labore et dolore magna aliqua.

In short, I'd like to find all pages that link to any domain other than mysite.com or another.com.

I think that I will have to use regex for this, but I don't know how to do it.

Check this section: https://dev.mysql.com/doc/refman/5.7/en/regexp.html

As for combining conditions, consider parentheses with logical operators; they will help you express whatever you want, like:

(cond1 OR cond2) AND NOT cond3 AND cond4 ... et cetera
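Combining LIKE conditions alone cannot tell the links in one text apart, so a per-link check is probably needed. One way to express it is a regex with a negative lookahead: match "http://" or "https://" only when the host that follows is not one of the allowed domains or a subdomain of them. Below is a minimal sketch of that idea in Python, only to show the pattern's logic. The names ALLOWED, FOREIGN_LINK and links_elsewhere are made up for this example. As far as I know, the REGEXP in MySQL 5.7 (the page linked above) does not support lookahead, while MySQL 8.0 switched to the ICU regex library, which does, so a similar pattern might be usable there; treat that as something to verify on your own server.

```python
import re

# Hypothetical allow-list, taken from the example domains in the question.
ALLOWED = ("mysite.com", "another.com")

# Build "mysite\.com|another\.com" with the dots escaped.
_allowed_alt = "|".join(re.escape(domain) for domain in ALLOWED)

# Match "http://" or "https://" unless the host that follows is an allowed
# domain or one of its subdomains. The inner (?![a-z0-9.-]) requires the
# allowed name to end the host, so "mysite.com.evil.example" is still reported.
FOREIGN_LINK = re.compile(
    r"https?://(?!(?:[a-z0-9-]+\.)*(?:" + _allowed_alt + r")(?![a-z0-9.-]))",
    re.IGNORECASE,
)


def links_elsewhere(content: str) -> bool:
    """Return True if content has at least one link outside the allow-list."""
    return FOREIGN_LINK.search(content) is not None


if __name__ == "__main__":
    catch = (
        'elit, <a href="http://sub.mysite.com/">sed</a> do eiusmod '
        '<a href="http://badsite.com">tempor</a> incididunt'
    )
    skip = (
        'elit, <a href="http://sub.mysite.com/">sed</a> do eiusmod '
        '<a href="http://prefix.another.com/page2">tempor</a> incididunt'
    )
    print(links_elsewhere(catch))
    print(links_elsewhere(skip))
```

With the two example texts from the question, the first call reports a foreign link and the second does not. The check is per link, so a text that mixes an allowed link with http://badsite.com is still caught, which is the case the NOT LIKE query misses.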
