
SQL Server: Select records which have any HTML entities within a VARCHAR(MAX) column

I have a table MyTable(id INT, stringText VARCHAR(MAX)) with over 2 million records. I would like to write a query that counts the rows whose text contains any of the following characters (HTML entities).

 
 

fi
fl
’
 
–
—
’
“
•
€
‚
ƒ
„
…
†
‡
ˆ
‰
Š
‹
Œ
Ž
‘
’
“
”
•
–
—
˜
™
š
›
œ
ž
Ÿ
¡
¢
£
¤
¥
¦
§
¨
©
ª
«
¬
®
¯
°
±
²
³
´
µ
¶
·
¸
¹
º
»
¼
½
¾
¿
À
Á
Â
Ã
Ä
Å
Æ
Ç
È
É
Ê
Ë
Ì
Í
Î
Ï
Ð
Ñ
Ò
Ó
Ô
Õ
Ö
×
Ø
Ù
Ú
Û
Ü
Ý
Þ
ß
à
á
â
ã
ä
å
æ
ç
è
é
ê
ë
ì

Could someone please help me write an efficient WHERE clause to count these rows?

I tried something like the query below, but it doesn't give me the expected results.

DECLARE @testStr AS VARCHAR(MAX) = 'testing - quote chars and others '+ '"' + ' '+ ' ' + '' + '- testing'
DECLARE @temp TABLE (string VARCHAR(MAX));
INSERT INTO @temp (string)
VALUES ('testing - plain text'),
       (@testStr),
       ('testing' + CHAR(1) + CHAR(2) + CHAR(3) + CHAR(4) + ' testing 1-4'),
       ('sathish' + CHAR(1) + ' testing - char 1'),
       ('sathish' + CHAR(3) + CHAR(4) + ' testing - char 3-4')

SELECT * FROM @temp WHERE string LIKE '%[' + CHAR(1) + CHAR(2) + CHAR(3) + CHAR(4) + ']%' /* this where clause works fine, i.e. only returns the rows with any of those characters*/
SELECT * FROM @temp WHERE string LIKE '%[' + '"' + ' ' + ' ' + '' + ']%' /* this where clause doesn't work as expected, it is returning all rows*/

I assume the WHERE clause in my second query didn't work because each entity inside the single quotes is more than one character long (a string rather than a single character), whereas a [...] set in LIKE matches only one character at a time.

Thank you in advance.

Notes:

  1. The data is already in the database (please don't ask why this wasn't handled before saving it to the database), and unfortunately I cannot use SQL CLR functions.

  2. I would like to avoid multiple OR clauses, something like below:
 SELECT * FROM @temp WHERE string LIKE '%"' OR string LIKE '% %' OR string LIKE '% %' OR string LIKE '5%' -- and so on 

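You can extract the first '&...;' substring from each row and check it against your entities, listed as string elements in the IN part of the WHERE clause:

SELECT COUNT(*) FROM @temp
WHERE SUBSTRING(
   string,
   PATINDEX('%&%', string),
   PATINDEX('%;%', string) - PATINDEX('%&%', string) + 1
) IN ('&#201;', '&#202;', '&#203;', '&#204;', '&#205;'
      /* , '...', '...' and so on: one encoded entity per element */)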

Alternatively, use a dedicated entities table into which you insert all your entities. Your IN part would then look like this:

IN (SELECT entities FROM [entities-table])
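
A table-driven check might look like the following sketch. The table name dbo.HtmlEntity, its entity column, and the three sample values are assumptions made for illustration; MyTable and stringText are taken from the question. It uses CHARINDEX rather than SUBSTRING so that an entity anywhere in the text counts, not only the one that starts at the first '&'.

```sql
-- Hypothetical lookup table (name and column are assumptions for this sketch).
CREATE TABLE dbo.HtmlEntity (entity VARCHAR(16) NOT NULL);

-- Sample values only; load the full list of encoded entities here.
INSERT INTO dbo.HtmlEntity (entity)
VALUES ('&quot;'), ('&nbsp;'), ('&#201;');

-- Count the rows that contain at least one listed entity anywhere in the text.
SELECT COUNT(*) AS rows_with_entities
FROM dbo.MyTable AS t
WHERE EXISTS (
    SELECT 1
    FROM dbo.HtmlEntity AS e
    WHERE CHARINDEX(e.entity, t.stringText) > 0  -- plain substring search, no LIKE wildcards
);
```

This still has to read every row of the table, so it is unlikely to be fast on 2 million rows, but it keeps the entity list in one place instead of in a long chain of predicates. Under a case-insensitive collation, entities that differ only in case (for example '&Eacute;' and '&eacute;') compare equal; that does not change the count, but it matters if you later report which entity matched.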

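I have no way to test this, so I can only propose a possible improvement: search for the ';' only in the part of the string that starts at the '&' that was found, so that the ';' is guaranteed to come after it.

SELECT COUNT(*) FROM @temp
WHERE SUBSTRING(
   string,
   PATINDEX('%&%', string),
   -- length of '&...;' = position of the first ';' counted from the '&'
   PATINDEX('%;%', SUBSTRING(string, PATINDEX('%&%', string), LEN(string)))
) IN ('&#201;', '&#202;', '&#203;', '&#204;', '&#205;'
      /* , '...', '...' and so on */)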

What you need to do is use OR in your WHERE clause, with one LIKE predicate per entity, like this:

SELECT * FROM @temp 
WHERE string LIKE '%"%' 
OR string LIKE '% %' 
OR string LIKE '% %' -- etc.
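
To see why one predicate per entity is needed, here is a minimal sketch comparing a bracket set with separate LIKE predicates. The table variable @demo and the entities '&quot;' and '&nbsp;' are illustrative choices for this example, not values confirmed by the question.

```sql
-- Sample data; '&quot;' and '&nbsp;' are illustrative entities chosen for this sketch.
DECLARE @demo TABLE (string VARCHAR(MAX));
INSERT INTO @demo (string)
VALUES ('plain text'),
       ('has &quot; in it'),
       ('has &nbsp; in it');

-- Bracket set: matches any ONE of the characters & q u o t ; n b s p,
-- so 'plain text' qualifies as well (it contains 'p' and 't').
SELECT COUNT(*) AS bracket_set_matches   -- 3
FROM @demo
WHERE string LIKE '%[&quot;&nbsp;]%';

-- One predicate per entity: each LIKE looks for the whole entity string.
SELECT COUNT(*) AS per_entity_matches    -- 2
FROM @demo
WHERE string LIKE '%&quot;%'
   OR string LIKE '%&nbsp;%';
```

The first query counts all three rows because the set is treated as a bag of single characters; the second counts only the two rows that actually contain an entity.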
