Count specific characters in a column

Question

I have a table with a list of titles. I am trying to figure out a way of creating a substring query that will let me count the number of times that a particular character occurs in the entire column. Such as, how many times does the letter 'A' occur? I am thinking of the substring since I want to know the count for letters A - I.

I need a new table that shows the substring letters (say AZ) and next to them the total number of times that letter occurs in the entire column (not just in each row).

Answer 1

For the basic ASCII letters like AZ (as mentioned) and a (typical) UTF-8 or LATIN* encoding (or most others):

SELECT chr(c) AS letter
     , sum(octet_length(col)
         - octet_length(translate(col, chr(c), ''))) AS total_count
FROM   generate_series (ascii('A'), ascii('Z')) c
CROSS  JOIN tbl
GROUP  BY 1;

translate() works for single-character replacements and is a bit faster than replace() - which you would use looking for multi-character strings.

In (typical) UTF-8 or LATIN* encoding, basic ASCII letters are represented with a single byte. This allows the faster function octet_length() . To count characters encoded with more bytes, use length() instead, which counts characters instead of bytes.

Also, we can conveniently generate a range of letters like AZ with generate_series() , because their byte-representation lines up in a continuous range in the mentioned encodings. Convert to integer with ascii() and back with chr() .

Then CROSS JOIN to your table ( tbl ), measure the difference between original length and after removing the letter of interest, and sum.

But while counting many of the characters in your strings, this alternative approach is probably much faster :

SELECT letter, count(*) AS total_count
FROM   tbl, unnest(string_to_array(col, NULL)) letter
WHERE  ascii(letter) BETWEEN ascii('A') AND ascii('Z')
GROUP  BY 1;

To count case-insensitive, throw in lower() or upper() :

FROM   tbl, unnest(string_to_array(upper(col), NULL)) letter

To check for multiple non-continuous ranges of characters:

WHERE  letter ~ '^[a-zA-Z]$'  -- a-z and A-Z separately (case-sensitive)

Or a random selection:

WHERE  'abcXYZ' ~ letter

string_to_array() with separator NULL splits the string into an array of single characters, unnest() (using implicit CROSS JOIN LATERAL ), filter the ones of interest (again, using their byte-representation to make it fast. Then simply count per character.

Related:

Count specific characters in a column

Question

1 answers

solution1
0 2020-04-27 02:31:19

Count specific characters in a column

Question

1 answers

solution1 0 2020-04-27 02:31:19

solution1
0 2020-04-27 02:31:19