Postgres and Word Clouds

I would like to know if it's possible to create a Postgres function that scans some table rows and creates a table containing WORD and AMOUNT (frequency). My goal is to use this table to create a word cloud.

There is a simple way, but it can be slow (depending on your table size). You can split your text into an array:

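-- my_table and its text column words are placeholder names
-- (table itself is a reserved word and cannot be used unquoted)
SELECT string_to_array(lower(words), ' ') FROM my_table;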

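With those arrays, you can use unnest to expand them into one row per word, and then count the rows for each word:

-- my_table and its text column words are placeholder names
WITH words AS (
    SELECT unnest(string_to_array(lower(words), ' ')) AS word
    FROM my_table
)
SELECT word, count(*) FROM words
GROUP BY word;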

This is a simple way of doing it, but it has some issues. For example, it only splits words on spaces, not on punctuation marks.
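
One way to also split on punctuation is regexp_split_to_table with a pattern that matches runs of non-alphanumeric characters. This is only a minimal sketch: my_table and its text column words are placeholder names, and the output column name amount is taken from the question.

```sql
-- my_table / words are placeholder names, not from a real schema
WITH tokens AS (
    -- split on any run of characters that are not letters or digits
    SELECT regexp_split_to_table(lower(words), '[^[:alnum:]]+') AS word
    FROM my_table
)
SELECT word, count(*) AS amount
FROM tokens
WHERE word <> ''          -- leading/trailing punctuation yields empty strings
GROUP BY word
ORDER BY amount DESC;
```

Note that this pattern also splits on apostrophes and hyphens, so "don't" is counted as "don" and "t". Whether that is acceptable depends on your data.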

Another, and probably better, option is to use PostgreSQL full text search.

Late to the party, but I also needed this and wanted to use full text search, which conveniently removes HTML tags.

So basically you convert the text to a tsvector and then use ts_stat:

select word, nentry 
from ts_stat($q$ 
    select to_tsvector('simple', '<div id="main">a b c <b>b c</b></div>') 
$q$)
order by nentry desc

Result:

|word|nentry|
|----|------|
|c   |2     |
|b   |2     |
|a   |1     |

But this does not scale well, since the text is converted to a tsvector again on every query, so here is what I ended up with:

Setup:

-- table with a gist index on the tsvector column
create table wordcloud_data (
    html text not null,
    tsv tsvector not null
);
create index on wordcloud_data using gist (tsv);

-- trigger to update the tsvector column
-- (on PostgreSQL 10 and older, write "execute procedure" instead of "execute function")
create trigger wordcloud_data_tsvupdate
    before insert or update on wordcloud_data
    for each row execute function tsvector_update_trigger(tsv, 'pg_catalog.simple', html);

-- a view for the wordcloud
create view wordcloud as
    select word, nentry
    from ts_stat('select tsv from wordcloud_data')
    order by nentry desc;

Usage:

-- insert some data
insert into wordcloud_data (html) values 
    ('<div id="id1">aaa</div> <b>bbb</b> <i attribute="ignored">ccc</i>'), 
    ('<div class="class1"><span>bbb</span> <strong>ccc</strong> <pre>ddd</pre></div>');

After that, your wordcloud view should look like this:

|word|nentry|
|----|------|
|ccc |2     |
|bbb |2     |
|ddd |1     |
|aaa |1     |
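
A word cloud usually only needs the most frequent words, so it may be enough to read the top of the view. A small sketch, where the limit of 50 is an arbitrary assumption:

```sql
-- fetch the 50 most frequent words from the wordcloud view defined above;
-- the explicit ORDER BY avoids relying on the ordering inside the view
select word, nentry
from wordcloud
order by nentry desc, word
limit 50;
```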

Bonus features:
Replace the simple text search configuration with, for example, english, and Postgres will strip out stop words and do stemming for you.
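
As a quick illustration of that difference, here is the earlier ts_stat query with the english configuration. The sample sentence is made up for this sketch.

```sql
-- with 'english', stop words such as "the" and "are" are dropped and
-- inflected forms such as "cats" and "running" are reduced to a common stem
select word, nentry
from ts_stat($q$
    select to_tsvector('english', 'The cats are running and the cat runs')
$q$)
order by nentry desc;
```

In the setup above, the equivalent change would be to pass 'pg_catalog.english' to tsvector_update_trigger. Rows inserted before the change keep their old tsvector until they are updated again, because the trigger only runs on insert or update.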
