
SQL: count the number of words in a field

I'd like to write an SQL query where the condition is that column1 contains three or more words. Is there a way to do that?

Maybe try counting spaces?

SELECT * 
FROM table
WHERE (LENGTH(column1) - LENGTH(replace(column1, ' ', ''))) > 1

This assumes the number of words is the number of spaces + 1, so more than one space means three or more words.
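To try the space-counting condition without a database server, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The table name `demo` and the sample rows are made up for illustration; only the WHERE clause comes from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (column1 TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO demo (column1) VALUES (?)",
    [("one",), ("one two",), ("one two three",), ("one two three four",)],
)

# More than one space means three or more words (words = spaces + 1).
query = """
    SELECT column1
    FROM demo
    WHERE (LENGTH(column1) - LENGTH(REPLACE(column1, ' ', ''))) > 1
    ORDER BY column1
"""
three_plus = [row[0] for row in conn.execute(query)]
print(three_plus)
```

Only the rows with at least two spaces come back. The count is only as good as the assumption that words are separated by exactly one space.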

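If you want a condition that a column contains three or more words, you want it to work across a bunch of databases, and we assume that words are separated by single spaces, then you can use like: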

where column1 like '% % %'
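A quick way to see what this pattern accepts is to evaluate it on a few literal values. The sketch below uses SQLite through Python purely as a convenient test bench; the helper `matches` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def matches(text):
    # Evaluate the LIKE condition for a single value.
    row = conn.execute("SELECT ? LIKE '% % %'", (text,)).fetchone()
    return bool(row[0])

two_words = matches("one two")            # one space
three_words = matches("one two three")    # two spaces
double_spaced = matches("one  two")       # two words, double space
print(two_words, three_words, double_spaced)
```

The last value shows why the single-space assumption matters: two words separated by a double space also satisfy the pattern, because `%` can match an empty string.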

In Postgres you can use regexp_split_to_array() for this:

select *
from the_table
where array_length(regexp_split_to_array(the_column, '\s+'), 1) >= 3;

This will split the contents of the column the_column into array elements. One or more whitespace characters are used as the delimiter. It won't respect "quoted" spaces though. The value 'one "two three" four' will be counted as four words.
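The splitting behaviour can be illustrated outside the database. The sketch below uses Python's `re.split` as a stand-in for the `\s+` split; it is an analogy for the idea, not a reproduction of the Postgres function, and `split_words` is a hypothetical helper.

```python
import re

def split_words(text):
    # Split on runs of one or more whitespace characters.
    return re.split(r"\s+", text)

quoted = split_words('one "two three" four')
padded = split_words(" one two")
print(quoted, len(quoted))
print(padded, len(padded))
```

The quoted phrase is split like any other text, giving four elements. In Python, leading whitespace also produces an empty first element, so it may be worth trimming the value before splitting and checking how your database handles that case.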

The best way to do this is to NOT do this.

Instead, you should use the application layer to count the words during INSERT and save the word count into its own column.

While I like, and upvoted, some of the answers here, all of them will be very slow and not 100% accurate.

I know people want a simple answer to SELECT the word count, but it just is NOT POSSIBLE with both accuracy and speed.

If you want it to be 100% accurate and very fast, then use this solution.

Steps to solve:

  1. Add a column to your table and index it (MySQL syntax): ALTER TABLE tablename ADD COLUMN wordcount INT UNSIGNED NULL, ADD INDEX idxtablename_count (wordcount ASC);
  2. Before doing your INSERT, count the number of words using your application. For example in PHP: $count = str_word_count($somevalue);
  3. During the INSERT, include the value of $count for the column wordcount, like: insert into tablename (col1, col2, col3, wordcount) values (val1, val2, val3, $count);

Then your select statement becomes super easy, clean, uber-fast, and 100% accurate.

select * from tablename where wordcount >= 3;

Also remember that whenever you update a row, you will need to recount the words for that column.
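The stored-count approach can be sketched end to end as follows. This is an illustration under assumptions, not the answer's exact setup: it uses SQLite from Python instead of MySQL and PHP, a single text column `col1`, and hypothetical helpers `count_words`, `insert_row`, and `update_row`. The count is passed as a bound parameter rather than pasted into the SQL string.

```python
import sqlite3

def count_words(text):
    # Application-layer word count: split on any run of whitespace.
    return len(text.split())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, col1 TEXT, wordcount INTEGER)"
)
conn.execute("CREATE INDEX idxtablename_count ON tablename (wordcount)")

def insert_row(text):
    # Count the words before the INSERT and store the result with the row.
    cur = conn.execute(
        "INSERT INTO tablename (col1, wordcount) VALUES (?, ?)",
        (text, count_words(text)),
    )
    return cur.lastrowid

def update_row(row_id, text):
    # Recount on every UPDATE so the stored value never goes stale.
    conn.execute(
        "UPDATE tablename SET col1 = ?, wordcount = ? WHERE id = ?",
        (text, count_words(text), row_id),
    )

first = insert_row("just two")
insert_row("  three   spaced    words ")
update_row(first, "now it has five words")

rows = [
    r[0]
    for r in conn.execute(
        "SELECT col1 FROM tablename WHERE wordcount >= 3 ORDER BY id"
    )
]
print(rows)
```

Both rows are returned: the first one only qualifies after the update recounts it, and the second one is counted correctly despite its irregular spacing because the counting happens in application code.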

With ClickHouse, you can use the splitByWhitespace() function.

Refer: https://clickhouse.com/docs/en/sql-reference/functions/splitting-merging-functions#splitbywhitespaces

This can work:

SUM(LENGTH(a) - LENGTH(REPLACE(a, ' ', '')) + 1)

Where a is the string column. The inner expression counts the number of spaces, which is 1 less than the number of words, hence the + 1. SUM then adds up the per-row word counts.

For "n" or more words:

select *
from table
where (length(column) - length(replace(column, ' ', '')) + 1) >= n

PS: This will not work if there are multiple spaces between words.

To handle multiple spaces too, use the method shown here (SQL Server syntax):

Declare @s varchar(100)
set @s='  See      how many                        words this      has  '
set @s=ltrim(rtrim(@s))

while charindex('  ',@s)>0
Begin
    set @s=replace(@s,'  ',' ')
end

select len(@s)-len(replace(@s,' ',''))+1 as word_count

https://exploresql.com/2018/07/31/how-to-count-number-of-words-in-a-sentence/
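For readers who want to trace the logic without a SQL Server instance, here is the same loop mirrored in Python. It is a sketch of the idea only (trim, collapse double spaces until none remain, then count spaces + 1); the function name `word_count` is made up.

```python
def word_count(s):
    # Trim leading and trailing spaces, like ltrim(rtrim(@s)).
    s = s.strip(" ")
    # Collapse double spaces until only single spaces remain.
    while "  " in s:
        s = s.replace("  ", " ")
    # Words = remaining spaces + 1.
    return len(s) - len(s.replace(" ", "")) + 1

sample = "  See      how many                        words this      has  "
print(word_count(sample))
```

One edge case to keep in mind: for an empty or all-space string the formula still yields 1, so such values may need a separate check.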

I think David nailed it above. However, as a more complete answer:

LENGTH(RTRIM(LTRIM(REPLACE(column1,'  ', ' ')))) - LENGTH(REPLACE(RTRIM(LTRIM(REPLACE(column1, '  ', ' '))), ' ', '')) + 1 AS number_of_words

This will remove double spaces, as well as leading and trailing spaces in your string.

Of course, you may go further by adding replacements for runs of more than 2 spaces.
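The limit of a single REPLACE pass is easy to check. The sketch below evaluates the same expression in an in-memory SQLite database from Python; the named parameter `:t` stands in for column1, and the helper `number_of_words` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same expression as above, with :t in place of column1.
EXPR = """
    SELECT LENGTH(RTRIM(LTRIM(REPLACE(:t, '  ', ' '))))
         - LENGTH(REPLACE(RTRIM(LTRIM(REPLACE(:t, '  ', ' '))), ' ', ''))
         + 1
"""

def number_of_words(text):
    return conn.execute(EXPR, {"t": text}).fetchone()[0]

padded_double = number_of_words(" one  two ")  # double space plus padding
triple = number_of_words("one   two")          # three spaces in a row
print(padded_double, triple)
```

The double-spaced value is counted correctly as 2, but three spaces in a row still leave a double space after one pass, so the second value comes out as 3 instead of 2.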

None of the other answers seem to take multiple spaces into account. For example, a lot of people use two spaces between sentences; these space-counters would count an extra word per sentence. The same goes for scenarios such as spaces around a hyphen - like that.

For my purposes, this was far more accurate:

SELECT 
  LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}',' ')) - 
  LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}', '')) wordCount FROM myTable;

It treats any run of 1 or more consecutive characters from [space, linefeed, tab, pipe, hyphen] as a single separator and counts those runs. Note that for text with no leading or trailing separator, the number of runs is one less than the number of words, so add 1 if you need the exact count.
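To see what the difference of the two lengths measures, the sketch below applies the same character class with Python's `re.sub`. Python's regex engine is only a stand-in for the database's REGEXP_REPLACE, and the helper `separator_runs` is hypothetical.

```python
import re

# Same character class as in the query: space, linefeed, tab, pipe, hyphen.
SEPARATORS = r"[ \n\t\|\-]{1,}"

def separator_runs(text):
    # Each run collapses to one character in the first string and to nothing
    # in the second, so the length difference equals the number of runs.
    return len(re.sub(SEPARATORS, " ", text)) - len(re.sub(SEPARATORS, "", text))

no_trailing = separator_runs("one two  -  three")   # three words
with_trailing = separator_runs("well-known fact\n") # three words, ends in newline
print(no_trailing, with_trailing)
```

The first value has three words but only two separator runs, while the second ends with a separator and so yields three. Whether the raw difference equals the word count therefore depends on how the text begins and ends.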
