How to do a word count on every row of a CSV file?

I have a CSV file with an ID field and a TEXT field. I need to add a third field containing the word count of the TEXT field for every row. How should I proceed?

Example: If this is my starting data frame

  ID                                 TEXT
1  1           Lorem ipsum dolor sit amet
2  2           Praesent venenatis nisl id
3  3 Nunc dapibus maximus vulputate. Nunc

then the desired result is

  ID                                 TEXT WordCount
1  1           Lorem ipsum dolor sit amet         5
2  2           Praesent venenatis nisl id         4
3  3 Nunc dapibus maximus vulputate. Nunc         5

I would use the handy stri_count_words() function from the stringi package.

df$WordCount <- stringi::stri_count_words(df$TEXT)

which gives

  ID                                 TEXT WordCount
1  1           Lorem ipsum dolor sit amet         5
2  2           Praesent venenatis nisl id         4
3  3 Nunc dapibus maximus vulputate. Nunc         5

Alternatively, in base R you can remove the punctuation, split on whitespace with strsplit(), and then take the lengths of the resulting list elements.

lengths(strsplit(gsub("[[:punct:]]", "", df$TEXT), "\\s+"))
# [1] 5 4 5
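
To make the one-liner easier to follow, here is a minimal sketch that splits it into its three steps. The data frame is rebuilt from the example in the question, and the intermediate names no_punct and tokens are only illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(
  ID = 1:3,
  TEXT = c("Lorem ipsum dolor sit amet",
           "Praesent venenatis nisl id",
           "Nunc dapibus maximus vulputate. Nunc"),
  stringsAsFactors = FALSE
)

no_punct <- gsub("[[:punct:]]", "", df$TEXT)  # step 1: drop punctuation characters
tokens   <- strsplit(no_punct, "\\s+")        # step 2: one character vector of tokens per row
df$WordCount <- lengths(tokens)               # step 3: number of tokens in each row

df$WordCount
# [1] 5 4 5
```

One thing to watch for: a string with leading whitespace produces an empty first token from strsplit(), so wrapping the input in trimws() first is probably a sensible precaution on real data.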

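Or, as @David suggests, just count the runs of whitespace and add 1. trimws() removes any stray spaces at the beginning or end of the string. Note that this assumes every string contains at least two words: gregexpr() returns -1 when there is no match, so an empty or single-word string is counted as 2.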

lengths(gregexpr("\\s+", trimws(df$TEXT))) + 1L
# [1] 5 4 5
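
Since the question starts from a CSV file, the whole round trip might look like the sketch below. The file paths are hypothetical (temporary files stand in for the real input and output), the example rows are written out first only so that the sketch runs on its own, and the base R splitting approach is used for the count; stringi::stri_count_words() could be swapped in on that line.

```r
# Hypothetical paths: temporary files stand in for the real CSV files
in_path  <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")

# Write the question's example rows so the sketch is self-contained
write.csv(
  data.frame(ID = 1:3,
             TEXT = c("Lorem ipsum dolor sit amet",
                      "Praesent venenatis nisl id",
                      "Nunc dapibus maximus vulputate. Nunc")),
  in_path, row.names = FALSE
)

# Read the CSV, keeping TEXT as character rather than factor
df <- read.csv(in_path, stringsAsFactors = FALSE)

# Trim, strip punctuation, split on whitespace, count tokens per row
df$WordCount <- lengths(strsplit(gsub("[[:punct:]]", "", trimws(df$TEXT)), "\\s+"))

# Write the result, now with the third column, back to disk
write.csv(df, out_path, row.names = FALSE)
```

Passing row.names = FALSE to write.csv() keeps R's row numbers out of the output file, so it contains only the ID, TEXT and WordCount columns.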
