Split character vector based on length

I have a character vector like below.

text <- c(
  "My test",
  "Test2",
  "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))"
)

If the total number of characters in the vector above is greater than 100, I want to split it into multiple character vectors, each with fewer than 100 characters in total. I tried a quantile-based approach, but it does not work: the first 3 elements contain far less text than elements 5 through 11, so cutting by element position is not robust and is error-prone.

nRun <- ceiling(sum(nchar(text), na.rm = TRUE) / 100)
cutsIter <- ceiling(quantile(seq_along(text), probs = seq.int(0, 1, 1 / nRun)))

New character vector:

text[cutsIter[1]:cutsIter[2]]

Desired result: the first 5 elements should be in one vector, the 6th and 7th in the next vector, and so on.
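To see why cutting by position fails, it helps to look at the per-element lengths and their running total. A quick check, repeating the vector from the question so the snippet runs on its own (the variable name `n` is only for illustration):

```r
text <- c(
  "My test",
  "Test2",
  "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))"
)

n <- nchar(text)  # number of characters in each element
n
cumsum(n)         # running total over the whole vector
sum(n) > 100      # TRUE here, so the vector needs to be split
```

The first five elements total 65 characters and the sixth alone has 53, so the first vector has to stop at element 5 regardless of how many elements the vector has.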

Here is one way you could do it with a small custom function. There is probably a more elegant way, and this solution can certainly be improved. The function walks through the character counts and records how many consecutive elements fit in each group while the running total stays below 100. An element that reaches 100 characters on its own is put in a group by itself; adjust that rule to your preference.

fn <- function(x, limit = 100) {
  out <- integer(0)
  while (length(x) > 0) {
    # Count the leading elements whose running total stays below the limit
    ind <- sum(cumsum(x) < limit)
    # An element that reaches the limit on its own still forms a group
    ind <- max(ind, 1L)
    out <- c(out, ind)
    # Drop the elements just assigned and continue with the rest
    x <- x[-seq_len(ind)]
  }
  out
}

# The function returns the size of each group, which we use to split the vector
tmp <- fn(nchar(text))
tmp
[1] 5 2 1 2 1

Applying it to our vector text:

split(text, rep(seq_len(length(tmp)), tmp))

$`1`
[1] "My test"                    "Test2"                      "Tests"                     
[4] "Dolphin Sentimental S.r.l." "Tiger Sentiyapa S.r.l."    

$`2`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`3`
[1] "(Grant data) (Pension Received (Monthly Basis))"

$`4`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`5`
[1] "(Grant data) (Pension Received (Monthly Basis))"
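The `rep()` call is what turns the group sizes into a grouping vector that `split()` can use. A minimal illustration, assuming the sizes shown above and using `letters` as stand-in data (`grp` is a name chosen for this example):

```r
tmp <- c(5, 2, 1, 2, 1)          # group sizes, as returned by fn()
grp <- rep(seq_along(tmp), tmp)  # repeat each group id once per member
grp
#> [1] 1 1 1 1 1 2 2 3 4 4 5

split(letters[1:11], grp)        # stand-in data, grouped the same way
```

Each element of the original vector gets the id of the group it belongs to, so the grouping vector always has the same length as `text`.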

Finally, if you would like to create each group as a separate vector in the global environment:

split(text, rep(seq_len(length(tmp)), tmp)) |>
  setNames(paste0("vec", seq_along(tmp))) |>
  list2env(envir = globalenv())

The MESS package provides a ready-made function, MESS::cumsumbinning(), that handles this kind of problem directly:

text <- c(
  "My test",
  "Test2",
  "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))"
)

library(MESS)

split(text, cumsumbinning(nchar(text), 100))
#> $`1`
#> [1] "My test"                    "Test2"                     
#> [3] "Tests"                      "Dolphin Sentimental S.r.l."
#> [5] "Tiger Sentiyapa S.r.l."    
#> 
#> $`2`
#> [1] "Effort rate calculates to grant (Debt to Income Rate)"
#> [2] "Amount of pensions received mens."                    
#> 
#> $`3`
#> [1] "(Grant data) (Pension Received (Monthly Basis))"      
#> [2] "Effort rate calculates to grant (Debt to Income Rate)"
#> 
#> $`4`
#> [1] "Amount of pensions received mens."              
#> [2] "(Grant data) (Pension Received (Monthly Basis))"

If you want to save each element of the list above as a separate object, use list2env():

split(text, cumsumbinning(nchar(text), 100)) |>
  list2env(envir = .GlobalEnv)
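One thing to watch for: `split()` names its pieces "1", "2", and so on, so the objects created by `list2env()` get non-syntactic names that have to be quoted with backticks. A small sketch of renaming the pieces first, as the earlier answer does; the stand-in data and the scratch environment `env` are assumptions made so the snippet runs without MESS:

```r
# Stand-in for split(text, cumsumbinning(nchar(text), 100))
pieces <- split(c("a", "b", "c"), c(1, 1, 2))
names(pieces)  # "1" "2": awkward as variable names

# Rename, then export into a scratch environment (hypothetical, for the demo)
env <- new.env()
list2env(setNames(pieces, paste0("vec", seq_along(pieces))), envir = env)

ls(env)        # "vec1" "vec2"
env$vec1
```

Replacing `env` with `.GlobalEnv` would create `vec1`, `vec2`, ... directly in the workspace.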


If each group's total must stay strictly below 100, use a threshold of 99 instead:

split(text, cumsumbinning(nchar(text), 99))

$`1`
[1] "My test"                   
[2] "Test2"                     
[3] "Tests"                     
[4] "Dolphin Sentimental S.r.l."
[5] "Tiger Sentiyapa S.r.l."    

$`2`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`3`
[1] "(Grant data) (Pension Received (Monthly Basis))"

$`4`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`5`
[1] "(Grant data) (Pension Received (Monthly Basis))"
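For readers who cannot install MESS, the same grouping can be approximated in base R. The helper below, `bin_by_cumsum()`, is a hypothetical name and is not MESS's implementation; it is a sketch that assumes a new group starts whenever adding the next value would push the running total above the threshold, which is the behaviour the outputs above suggest.

```r
# Hypothetical helper, not part of MESS: returns a group id per element,
# starting a new group when the running total would exceed `threshold`.
bin_by_cumsum <- function(x, threshold) {
  group <- integer(length(x))
  running <- 0
  current <- 1L
  for (i in seq_along(x)) {
    if (i > 1 && running + x[i] > threshold) {
      current <- current + 1L  # open a new group
      running <- 0             # and restart the running total
    }
    running <- running + x[i]
    group[i] <- current
  }
  group
}

# nchar(text) for the vector in the question
n <- c(7, 5, 5, 26, 22, 53, 33, 47, 53, 33, 47)

bin_by_cumsum(n, 100)
#> [1] 1 1 1 1 1 2 2 3 3 4 4
bin_by_cumsum(n, 99)
#> [1] 1 1 1 1 1 2 2 3 4 4 5
```

With this in place, `split(text, bin_by_cumsum(nchar(text), 99))` should give the same five groups as shown above. A single element larger than the threshold ends up in a group of its own.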
