
Split string into chunks of maximum character count without breaking words

I want to split a string into chunks, each no longer than a maximum character count (say 2000), without splitting any word.

I tried the following:

text.chars.each_slice(2000).map(&:join)

but it sometimes splits words. I also tried a regex:

text.scan(/.{1,2000}\b|.{1,2000}/).map(&:strip)

from this question, but I don't quite understand how it works, and it behaves erratically, sometimes producing chunks that contain only periods.

Any pointers will be greatly appreciated.

You could do a Notepad-style word wrap.
Just construct the regex with the maximum number of characters per line as the quantifier range {1,N}.

The example below uses a maximum of 32 characters per line.

https://regex101.com/r/8vAkOX/1

Update: To include line breaks within the range, add the dot-all modifier (?s).
Otherwise, standalone line breaks are filtered out.

(?s)(?:((?>.{1,32}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,32})(?:\r?\n)?|(?:\r?\n|$))

The chunks are in $1, and you could replace with $1\r\n to get a display
that looks wrapped.

Explained

 (?s) # Span line breaks
 (?:
      # -- Words/Characters 
      (                       # (1 start)
           (?>                     # Atomic Group - Match words with valid breaks
                .{1,32}                 #  1-N characters
                                        #  Followed by one of 4 prioritized, non-linebreak whitespace
                (?:                     #  break types:
                     (?<= [^\S\r\n] )        # 1. - Behind a non-linebreak whitespace
                     [^\S\r\n]?              #      ( optionally accept an extra non-linebreak whitespace )
                  |  (?= \r? \n )            # 2. - Ahead a linebreak
                  |  $                       # 3. - EOS
                  |  [^\S\r\n]               # 4. - Accept an extra non-linebreak whitespace
                )
           )                       # End atomic group
        |  
           .{1,32}                 # No valid word breaks, just break on the N'th character
      )                       # (1 end)
      (?: \r? \n )?           # Optional linebreak after Words/Characters
   |  
      # -- Or, Linebreak
      (?: \r? \n | $ )        # Stand alone linebreak or at EOS
 )
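Since the question is about Ruby, here is a rough sketch of how the pattern above might be applied there. It assumes Ruby's regex engine, where dot-all is spelled with the m flag and \z marks the end of the string (used here in place of $). The method name wrap_chunks and the parameter n are hypothetical, not part of the answer.

```ruby
# Hypothetical helper: collects capture group 1 of the word-wrap pattern.
# Assumptions: the m flag stands in for (?s), and \z replaces $ as end of string.
def wrap_chunks(text, n)
  re = /
    (?:
      (
        (?>
          .{1,#{n}}
          (?:
              (?<=[^\S\r\n]) [^\S\r\n]?
            | (?=\r?\n)
            | \z
            | [^\S\r\n]
          )
        )
        |
        .{1,#{n}}
      )
      (?: \r?\n )?
      |
      (?: \r?\n | \z )
    )
  /mx
  # scan yields one-element arrays holding group 1; it is nil for
  # standalone line breaks and for the empty match at the end.
  text.scan(re).map(&:first).compact
end

chunks = wrap_chunks("The quick brown fox", 10)
p chunks
```

Chunks keep their trailing whitespace, and a chunk can be one character longer than n when the optional extra whitespace is accepted, so a final map(&:strip) may be wanted.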

Code

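def max_groups(str, n)
  arr = []
  pos = 0
  # \G anchors each match at pos. The greedy .{1,n} takes the longest run that
  # either ends with a space or is followed by a space or the end of the string.
  re = /\G.{1,#{n}}(?:(?<=[ ])|(?=[ ]|\z))/
  loop do
    break arr if pos == str.size
    m = str.match(re, pos)
    return nil if m.nil?   # a word longer than n cannot be placed
    arr << m[0]
    pos += m[0].size
  end
end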

Examples

str = "Now is the time for all good people to party"
  #    12345678901234567890123456789012345678901234
  #    0         1         2         3         4

max_groups(str, 5)
  #=> nil
max_groups(str, 6)
  #=> ["Now is", " the ", "time ", "for ", "all ", "good ", "people", " to ", "party"]
max_groups(str, 10)
  #=> ["Now is the", " time for ", "all good ", "people to ", "party"]
max_groups(str, 14)
  #=> ["Now is the ", "time for all ", "good people to", " party"]
max_groups(str, 15)
  #=> ["Now is the time", " for all good ", "people to party"]
max_groups(str, 29)
  #=> ["Now is the time for all good ", "people to party"]
max_groups(str, 43)
  #=> ["Now is the time for all good people to ", "party"]
max_groups(str, 44)
  #=> ["Now is the time for all good people to party"]

str = "How        you do?"
  #    123456789012345678
  #    0         1

max_groups(str, 4)
  #=> ["How ", "    ", "   ", "you ", "do?"]
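The same chunking rule can also be sketched more compactly with String#scan. The name max_groups_scan is hypothetical, and the pattern is one assumed way to express the rule visible in the examples above: each chunk is the longest run of at most n characters that ends in a space or is followed by a space or the end of the string.

```ruby
# Hypothetical compact variant built on String#scan.
# \G makes every match start exactly where the previous one ended,
# so scanning stops at the first word that does not fit in n characters.
def max_groups_scan(str, n)
  chunks = str.scan(/\G.{1,#{n}}(?:(?<=[ ])|(?=[ ]|\z))/)
  # If the chunks do not cover the whole string, some word was too long.
  chunks.sum(&:size) == str.size ? chunks : nil
end

str = "Now is the time for all good people to party"
p max_groups_scan(str, 10)
p max_groups_scan(str, 5)
```

For the 2000-character case in the question, this would be called as max_groups_scan(text, 2000), with map(&:strip) applied afterwards if the surrounding spaces are not wanted.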

This is what worked for me (thanks to @StefanPochmann's comments):

text = "Some really long string\nwith some line breaks"

The following first collapses every run of whitespace (including line breaks) into a single space, then breaks the string up.

text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)

The resulting chunks will lose all the line breaks (\n) from the original string. If you need to keep the line breaks, replace them all with a placeholder that does not otherwise occur in the text (before applying the regex), for example (br), which you can use to restore the line breaks later. Like this:

text = "Some really long string\nwith some line breaks".gsub("\n", "(br)")

After we run the regex, we can restore the line breaks in the new chunks by replacing all occurrences of (br) with \n, like this:

chunks = text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
chunks.each{|chunk| chunk.gsub!('(br)', "\n")}

Looks like a long process but it worked for me.
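The steps above can be gathered into one method. This is only a sketch: the name chunk_text, the max parameter, and the placeholder keyword are hypothetical, and it assumes the placeholder never occurs in the input text.

```ruby
# Hypothetical helper combining the steps described above.
# Assumption: the placeholder string does not occur in the original text.
def chunk_text(text, max, placeholder: "(br)")
  text
    .gsub("\n", placeholder)          # protect line breaks
    .gsub(/\s+/, " ")                 # collapse remaining whitespace runs
    .scan(/.{1,#{max}}(?: |$)/)       # break only at a space or at the end
    .map { |chunk| chunk.strip.gsub(placeholder, "\n") } # restore line breaks
end

p chunk_text("Some really long string\nwith some line breaks", 20)
```

Because each line break becomes a placeholder with no surrounding spaces, the words on either side of it are treated as a single word when chunking. A word longer than max is not handled by this pattern; scan skips the characters that do not fit.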

