I would like to split a large string at commas or semicolons into chunks with a maximum size of n.
This similar question is very close to my situation, but what I really want is to split at a comma or semicolon, subject to an n_max_size limit.
My situation: I am using a text-to-speech service to convert text to voice. The service provider limits each request to 100 words, so I have to split an article into several substrings. If I just split it into fixed-size pieces, the pauses and tone of the voice do not sound like a human's.
What would be the best way in terms of performance to do this?
From comments I understand you don't want to split at each comma or semi-colon, but only when the maximum size is about to be reached. Also you want to keep the delimiter (the comma or semi-colon where you split at) in the result.
To add a max-size limit to the regular expression, you can use a regex like `.{1,100}`, where 100 is that maximum (for example). If your engine does not support the dotAll flag (yet), then use `[^]` instead of `.` to ensure that even newline characters are matched here.
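As a minimal sketch of this first step only, the following cuts a string into pieces of at most `maxSize` characters and ignores delimiters entirely. The name `chunkBySize` is hypothetical, and the sketch assumes `maxSize` is a positive integer.

```javascript
// Hypothetical helper: fixed-size chunking with no regard for delimiters.
// [^] matches any character, including newlines, without the dotAll flag.
const chunkBySize = (s, maxSize) =>
    s.match(new RegExp("[^]{1," + maxSize + "}", "g")) || []; // match() returns null when nothing matches

console.log(chunkBySize("abcdefghij", 4)); // ["abcd", "efgh", "ij"]
console.log(chunkBySize("ab\ncd", 3));     // ["ab\n", "cd"]
```

This is the naive fixed-size split that produces unnatural pauses, so it only serves as the base on which the delimiter condition is built.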
To ensure that the split happens just after a delimiter, add `(.$|[,;])` to the regex, and reduce the previous `{1,100}` to `{1,99}`.
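To illustrate what this intermediate regex does on its own, here is a small sketch. The name `strictSplit` is hypothetical, and the sketch assumes `maxSize` is an integer of at least 2.

```javascript
// Hypothetical helper: every chunk must end right after a comma or semicolon,
// or at the very end of the string, and may not exceed maxSize characters.
const strictSplit = (s, maxSize) =>
    s.match(new RegExp("[^]{1," + (maxSize - 1) + "}(.$|[,;])", "g")) || [];

console.log(strictSplit("one,two;three", 8));   // ["one,two;", "three"]
console.log(strictSplit("abcdefghijkl,xy", 5)); // ["ijkl,", "xy"]
```

In the second call the leading `abcdefgh` is missing from the result: no delimiter lies within reach, so the regex finds no match there and the scan simply moves on. Text without a delimiter in range is silently dropped by this version.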
Then there is the case where there is no delimiter in a substring of 100 or more characters: in that case the following code exceptionally allows a longer chunk, extending it until a delimiter is found. You may want to add white space (`\s`) as a possible delimiter too.
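A possible variant that also treats white space as a delimiter could look like the sketch below. The name `mySplitWs` is hypothetical, and adding `\s` to both character classes is one way to do it, not necessarily the only one.

```javascript
// Hypothetical variant: commas, semicolons and white space all count as
// split points. Chunks longer than maxSize occur only when a single run
// without any of these characters is itself too long.
const mySplitWs = (s, maxSize = s.length) =>
    s.match(new RegExp("(?=\\S)([^]{1," + (maxSize - 1) + "}|[^,;\\s]*)(.$|[,;\\s])", "g")) || [];

console.log(mySplitWs("hello,this is a longer sentence without commas;but no problem", 20));
// ["hello,this is a ", "longer sentence ", "without commas;but ", "no problem"]
```

Because the match is greedy, a chunk ends at the last delimiter that still fits, which may be a space rather than a comma or semicolon. The delimiter is kept, so chunks can end in a trailing space; calling `trim()` on each chunk removes it if the service does not want it.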
Here is a function that takes the size as argument and creates the corresponding regex:
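    // Build the regex for the given maximum chunk size and return all matches.
    const mySplit = (s, maxSize = s.length) =>
        s.match(new RegExp("(?=\\S)([^]{1," + (maxSize - 1) + "}|[^,;]*)(.$|[,;])", "g"));

    console.log(mySplit("hello,this is a longer sentence without commas;but no problem", 20));
    // ["hello,", "this is a longer sentence without commas;", "but no problem"]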
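Two edge cases may be worth guarding against before sending chunks to a service: `match` returns `null` when nothing matches (for example on an empty string), and a `maxSize` of 1 would produce the invalid quantifier `{1,0}`. The wrapper below is only a sketch. `safeSplit` is a hypothetical name, and `mySplit` is restated so that the block is self-contained.

```javascript
// Restated from the answer so this block runs on its own.
const mySplit = (s, maxSize = s.length) =>
    s.match(new RegExp("(?=\\S)([^]{1," + (maxSize - 1) + "}|[^,;]*)(.$|[,;])", "g"));

// Hypothetical wrapper: validates maxSize and always returns an array.
const safeSplit = (s, maxSize) => {
    if (!Number.isInteger(maxSize) || maxSize < 2) {
        throw new RangeError("maxSize must be an integer >= 2");
    }
    return mySplit(s, maxSize) || [];
};

const chunks = safeSplit("hello,this is a longer sentence without commas;but no problem", 20);
// Chunks that exceeded the limit because no delimiter was in reach.
const tooLong = chunks.filter(chunk => chunk.length > 20);

console.log(chunks.length);     // 3
console.log(tooLong);           // ["this is a longer sentence without commas;"]
console.log(safeSplit("", 20)); // []
```

Listing the over-long chunks makes it possible to decide separately how to handle them, for instance by splitting them again at white space.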