
Smart split of a large text into words and signs, such as spaces and other characters

I'm working on a web application for text processing. I need to split a string (text) into words and signs, such as spaces and other characters (comma, period, semicolon, etc.). Every word and every sign needs to be wrapped in an HTML tag. Every tag must have an id attribute that contains the ordinal number of the word (or sign) in the text. The processing will run in a Java Servlet, so performance is important. The processed text may contain 3000 to 5000 words.

Here is a sample input:

One two three, four five six seven eight nine.

Here is a sample output:

<span id="w1" class="word">One</span><span id="w2" class="space">&nbsp;</span><span id="w3" class="word">two</span><span id="w4" class="space">&nbsp;</span><span id="w5" class="word">three</span><span id="w6" class="sign">,</span><span id="w7" class="space">&nbsp;</span><span id="w8" class="word">four</span><span id="w9" class="space">&nbsp;</span><span id="w10" class="word">five</span><span id="w11" class="space">&nbsp;</span><span id="w12" class="word">six</span><span id="w13" class="space">&nbsp;</span><span id="w14" class="word">seven</span><span id="w15" class="space">&nbsp;</span><span id="w16" class="word">eight</span><span id="w17" class="space">&nbsp;</span><span id="w18" class="word">nine</span><span id="w19" class="sign">.</span>

Thanks in advance for any advice on how I can do this.

Update: The code below splits the string on non-alphanumeric symbols

text.split("[^a-zA-Z0-9]")

and this code:

text.split("\\b[a-zA-Z0-9]+\\b")

splits the string on words, but I don't understand how to combine the two regexes so that I get both the words and the non-alphanumeric symbols.
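To see why neither call is enough on its own, here is a small check in Java. The class name SplitDemo, its method names, and the sample string are made up for illustration. String.split discards whatever the pattern matches, so each call keeps only half of the tokens:

```java
import java.util.Arrays;

final class SplitDemo {
    // Splits on every non-alphanumeric character: the separators are lost.
    static String[] wordsOnly(String text) {
        return text.split("[^a-zA-Z0-9]");
    }

    // Splits on every word: the words are lost, only the separators remain.
    static String[] separatorsOnly(String text) {
        return text.split("\\b[a-zA-Z0-9]+\\b");
    }

    public static void main(String[] args) {
        String text = "One two, three";
        // Four items: "One", "two", "" (between the comma and the space), "three"
        System.out.println(Arrays.asList(wordsOnly(text)));
        // Three items: "" (before the first word), " ", ", "
        System.out.println(Arrays.asList(separatorsOnly(text)));
    }
}
```

Getting both halves in their original order calls for matching the tokens rather than splitting on them.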

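Update 2:

This seems to be the answer:

val text = "Hello from Scala - regex  world!"
// Inside [...] a "|" is a literal pipe, not an "or", so the ranges are simply listed.
// \\b is not needed: the greedy + already consumes a whole run of letters and digits.
val pattern = "[^a-zA-Z0-9а-яА-Я]|[a-zA-Z0-9а-яА-Я]+".r
// Prints every token in order: each word, and each separator character on its own.
pattern.findAllIn(text).matchData foreach {
  m => println("'" + m.group(0) + "'")
}

The а-яА-Я ranges in the pattern are the Cyrillic counterpart of a-zA-Z.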

I can't give you the full code, but if it can get you started... I recommend:

  1. "Splitting" your string into the groups of characters you want by matching it with

     /\b[a-zA-Z]+\b|[^a-zA-Z]/g

    This regex matches either a word, \b[a-zA-Z]+\b (\b being a word boundary), or any single non-alphabetic character, [^a-zA-Z]. In a Java string literal each backslash has to be written twice (\\b). You will end up with a list of matches.

  2. Going through your matches one by one, wrapping your results in the correct tags by incrementing the id and checking:

    • if the first character is a space, then class="space"
    • if the first character is a letter, then class="word"
    • else class="sign"
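The two steps above can be sketched in Java roughly as follows. This is a minimal sketch, not a definitive implementation: the class SpanWrapper and its methods wrap, isAsciiLetter and escape are hypothetical names, and the HTML escaping of signs is an addition that the steps above do not mention. The pattern is the one from step 1 without the \b anchors, which is an adaptation: the greedy + already takes a whole run of letters, and with the anchors a run of letters that touches a digit or an underscore would match neither alternative and would be skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpanWrapper {
    // A whole run of letters, or any single other character.
    // Compiled once: Pattern instances are immutable and safe to share between threads.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+|[^a-zA-Z]");

    static String wrap(String text) {
        StringBuilder out = new StringBuilder(text.length() * 8);
        Matcher m = TOKEN.matcher(text);
        int id = 0;
        while (m.find()) {
            String token = m.group();
            char first = token.charAt(0);
            String cssClass;
            String body;
            if (first == ' ') {
                cssClass = "space";
                body = "&nbsp;";
            } else if (isAsciiLetter(first)) {
                cssClass = "word";
                body = token;            // letters need no escaping
            } else {
                cssClass = "sign";
                body = escape(token);    // assumption: signs are HTML-escaped
            }
            out.append("<span id=\"w").append(++id)
               .append("\" class=\"").append(cssClass).append("\">")
               .append(body).append("</span>");
        }
        return out.toString();
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Escapes the few characters that would otherwise break the markup.
    private static String escape(String sign) {
        switch (sign) {
            case "&": return "&amp;";
            case "<": return "&lt;";
            case ">": return "&gt;";
            case "\"": return "&quot;";
            default: return sign;
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap("One two three, four."));
    }
}
```

The loop makes a single pass over the matches and appends to one StringBuilder, so the work grows linearly with the length of the text.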

Careful: the first regex will count ... as three separate characters, and 123 as three separate signs. You could adapt it with

/\b[a-zA-Z]+\b|\b\d+\b|\.\.\.|[^a-zA-Z]/g

and add as many special cases as you want, you get the idea.
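As an illustration of how the special cases behave, here is a small Java sketch. The class name ExtendedTokenizer and the method tokenize are hypothetical, and the \b anchors are left out as an adaptation, since the greedy + and \d+ already consume whole runs. The point it shows is that the alternatives are tried from left to right, so every special case has to appear before the single-character catch-all:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtendedTokenizer {
    // Order matters: words, then digit runs, then "...", and only then
    // the catch-all that takes one non-letter character at a time.
    private static final Pattern TOKEN =
            Pattern.compile("[a-zA-Z]+|\\d+|\\.\\.\\.|[^a-zA-Z]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Seven tokens: "Wait", "...", " ", "123", " ", "done", "."
        System.out.println(tokenize("Wait... 123 done."));
    }
}
```

If the catch-all were moved in front of the ellipsis alternative, each dot would again come out as a separate token.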
