Let's say we have a sentence like this,
string = "He/PRP has/VBZ some/DT well/RB made/VBN clothes/NNS made/VBN by/IN a/DT Italian/JJ American/JJ tailor/NN in/IN the/DT Italian/JJ club/NN ./."
and I have a list of compound words to be highlighted.
target = ['He', 'wellmade', 'ItalianAmerican']
and I want to get a result that looks like this:
"[He/PRP] has/VBZ some/DT [well/RB made/VBN] clothes/NNS made/VBN by/IN a/DT [Italian/JJ American/JJ] tailor/NN in/IN the/DT Italian/JJ club/NN ./."
It is assumed that each target item is shorter than the corresponding tokens in the sentence. I think I should first find the span that corresponds to each target item and then insert the brackets, but I can't work out how to implement that in code. Please give me a hint. Thanks!
'He' is easy. The problems begin with 'wellmade', because it is a compound word that is split across tokens in the input string, and each token has a tag suffix appended. I'd suggest turning your target
items into regex patterns with optional groups: (?:\/[A-Z]+\s*|\s)?
should be inserted after each letter but the last, and (?:\/[A-Z]+)?
after the last letter.
Have a look at a sample regex for ItalianAmerican:
I(?:\/[A-Z]+\s*|\s)?t(?:\/[A-Z]+\s*|\s)?a(?:\/[A-Z]+\s*|\s)?l(?:\/[A-Z]+\s*|\s)?i(?:\/[A-Z]+\s*|\s)?a(?:\/[A-Z]+\s*|\s)?n(?:\/[A-Z]+\s*|\s)?A(?:\/[A-Z]+\s*|\s)?m(?:\/[A-Z]+\s*|\s)?e(?:\/[A-Z]+\s*|\s)?r(?:\/[A-Z]+\s*|\s)?i(?:\/[A-Z]+\s*|\s)?c(?:\/[A-Z]+\s*|\s)?a(?:\/[A-Z]+\s*|\s)?n(?:\/[A-Z]+)?
Have a look at the demo example.
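The per-letter pattern does not have to be written by hand. Here is a minimal sketch of how it could be generated from the target list, assuming every tag consists only of uppercase letters (a tag such as PRP$ would need a wider character class). The helper names target_to_pattern and bracket_targets are made up for illustration. The slash is left unescaped because Python's re module does not require escaping it.

```python
import re

# Optional filler allowed between two letters of a target: a POS tag
# (optionally followed by whitespace) or a single whitespace character.
BETWEEN = r'(?:/[A-Z]+\s*|\s)?'
# Optional POS tag allowed after the last letter of a target.
LAST = r'(?:/[A-Z]+)?'


def target_to_pattern(target):
    """Build the per-letter pattern for one target item (hypothetical helper)."""
    letters = [re.escape(ch) for ch in target]
    return BETWEEN.join(letters) + LAST


def bracket_targets(sentence, targets):
    """Wrap every span matching one of the targets in square brackets."""
    # Try longer targets first so a short target cannot pre-empt a longer one.
    ordered = sorted(targets, key=len, reverse=True)
    combined = '|'.join(target_to_pattern(t) for t in ordered)
    return re.sub(combined, lambda m: '[' + m.group(0) + ']', sentence)


string = ("He/PRP has/VBZ some/DT well/RB made/VBN clothes/NNS made/VBN "
          "by/IN a/DT Italian/JJ American/JJ tailor/NN in/IN the/DT "
          "Italian/JJ club/NN ./.")
target = ['He', 'wellmade', 'ItalianAmerican']

result = bracket_targets(string, target)
print(result)
```

Note that the patterns are not anchored at word boundaries, so a short target like 'He' would also match inside a longer word that contains those letters with the same case. Adding \b or a lookbehind for whitespace may be worth considering for real data.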
Is this what you are looking for?
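import re
# Hard-coded alternation; each lazy .*? skips over the tags (and the space between the two parts) up to the next whitespace.
result = re.sub(r'((?:He|well.*?made|Italian.*?American).*?)(\s)', r'[\1]\2', string)
print(result)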