
Python: search for specific word sequences in a POS-tagged sentence and highlight them

Let's say we have a POS-tagged sentence like this:

 string = "He/PRP has/VBZ some/DT well/RB made/VBN clothes/NNS made/VBN by/IN a/DT Italian/JJ American/JJ tailor/NN in/IN the/DT Italian/JJ club/NN ./."

and I have a list of compound words to be highlighted.

target = ['He', 'wellmade', 'ItalianAmerican']

and I want a result that looks like this:

"[He/PRP] has/VBZ some/DT [well/RB made/VBN] clothes/NNS made/VBN by/IN a/DT [Italian/JJ American/JJ] tailor/NN in/IN the/DT Italian/JJ club/NN ./."

It is assumed that each target item is shorter than the corresponding tokens in the sentence. I think I should first find the span that corresponds to each target item and then insert the brackets, but I can't work out how to implement this in code. Please give me a hint. Thanks!

It is easy with 'He'; the problems begin with 'wellmade', because it is a compound word that is split in the input string, with a POS-tag suffix appended to each part. I'd suggest turning your target items into regex patterns with optional groups: (?:\/[A-Z]+\s*|\s)? should be inserted after each letter but the last, and (?:\/[A-Z]+)? after the last letter.

Have a look at a sample regex for ItalianAmerican:

I(?:\/[A-Z]+\s*|\s)?t(?:\/[A-Z]+\s*|\s)?a(?:\/[A-Z]+\s*|\s)?l(?:\/[A-Z]+\s*|\s)?i(?:\/[A-Z]+\s*|\s)?a(?:\/[A-Z]+\s*|\s)?n(?:\/[A-Z]+\s*|\s)?A(?:\/[A-Z]+\s*|\s)?m(?:\/[A-Z]+\s*|\s)?e(?:\/[A-Z]+\s*|\s)?r(?:\/[A-Z]+\s*|\s)?i(?:\/[A-Z]+\s*|\s)?c(?:\/[A-Z]+\s*|\s)?a(?:\/[A-Z]+\s*|\s)?n(?:\/[A-Z]+)?

Have a look at the demo example.
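
Writing such a pattern by hand for every target is tedious, so here is a minimal sketch of how the patterns could be generated from the target list. The helper names build_pattern and highlight are hypothetical, not part of the original answer. The sketch also assumes tags consist only of uppercase letters (a tag such as PRP$ would need a wider character class), and it adds whitespace lookarounds so that a short target like 'He' cannot match inside a longer word; that guard is an addition to the pattern described above.

```python
import re

def build_pattern(target: str) -> str:
    """Allow an optional POS tag and/or whitespace between the letters of `target`."""
    between = r"(?:/[A-Z]+\s*|\s)?"  # a tag (plus optional spaces) or one whitespace char
    tail = r"(?:/[A-Z]+)?"           # optional tag after the last letter
    return between.join(re.escape(ch) for ch in target) + tail

def highlight(tagged: str, targets: list[str]) -> str:
    """Wrap every span of `tagged` that spells out one of `targets` in brackets."""
    # Try longer targets first so a short target cannot pre-empt a longer one.
    ordered = sorted(targets, key=len, reverse=True)
    alternation = "|".join(build_pattern(t) for t in ordered)
    # Assumption: a match must start and end at a whitespace boundary.
    pattern = r"(?<!\S)(?:" + alternation + r")(?!\S)"
    return re.sub(pattern, lambda m: "[" + m.group(0) + "]", tagged)

string = ("He/PRP has/VBZ some/DT well/RB made/VBN clothes/NNS made/VBN by/IN "
          "a/DT Italian/JJ American/JJ tailor/NN in/IN the/DT Italian/JJ club/NN ./.")
target = ['He', 'wellmade', 'ItalianAmerican']
print(highlight(string, target))
```

Note that in Python the slash does not need escaping, so / and \/ behave the same inside the pattern.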

Is this what you are looking for?

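import re
# `string` is the tagged sentence from the question; each lazy .*? bridges the tags between the word parts
result = re.sub(r'((?:He|well.*?made|Italian.*?American).*?)(\s)', r'[\1]\2', string)
print(result)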
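
The span-first idea from the question (find the tokens that spell out a target, then insert the brackets) can also be sketched without regular expressions. The function name highlight_spans is hypothetical. The sketch assumes tokens are separated by whitespace and that the tag follows the last slash of each token; it also normalizes runs of whitespace to single spaces.

```python
def highlight_spans(tagged: str, targets: list[str]) -> str:
    """Bracket runs of consecutive tokens whose words concatenate to a target."""
    tokens = tagged.split()
    # Drop the POS tag: everything after the last "/" in each token.
    words = [tok.rsplit("/", 1)[0] for tok in tokens]
    wanted = set(targets)
    longest = max((len(t) for t in targets), default=0)

    out = []
    i = 0
    while i < len(tokens):
        joined = ""
        match_end = None
        for j in range(i, len(tokens)):
            joined += words[j]
            if len(joined) > longest:
                break  # no target can be this long
            if joined in wanted:
                match_end = j + 1  # keep scanning to prefer the longest span
        if match_end is None:
            out.append(tokens[i])
            i += 1
        else:
            out.append("[" + " ".join(tokens[i:match_end]) + "]")
            i = match_end
    return " ".join(out)

string = ("He/PRP has/VBZ some/DT well/RB made/VBN clothes/NNS made/VBN by/IN "
          "a/DT Italian/JJ American/JJ tailor/NN in/IN the/DT Italian/JJ club/NN ./.")
print(highlight_spans(string, ['He', 'wellmade', 'ItalianAmerican']))
```

Because it compares whole tokens, this version cannot match in the middle of a word, and a span never stretches across unrelated tokens the way a lazy .*? can.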
