I have a sentence:
text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"
I would like to extract everything from the word tagged /IN up to the last word tagged /NNP.
The code so far extracts Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP, but I want it to stop when it meets either a /: or an /IN tag. Here is the code so far:
import re

def entityExtract(text):
    # Earlier attempt, which only allowed the match to end on /NNP:
    # text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    # Current attempt: start at a word tagged /IN, end at the last /NNP or /CDP
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/(?:NNP|CDP)\b)', text)
    return text

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"
extract = entityExtract(text)
print(text)
print(extract)
Output:
['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP']
Expected result is:
['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
What is the best way to solve it?
[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b
I am as confused as @DYZ about where you want to stop, so I based my regex on your expected output.
I believe you want to extract 'word/tag' sections of the string, where word and tag are strongly coupled.
The tags at which you want to stop, without including them, are controlled by this group: (?!IN\b|:\b|NN\b)
Check regex here
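For what it's worth, here is a minimal sketch of how the pattern above could be dropped into the question's function. One adaptation that is mine, not part of the answer's pattern: the repeating group is written as non-capturing, (?: ... ), because re.findall returns the captured group rather than the whole match when a pattern contains a capturing group. The name entity_extract is only illustrative.

```python
import re

# Pattern from the answer above, with the repeating group made
# non-capturing (an adaptation) so that findall returns whole matches.
PATTERN = re.compile(
    r'[^\s/]*/IN\b(?:[^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b'
)

def entity_extract(text):
    """Return every span that starts at a word tagged /IN and ends on /NNP."""
    return PATTERN.findall(text)

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")
print(entity_extract(text))
```

On the sample sentence this should give the single span that runs from Depan/IN to Besok/NNP.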
I've looked at the answer from @bulbus and the regex that @ytomo showed in the comments, which is:
[^\s/]*/IN\b[^/]*(?:/(?!IN\b|:\b)[^/]*\b)*/(?:NNP|CDP)\b
My problem is that this one, like the other proposals, does not follow a logical order in building a regex for the problem at hand. Let me show you:
The first part, before the repeating group, is [^\s/]*/IN\b[^/]*, which I'm going to simplify to \w+/IN\b. The [^/]* matches more than you should want it to. Look at example 1.
What you're solving here, in words, is:
Translate that directly to a regex and you'll come up with a more readable version. (JMHO)
Read the first group after the IN-group (example 2):
\w+/IN\b(\s[^/]+/[^\s]+)
Repeat that second group (example 3):
\w+/IN\b(\s[^/]+/[^\s]+)*
Ignore :/: and \w+/IN groups (example 4):
\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*
Make sure your last group is NNP or CDP (example 5):
\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(NNP|CDP)\b
If we compare this one to the regex proposed by @ytomo in the comments of the preceding answer, there is not much difference. However, the reason I bothered to answer is that a regex should be readable and follow some logic. Your code is going to be in production tomorrow and, when it breaks, someone has to check it under time pressure.
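To try the final pattern from these steps in Python, something like the sketch below should work. It assumes you want whole matches back from re.findall, so both groups are rewritten as non-capturing, (?: ... ); that rewrite and the names STEPWISE and extract_from_in are my own, for illustration only.

```python
import re

# Final pattern from the steps above, with both groups made non-capturing
# (an adaptation) so that findall yields the full matched text.
STEPWISE = re.compile(
    r'\w+/IN\b'                        # the word tagged /IN
    r'(?:\s[^:/]+/(?!IN|:)[^\s]+)*'    # following word/tag pairs, but not /IN or :/:
    r'\s\w+/(?:NNP|CDP)\b'             # the last pair must be NNP or CDP
)

def extract_from_in(text):
    """Return spans that start at an /IN word and end on an /NNP or /CDP word."""
    return STEPWISE.findall(text)

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")
print(extract_from_in(text))
```

Splitting the pattern over several commented string literals keeps each step visible, which is the point of building it up in a logical order.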