How to limit text extraction until specific character using regex and python

Question

I have a sentence:

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

I would like to extract every word from the /IN tag up to the last word with the /NNP tag.

The code so far extracts Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP. But I want it to stop when it meets either a /: or an /IN tag. Here is the code so far:

import re

def entityExtract(text):
    # text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/(?:NNP|CDP)\b)', text)
    return text

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

extract = entityExtract(text)

print(text)
print(extract)

Output:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP']

Expected result is:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']

What is the best way to solve it?

Answer 1

[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b

I am as confused as @DYZ about where you want to stop, so I based my regex on your output.
I believe you want to extract 'word/tag' sections of the string, and that word and tag are strongly coupled.

The tags at which the match should stop, without being included, are controlled by a lookahead group such as (?!IN\b|:\b|NN\b)
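For anyone who wants to try this locally, here is a minimal sketch that runs the pattern above against the sentence from the question. One assumption that is mine and not part of the answer: the repeating group is written as non-capturing, (?:...), because re.findall returns the contents of capturing groups instead of the whole match when a pattern contains any.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# Pattern from the answer above. The only change (an assumption for this
# sketch) is (?:...) instead of (...), so findall yields whole matches.
pattern = re.compile(
    r'[^\s/]*/IN\b(?:[^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b'
)

matches = pattern.findall(text)
print(matches)
# ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```

With the original capturing group, re.findall would return only the last repetition of that group, so either keep the group non-capturing or use re.finditer and read group(0).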


Answer 2

I've looked at the answer from @bulbus and the regex that @ytomo showed in the comments, which is:

[^\s/]*/IN\b[^/]*(?:/(?!IN\b|:\b)[^/]*\b)*/(?:NNP|CDP)\b

My problem is that this one, like the other proposals, does not follow a logical order in building a regex for the problem at hand. Let me show you:

The first part, before the repeating group, [^\s/]*/IN\b[^/]*, which I'm going to simplify to \w+/IN\b[^/]*, matches more than you should want it to. Look at example 1.
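A quick way to see this without the linked example is to run only the simplified first part against the question's sentence. This is just an illustrative sketch; the variable names are mine.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# Only the part before the repeating group, in its simplified form.
prefix = re.search(r'\w+/IN\b[^/]*', text)

print(repr(prefix.group(0)))
# 'Depan/IN SMP'
```

The trailing [^/]* has already consumed the next word (" SMP") before its tag has been looked at, so the first part covers more than just the IN group.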

What you're solving here, in words, is:

read a \w+/IN group
followed by any number of \s[^/]+/\w+ groups that are not a \w+/IN\b group
for as long as you can read... until
...you've matched the last NNP or CDP group you can find.
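One way to keep that wording next to the pattern is Python's re.VERBOSE flag, which lets each line of the plan sit beside the piece of regex that implements it. This is a sketch, not the only way to write it: the name TAGGED_PHRASE is made up for illustration, the groups are non-capturing by my choice, and the ':' exclusion anticipates the "stop at :/:" requirement from the question.

```python
import re

# Hypothetical name; each comment restates one step of the plan above.
TAGGED_PHRASE = re.compile(r"""
    \w+/IN\b                # read a word/IN group
    (?:                     # followed by any number of word/tag groups
        \s[^:/]+/           #   whitespace, a word (no ':' or '/'), a slash
        (?!IN|:)            #   whose tag is neither IN nor ':'
        \S+                 #   the tag itself
    )*
    \s\w+/(?:NNP|CDP)\b     # until the last NNP or CDP group you can find
""", re.VERBOSE)

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

result = [m.group(0) for m in TAGGED_PHRASE.finditer(text)]
print(result)
# ['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']
```

Using finditer with group(0) sidesteps the question of what findall returns when a pattern contains groups.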

Translate that directly to a regex and you'll come up with a more readable version. (JMHO)

\w+/IN\b(\s[^/]+/[^\s]+) read the first group after the IN group (example 2)
\w+/IN\b(\s[^/]+/[^\s]+)* repeat that second group (example 3)
\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)* ignore :/: and \w+/IN groups (example 4)
\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(NNP|CDP)\b make sure your last group is NNP or CDP (example 5)
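Since the linked examples may not be at hand, here is a small sketch that runs the four steps above against the question's sentence, so you can watch the match grow and then get trimmed. It uses re.search and group(0) so that the capturing groups in the patterns do not affect what is printed; the list name steps is mine.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# The four patterns, in the order they are built up above.
steps = [
    r'\w+/IN\b(\s[^/]+/[^\s]+)',
    r'\w+/IN\b(\s[^/]+/[^\s]+)*',
    r'\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*',
    r'\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(NNP|CDP)\b',
]

# group(0) is the whole match, regardless of capturing groups.
results = [re.search(p, text).group(0) for p in steps]
for r in results:
    print(r)
# Depan/IN SMP/NNP
# Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP
# Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN
# Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP
```

Note that passing the last pattern to re.findall as-is would return tuples of group contents; switch the groups to (?:...) or keep using group(0) if you want the full matched text.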

If we compare this one to the regex @ytomo proposed in the comments on the preceding answer, there does not seem to be much difference. However, the reason I even bothered to answer is that a regex should be readable and follow some logic. Your code is going to be in production tomorrow, and when it breaks, someone has to check it under time pressure.

2 answers

solution1
2 ACCEPTED 2017-09-08 23:24:36

solution2
1 2017-09-09 02:14:25