
How to limit text extraction up to a specific character using regex and Python

I have a sentence:

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

I would like to extract every word from the /IN tag up to the last word with an /NNP tag.

The code so far extracts Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP. But I want it to stop when it meets either a /: or an /IN tag. Here is the code so far:

import re

def entityExtract(text):
    # text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/(?:NNP|CDP)\b)', text)
    return text

text = "Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP"

extract = entityExtract(text)

print(text)
print(extract)

Output:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP']

Expected result is:

['Depan/IN SMP/NNP 2/CDP Jombang/NNP Besok/NNP']

What is the best way to solve it?

[^\s/]*/IN\b([^/]*/(?!IN\b|:\b)[^\s^/]*\b)*[^/]*/NNP\b

I am as confused as @DYZ about where you want to stop, so I based my regex on your expected output.
I believe you want to extract 'word/tag' sections of the string, and that each word and its tag are strongly coupled.

Where the match stops, without including the stopping tag, is controlled by this group: (?!IN\b|:\b|NN\b)

Check regex here

I've looked at the answer from @bulbus and the regex that @ytomo showed in the comments, which is:

[^\s/]*/IN\b[^/]*(?:/(?!IN\b|:\b)[^/]*\b)*/(?:NNP|CDP)\b
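As a quick check, here is a minimal sketch that runs this pattern with re.findall on the sentence from the question. The names YTOMO_PATTERN and matches are mine, not from the comment. Because every group in the pattern is non-capturing, findall returns the whole matched text.

```python
import re

# Pattern quoted from the comment above. All groups are non-capturing,
# so re.findall returns whole matches rather than group contents.
YTOMO_PATTERN = r'[^\s/]*/IN\b[^/]*(?:/(?!IN\b|:\b)[^/]*\b)*/(?:NNP|CDP)\b'

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

matches = re.findall(YTOMO_PATTERN, text)
print(matches)
```

On the question's sentence this should give the expected single span ending in Besok/NNP.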

My problem is that this one, and the other proposals, do not follow a logical order in building a regex for the problem at hand. Let me show you:

The first part, before the repeating group, is [^\s/]*/IN\b[^/]*, which I'm going to simplify to \w+/IN\b[^/]*. It matches more than you want it to. Look at example 1.

What you're solving here, in words, is:

  • read a \w+/IN group
  • followed by any number of \s[^/]+/\w+ groups that are not \w+/IN\b groups
  • for as long as you can keep reading, until
  • you've matched the last NNP or CDP group you can find.
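The same wording can also be sketched without a regex, by walking the word/tag tokens one at a time. This is a different technique from the regex the answer builds, shown only to make the four steps concrete. The names extract_spans, START_TAG, STOP_TAGS and END_TAGS are hypothetical, and the sketch assumes tokens are separated by whitespace and the tag follows the last slash.

```python
START_TAG = "IN"            # a span begins at a token with this tag
STOP_TAGS = {"IN", ":"}     # a span may not run across these tags
END_TAGS = {"NNP", "CDP"}   # a span must end on one of these tags


def _close(span, spans):
    """Trim trailing tokens that lack an end tag, then store the span."""
    if not span:
        return
    while len(span) > 1 and span[-1][1] not in END_TAGS:
        span.pop()
    if len(span) > 1:  # keep only spans with something after the IN token
        spans.append(" ".join(token for token, _ in span))


def extract_spans(tagged_text):
    spans = []
    span = None  # tokens collected so far, or None when outside a span
    for token in tagged_text.split():
        tag = token.rpartition("/")[2]
        if tag in STOP_TAGS:
            _close(span, spans)
            # An IN token both stops the old span and starts a new one.
            span = [(token, tag)] if tag == START_TAG else None
        elif span is not None:
            span.append((token, tag))
    _close(span, spans)
    return spans


text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")
print(extract_spans(text))
```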

Translate that directly to a regex and you'll come up with a more readable version. (JMHO)

  1. \w+/IN\b(\s[^/]+/[^\s]+) reads the first group after the IN group (example 2)
  2. \w+/IN\b(\s[^/]+/[^\s]+)* repeats that second group (example 3)
  3. \w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)* ignores :/: and \w+/IN groups (example 4)
  4. \w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(NNP|CDP)\b makes sure the last group is NNP or CDP (example 5)
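The four steps can be tried one after another in Python. The sketch below is only an illustration: the helper full_matches and the names steps and results are mine. It uses re.finditer with group(0) because these patterns contain capturing groups, and re.findall would return the group contents instead of the whole match.

```python
import re

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

# The four patterns from the list above, in order.
steps = [
    r'\w+/IN\b(\s[^/]+/[^\s]+)',
    r'\w+/IN\b(\s[^/]+/[^\s]+)*',
    r'\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*',
    r'\w+/IN\b(\s[^:/]+/(?!IN|:)[^\s]+)*\s\w+/(NNP|CDP)\b',
]


def full_matches(pattern, s):
    # group(0) is the whole match, regardless of capturing groups.
    return [m.group(0) for m in re.finditer(pattern, s)]


results = [full_matches(pattern, text) for pattern in steps]
for number, found in enumerate(results, start=1):
    print(number, found)
```

Step 1 should stop after SMP/NNP, step 2 should run to the end of the sentence, step 3 should stop before the first :/:, and step 4 should end on Besok/NNP.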

If we compare this one to the regex @ytomo proposed in the comments of the preceding answer, there does not seem to be much difference. However, the reason I even bothered to answer is that a regex should be readable and follow some logic. Your code is going to be in production tomorrow, and when it breaks, someone has to check it under time pressure.
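One way to keep that logic visible in the code itself is Python's re.VERBOSE flag, which lets the pattern be split over lines with comments. The following is a sketch of the step 4 pattern under two small changes of mine: the groups are made non-capturing so re.findall returns whole matches, and [^\s] is written as the equivalent \S. The name SPAN_PATTERN is hypothetical.

```python
import re

SPAN_PATTERN = re.compile(r"""
    \w+/IN\b                        # the starting word, tagged IN
    (?:\s[^:/]+/(?!IN|:)\S+)*       # further word/tag groups, but no IN or ':' tags
    \s\w+/(?:NNP|CDP)\b             # the last group must be tagged NNP or CDP
""", re.VERBOSE)

text = ("Alun-alun/NNP Jombang/NNP tepatnya/RB Depan/IN SMP/NNP 2/CDP "
        "Jombang/NNP Besok/NNP pagi/NN :/: :/: :/: Minggu/NNP")

spans = SPAN_PATTERN.findall(text)
print(spans)
```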
