
How to extract only certain sections of an input text using regular expressions?

The question is simple, but I am quite lost here.

Input text:

"Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS are/VBP sole/JJ proprietorships/NNS ?/."

Code:

import re

def get_words(pos_sent):
    # Your code goes here
    s = ""
    x = re.findall(r"\b(\w*?)/\w*?\b", pos_sent)
    for i in range(0, len(x)):
        s = s + " " + x[i]
    return s

def get_noun_phrase(pos_sent):
    # Penn Tagset
    # Adjective can be JJ,JJR,JJS
    # Noun can be NN,NNS,NNP,NNPS
    t = get_words(pos_sent)
    regex = r'((\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN))'
    return re.findall(regex, t)

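The first part simply strips out the part-of-speech tags, and the second part is supposed to take that result and use it to find the noun phrases.

It should output:

['all US businesses', 'sole proprietorships']

But instead it outputs an empty list: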

[]
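
The empty list appears to follow from the order of operations: get_noun_phrase searches the output of get_words, and that output no longer contains any tags. The sketch below reruns the two pieces from the code above on the input sentence to show this. The variable names tagged and untagged are introduced here for illustration only.

```python
import re

def get_words(pos_sent):
    # The first function from the question, unchanged
    s = ""
    x = re.findall(r"\b(\w*?)/\w*?\b", pos_sent)
    for i in range(0, len(x)):
        s = s + " " + x[i]
    return s

# Hypothetical variable names; the sentence is the question's input text
tagged = ('Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS '
          'are/VBP sole/JJ proprietorships/NNS ?/.')
untagged = get_words(tagged)

# The noun-phrase pattern from the question, unchanged
regex = r'((\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN))'

print(repr(untagged))
# ' Less than 1 2 of all US businesses are sole proprietorships'

# No "/TAG" suffix is left, so a pattern that requires "/NN" cannot match
print(re.findall(regex, untagged))
# []
```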

Now, I can change it to take the original tagged sentence instead, and then I get:

[('all/DT US/NN', 'all/DT ', '', '', 'US/NN'), ('businesses/NN', '', '', '', 'businesses/NN'), ('sole/JJ proprietorships/NN', '', 'sole/JJ ', '', 'proprietorships/NN')]

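It does contain all the right pieces, but it also includes a lot of other stuff I don't want.

I'm still pretty new to regex, so I'm probably missing something silly.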
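
For context on the tuple output: when a pattern has more than one capturing group, re.findall returns one tuple per match with one element per group. Here the outermost parentheses are group 1, so the first element of each tuple is the whole match. A minimal sketch, assuming the question's pattern and input sentence (the names tagged and whole are hypothetical):

```python
import re

tagged = ('Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS '
          'are/VBP sole/JJ proprietorships/NNS ?/.')
regex = r'((\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN))'

# Keep only group 1 (the outer parentheses) from every tuple
whole = [m[0] for m in re.findall(regex, tagged)]
print(whole)
# ['all/DT US/NN', 'businesses/NN', 'sole/JJ proprietorships/NN']
```

Two things remain visible in this result. The tags are cut off at NN because the pattern does not allow for the NNS and NNP suffixes, and for the same reason "all US" and "businesses" end up as two separate matches rather than one phrase.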

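For the first function, use the following regular expression: \b([0-9A-z\/]*)\/\w*?\b. It ensures that "1/2" stays as 1/2 rather than becoming 1 2. (The range A-z also covers the six ASCII characters between Z and a, such as _ and ^; A-Za-z is the stricter form.) The code below also improves the formatting of the output text: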

import re

string = 'Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS are/VBP sole/JJ proprietorships/NNS ?/.'

def create_relationship(pos_sent):
    # Get all the words individually
    words = re.findall(r'\b([0-9A-z\/]*)\/\w*?\b', pos_sent)
    # ['Less', 'than', '1/2', 'of', 'all', 'US', 'businesses', 'are', 'sole', 'proprietorships']

    # Get all the tags individually
    penn_tag = re.findall(r'\b[0-9A-z\/]*\/(\w*)?\b', pos_sent)
    # ['RBR', 'IN', 'CD', 'IN', 'DT', 'NNP', 'NNS', 'VBP', 'JJ', 'NNS']

    # Create a relationship between the words and penn tag:
    relationship = []
    for i in range(0,len(words)):
        relationship.append([words[i],penn_tag[i]])

    # [['Less', 'RBR'], ['than', 'IN'], ['1/2', 'CD'], ['of', 'IN'], ['all', 'DT'], 
    # ['US', 'NNP'], ['businesses', 'NNS'], ['are', 'VBP'], ['sole', 'JJ'], ['proprietorships', 'NNS']]

    return relationship


def get_words(pos_sent):
    # Pass string into relationship engine
    array = create_relationship(pos_sent)

    # Start with empty string
    s = ''

    # Conduct loop to combine string
    for i in range(0, len(array)):
        # index 0 has the words
        s = s + array[i][0] + ' '

    # Return the sentence
    return s

def get_noun_phrase(pos_sent):
    # Penn Tagset
    # Adjective can be JJ,JJR,JJS
    # Noun can be NN,NNS,NNP,NNPS
    # Noun Phrase must be made of: DT+RB+JJ+NN+PR (http://www.clips.ua.ac.be/pages/mbsp-tags)

    # Pass string into relationship engine
    array = create_relationship(pos_sent)
    bucket = array
    output = []

    # Find the last instance of NN where the next word is not "NN"
    # For example, NNP VBP qualifies. In the case of NN NNP VBP, then
    # the correct instance is NNP. To do this, we need to loop and use
    # a bucket to capture what we need. The bucket shrinks as we pop
    # out the words we have captured

    noun = True

    # Keep doing this until there are no nouns left
    while noun:

        # Reset on every pass: without this, a sentence with no nouns raises
        # an error, and a stale index can be reused once the nouns run out
        last_noun = -1
        for i in range(0, len(bucket)):
            if re.match(r'(NN.*)',bucket[i][1]):
                # Set position of last noun
                last_noun = i

        noun_phrase = []

        # If we don't have noun, it'll stop the while loop
        if last_noun < 0:
            noun = False
        else:
            # go backwards from the point where you found the last noun
            for x in range(last_noun, -1, -1):
                # The penn tag must match any of these conditions
                if re.match(r'(NN.*|DT.*|JJ.*|RB.*|PR.*)',bucket[x][1]):
                    # if there is a match, then let's build the word
                    noun_phrase.append(bucket[x][0])
                    bucket.pop(x)
                else:
                    last_noun = -1
                    break

        # Make sure noun phrase isn't empty
        if noun_phrase:
            # Collect the noun phrase
            output.append(" ".join(reversed(noun_phrase)))

    # Fix the reverse issue
    return output[::-1]

print(get_noun_phrase(string))
# ['all US businesses', 'sole proprietorships']
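
As a shorter alternative to the loop above, the phrases can probably also be matched directly on the tagged sentence, with the tags stripped afterwards. The following is only a sketch, not part of the original answer. It assumes the noun tags NN, NNS, NNP, NNPS and the adjective tags JJ, JJR, JJS listed in the comments, and unlike the loop it ignores RB and PR tags. The names NOUN_PHRASE, strip_tags and noun_phrases are hypothetical.

```python
import re

# Optional determiner, any number of adjectives, then one or more nouns.
# (?:...) groups do not capture, so findall returns whole matches as strings.
# (?<!\S) and (?!\S) make sure a match starts and ends on a token boundary.
NOUN_PHRASE = re.compile(
    r'(?<!\S)'
    r'(?:\S+/DT\s+)?'
    r'(?:\S+/JJ[RS]?\s+)*'
    r'\S+/NN(?:S|P|PS)?'
    r'(?:\s+\S+/NN(?:S|P|PS)?)*'
    r'(?!\S)'
)

def strip_tags(chunk):
    # Split each token at its last slash so a word like "1/2" stays intact
    return ' '.join(token.rsplit('/', 1)[0] for token in chunk.split())

def noun_phrases(pos_sent):
    return [strip_tags(match) for match in NOUN_PHRASE.findall(pos_sent)]

string = ('Less/RBR than/IN 1/2/CD of/IN all/DT US/NNP businesses/NNS '
          'are/VBP sole/JJ proprietorships/NNS ?/.')
print(noun_phrases(string))
# ['all US businesses', 'sole proprietorships']
```

Matching first and stripping tags second keeps the tag information available to the pattern, which is the step that went missing in the question's version.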
