
Functional programming in Python for extracting consecutive capitalized words

I am using Python to extract runs of at least two consecutive capitalized words from a text.

For example, given the sentence

Hollywood is a neighborhood in the central region of Los Angeles.

Then the expected output should be

Los Angeles

I am trying to do this in a functional programming style.

import itertools
import string
import operator

text = "Take any tram, U-bahn or bus which stops at Düsseldorf Hauptbahnhof (HBF). Leave the station via the main exit Konrad Adenauer Platz, you will see trams and buses in front of the station. Walk up Friedrich Ebert Straße turning right into the third street which is the Oststraße."

def fold(it):
    def fold_impl(x, y):
        return itertools.starmap(operator.and_, zip(x, itertools.islice(y, 1, None)))
    return fold_impl(*itertools.tee(it))

def unfold(it):
    def unfold_impl(x, y):
        return itertools.starmap(operator.or_, zip(itertools.chain(x, [False]), itertools.chain([False], y)))
    return unfold_impl(*itertools.tee(it))

def ngrams(it, n):
    return it if n <= 1 else unfold(ngrams(fold(it), n - 1))

def ngrams_idx(it, n):
    return (sorted(x[0] for x in g) for k, g in itertools.groupby(enumerate(ngrams(it, n)), key=lambda x: x[1]) if k)

def booleanize(text_vec):
    return map(lambda x: x[0] in string.ascii_uppercase, text_vec)

def ngrams_phrase(text_vec, n):
    def word(text_vec, idx):
        return ' '.join(map(lambda i: text_vec[i], idx))
    return [word(text_vec, idx) for idx in ngrams_idx(booleanize(text_vec), n)]

But I think I am making this more complicated than necessary. Is there a simpler way to solve this problem with functional programming?

This is not really good practice in Python, but the shortest way is to reduce over the split text:

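from functools import reduce  # reduce is no longer a builtin in Python 3

p = "Hollywood is a neighborhood in the central region of Los Angeles.".split()
# Python 3 lambdas cannot unpack tuple parameters, so the (list, previous word) accumulator is indexed instead
t, _ = reduce(lambda acc, x: (acc[0] + [acc[1], x], x) if acc[1][0].isupper() and x[0].isupper() else (acc[0], x), p, ([], "a"))
print(t)  # ['Los', 'Angeles.']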
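
One caveat with the reduce above: it appends a pair for every adjacent capitalized pair, so a run of three words such as "Konrad Adenauer Platz" comes out with the middle word repeated, and separate phrases end up in one flat list. A minimal sketch that groups whole runs with itertools.groupby instead; the name capitalized_runs and the min_len parameter are made up for illustration, and punctuation is left attached to the words:

```python
from itertools import groupby

def capitalized_runs(words, min_len=2):
    """Return runs of at least min_len consecutive capitalized words, joined by spaces."""
    # groupby splits the word list into maximal runs that share the same key:
    # True for words starting with an uppercase letter, False otherwise.
    # w[:1] (rather than w[0]) keeps the key safe for empty strings.
    runs = (list(group)
            for is_cap, group in groupby(words, key=lambda w: w[:1].isupper())
            if is_cap)
    return [" ".join(run) for run in runs if len(run) >= min_len]

sentence = "Leave the station via the main exit Konrad Adenauer Platz today."
print(capitalized_runs(sentence.split()))  # ['Konrad Adenauer Platz']
```

The single word "Leave" is dropped because its run has length one, while the three-word run is kept as one phrase.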

Have a look at this:

from itertools import takewhile

text = "Take any tram, U-bahn or bus which stops at Düsseldorf Hauptbahnhof (HBF). Leave the station via the main exit Konrad Adenauer Platz, you will see trams and buses in front of the station. Walk up Friedrich Ebert Straße turning right into the third street which is the Oststraße."

def take_upper(text):
    it = iter(text.split())
    return [[i]+list(takewhile(lambda x: x[0].isupper(), it)) for i in it if i[0].isupper()]

def remove_singles(text_uppers):
    return [l for l in text_uppers if len(l) > 1]

remove_singles(take_upper(text))
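
The call above returns lists of words, and punctuation stays attached (for example 'Platz,'). A minimal sketch, assuming the groups should be joined into phrases with punctuation stripped from both ends; to_phrases is a hypothetical helper, not part of the answer, and take_upper is repeated only so the snippet runs on its own:

```python
import string
from itertools import takewhile

def take_upper(text):
    # Same approach as the answer: a shared iterator lets takewhile
    # continue from the word right after each capitalized word.
    it = iter(text.split())
    return [[i] + list(takewhile(lambda x: x[0].isupper(), it))
            for i in it if i[0].isupper()]

def to_phrases(groups):
    # Hypothetical helper: keep groups of two or more words, join them,
    # and strip punctuation from the start and end of each phrase.
    return [" ".join(g).strip(string.punctuation) for g in groups if len(g) > 1]

sample = "Leave via the main exit Konrad Adenauer Platz, then walk up Friedrich Ebert Straße."
print(to_phrases(take_upper(sample)))
# ['Konrad Adenauer Platz', 'Friedrich Ebert Straße']
```

Punctuation inside a run is not treated as a boundary here, so a sentence that ends with a capitalized word and is followed by another capitalized word would still be merged into one phrase.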

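I think the entry call would be ngrams_phrase(text.split(), 2). The OP is looking for every phrase made up of at least two consecutive capitalized words; for example, running the snippet on text yields ["Düsseldorf Hauptbahnhof", "Konrad Adenauer Platz,", "Friedrich Ebert Straße"] (the trailing comma remains because split() does not strip punctuation).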

I would have provided one of the answers above, but they were already posted! So I wrote the following function to let you see the flow.

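def find_proper(text):
    words = text.split()  # split() without arguments also skips repeated whitespace
    proper = []
    # data[i] is True when words[i] starts with an uppercase letter
    data, cnt, pos, phrase = [x[0].isupper() for x in words], 0, 0, ''
    while True:
        if pos == len(words):
            # end of input: flush a pending run of two or more words
            if cnt > 1:
                proper.append(phrase.rstrip())
            break
        if data[pos]:
            cnt += 1
            phrase += words[pos] + ' '
        else:
            # run ended: keep it only if it had at least two words
            if cnt > 1:
                proper.append(phrase.rstrip())
            phrase = ''
            cnt = 0
        pos += 1
    return proper

print(find_proper(text))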
