
Python - Extracting all camel case words in a sequence

I am trying to return a list of every run of consecutive words in a string that each begin with a capital letter (title case).

For example, given the string "John Walker Smith is currently in New York", I would like to return the list below:

['John Walker Smith', 'New York']

My code below works only when there are exactly two title-case words in a row. How do I extend it to pick up more than two title-case words in a sequence?

def get_composite_names(s):
    l = [x for x in s.split()]
    nouns = []
    for i in range(0,len(l)):
        if i > len(l)-2:
            break
        if l[i] == l[i].title() and l[i+1] == l[i+1].title():
                temp = l[i]+' '+l[i+1]
                nouns.append(temp)
    return nouns

Here's one way to accomplish this without regex:

from itertools import groupby

string = "John Walker Smith  is currently in New York"

groups = []

for key, group in groupby(string.split(), lambda x: x[0].isupper()):
    if key:
        groups.append(' '.join(group))

print(groups)
# ['John Walker Smith', 'New York']

In a while loop, when we see a title-cased word, we add it to the list words.

When we encounter a non-title-cased word, we join the collected title-cased words into one string and append it to the result (if words is not empty), then reset the words list.

import re

s = 'abcd John Walker Smith is currently in New York'

def get_title_case_words(s):
  s = s.split()
  r = re.compile(r"[A-Z][a-z]*")

  def is_title_case(word):
    return r.match(word)

  i = 0
  res = []
  words = []
  while i < len(s):
    if is_title_case(s[i]):
      words.append(s[i])
    else:
      if words:
        res.append(' '.join(words))
        words = []

    i += 1

  if words:
    res.append(' '.join(words))

  return res

print(get_title_case_words(s))

This seems to do roughly what you wanted. It preserves punctuation marks and one-letter words. I'm not sure if that's what you wanted, but if it's not, hopefully this code gives you a good starting point.

def get_composite_names(s):
    l = [x for x in s.split()]
    nouns = []
    current_title = None
    for i in range(0, len(l)):
        if l[i][0].isupper():
            if (current_title is not None):
                current_title = " ".join((current_title, l[i]))
            else:
                current_title = l[i]
        else:
            if (current_title is not None):
                nouns.append(current_title)
                current_title = None

    if (current_title is not None):
        nouns.append(current_title)
        current_title = None

    return nouns

print(get_composite_names("Hello World my name is John Doe"))

#returns ['Hello World', 'John Doe']

print(get_composite_names("I live in Halifax."))

#returns ['I', 'Halifax.']

print(get_composite_names("Even old New York was once New Amsterdam"))

#returns ['Even', 'New York', 'New Amsterdam']
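
If the trailing punctuation is not wanted, one possible adjustment is to strip it from each word before grouping. This is only a sketch built on an assumption about what the cleaned output should look like, not part of the answer above. The name get_clean_names is made up for illustration, and the grouping uses itertools.groupby and string.punctuation from the standard library.

```python
from itertools import groupby
import string

def get_clean_names(s):
    # Strip leading/trailing punctuation from every word; drop words that become empty.
    words = [w.strip(string.punctuation) for w in s.split()]
    words = [w for w in words if w]
    # Group consecutive words by whether they start with an uppercase letter,
    # and keep only the capitalized runs.
    return [" ".join(run)
            for is_cap, run in groupby(words, key=lambda w: w[0].isupper())
            if is_cap]

print(get_clean_names("I live in Halifax."))
# ['I', 'Halifax']
```

One caveat with this approach: once the punctuation is stripped, a sentence boundary no longer separates two runs, so capitalized words on either side of a full stop would be joined into one entry.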

It's not perfect (and I'm pretty bad with regex), but I did manage to come up with this pattern, which seems to match what you are looking for:

(?:(?:[A-Z]{1}[a-z]*)(?:$|\s))+

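Given the string "John Walker Smith is currently in New York And he feels Great", it will match "John Walker Smith ", "New York And " (since "And" is capitalized too) and "Great". Note that every match except one at the end of the string includes a trailing whitespace character.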

Someone could probably pick holes in my regex, so feel free to edit this answer with improvements.
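
For reference, here is a minimal sketch of how the pattern above might be applied from Python. It assumes re.finditer to collect the matches and str.strip() to drop the trailing whitespace each match can carry. The names TITLE_RUN and find_title_runs are made up for illustration.

```python
import re

# The pattern from the answer above, unchanged.
TITLE_RUN = re.compile(r"(?:(?:[A-Z]{1}[a-z]*)(?:$|\s))+")

def find_title_runs(text):
    # A match can end with the whitespace that followed its last word, so strip it.
    return [m.group(0).strip() for m in TITLE_RUN.finditer(text)]

print(find_title_runs("John Walker Smith is currently in New York"))
# ['John Walker Smith', 'New York']
```

The pattern is not anchored to the start of a word, so a capital letter in the middle of a word (as in "iPhone") can also begin a match. Putting \b in front of [A-Z] is one possible way to tighten it.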
