Python - Extracting all camel case words in a sequence

Question

I am trying to return a list of every run of consecutive words in a string that begin with a capital letter (title case).

For example, given the string "John Walker Smith is currently in New York", I would like to get back the following list:

['John Walker Smith', 'New York']

My code below works only when exactly two title-case words are adjacent. How do I extend it to pick up runs of more than two title-case words?

def get_composite_names(s):
    l = [x for x in s.split()]
    nouns = []
    for i in range(0,len(l)):
        if i > len(l)-2:
            break
        if l[i] == l[i].title() and l[i+1] == l[i+1].title():
            temp = l[i]+' '+l[i+1]
            nouns.append(temp)
    return nouns

Answer 1

Here's one way to accomplish this without regex:

from itertools import groupby

string = "John Walker Smith is currently in New York"

groups = []

# groupby yields runs of consecutive words that share the same key
for key, group in groupby(string.split(), lambda x: x[0].isupper()):
    if key:
        groups.append(' '.join(group))

print(groups)
# ['John Walker Smith', 'New York']
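The same groupby idea can be wrapped in a reusable function. This is only a sketch: the name title_runs and its is_title parameter are made up for illustration and are not part of the answer above. Passing str.istitle as the predicate checks the whole word instead of just its first letter, so all-caps words such as "NASA" are skipped.

```python
from itertools import groupby

def title_runs(text, is_title=lambda w: w[0].isupper()):
    """Join each run of consecutive words for which is_title(word) is true.

    title_runs and is_title are hypothetical names used for illustration.
    """
    return [" ".join(run)
            for flag, run in groupby(text.split(), key=is_title)
            if flag]

print(title_runs("John Walker Smith is currently in New York"))
# ['John Walker Smith', 'New York']

# str.istitle() is False for all-caps words, so "NASA" is dropped here
print(title_runs("NASA hired John Smith", is_title=str.istitle))
# ['John Smith']
```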

Answer 2

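We walk through the words in a while loop. Each title-cased word is appended to the list words.

When we reach a word that is not title-cased, we join the accumulated words into a single entry of the result (if words is not empty) and reset the words list. After the loop, any words still pending are flushed the same way.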

import re

s = 'abcd John Walker Smith is currently in New York'

def get_title_case_words(s):
  s = s.split()
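  # match() anchors only at the start of the word, so any word that
  # begins with an uppercase ASCII letter counts as title case here
  r = re.compile(r"[A-Z][a-z]*")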

  def is_title_case(word):
    return r.match(word)

  i = 0
  res = []
  words = []
  while i < len(s):
    if is_title_case(s[i]):
      words.append(s[i])
    else:
      if words:
        res.append(' '.join(words))
        words = []

    i += 1

  if words:
    res.append(' '.join(words))

  return res

print(get_title_case_words(s))
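One detail worth knowing about the pattern above: re.match only anchors at the beginning of the word, so words like "JOHN" or "McDonald" also pass the check. If a stricter test is wanted, re.fullmatch requires the entire word to fit the pattern, at the cost of rejecting words with trailing punctuation. The sketch below compares the two; the helper name compare is hypothetical.

```python
import re

pattern = re.compile(r"[A-Z][a-z]*")

def compare(word):
    """Return (prefix match?, whole-word match?) for one word."""
    return (pattern.match(word) is not None,
            pattern.fullmatch(word) is not None)

for word in ["John", "JOHN", "McDonald", "York.", "abcd"]:
    print(word, compare(word))
# John (True, True)
# JOHN (True, False)
# McDonald (True, False)
# York. (True, False)
# abcd (False, False)
```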

Answer 3

This seems to do roughly what you wanted. It preserves punctuation marks and one-letter words; I'm not sure whether that's what you wanted, but if not, this code should be a good starting point for adapting it.

def get_composite_names(s):
    l = [x for x in s.split()]
    nouns = []
    current_title = None
    for i in range(0, len(l)):
        if l[i][0].isupper():
            if (current_title is not None):
                current_title = " ".join((current_title, l[i]))
            else:
                current_title = l[i]
        else:
            if (current_title is not None):
                nouns.append(current_title)
                current_title = None

    if (current_title is not None):
        nouns.append(current_title)
        current_title = None

    return nouns

print(get_composite_names("Hello World my name is John Doe"))

#returns ['Hello World', 'John Doe']

print(get_composite_names("I live in Halifax."))

#returns ['I', 'Halifax.']

print(get_composite_names("Even old New York was once New Amsterdam"))

#returns ['Even', 'New York', 'New Amsterdam']
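If the punctuation and one-letter words are not wanted, one option is to post-process the returned list. The following is a minimal sketch, assuming it is enough to strip punctuation from the ends of each entry and to drop entries shorter than two characters; clean_names and min_length are hypothetical names, and punctuation inside an entry (as in "Smith, John") is left alone.

```python
import string

def clean_names(names, min_length=2):
    """Strip leading/trailing punctuation and drop very short entries."""
    cleaned = (name.strip(string.punctuation) for name in names)
    return [name for name in cleaned if len(name) >= min_length]

print(clean_names(['I', 'Halifax.']))
# ['Halifax']

print(clean_names(['Hello World', 'John Doe']))
# ['Hello World', 'John Doe']
```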

Answer 4

It's not perfect (and I'm pretty bad with regex), but I did manage to come up with this regex, which seems to match what you are looking for:

(?:(?:[A-Z]{1}[a-z]*)(?:$|\s))+

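Given the string "John Walker Smith is currently in New York And he feels Great", it matches "John Walker Smith ", "New York And " and "Great". Because "And" is capitalized it is absorbed into the preceding run, and every match except the last keeps its trailing whitespace.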

Someone could probably attack my regex - feel free to edit this answer with improvements
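To try the pattern from Python, it can be passed to the re module as is. The second pattern below is not from this answer; it is one possible tightening, assuming that word boundaries (\b) are acceptable delimiters, so that matches no longer carry trailing whitespace. Note that "And" is capitalized in the sample string, so both patterns include it in the "New York" run.

```python
import re

text = "John Walker Smith is currently in New York And he feels Great"

# Pattern from the answer: each word must be followed by whitespace or end of string
original = re.compile(r"(?:(?:[A-Z]{1}[a-z]*)(?:$|\s))+")
print([m.group(0) for m in original.finditer(text)])
# ['John Walker Smith ', 'New York And ', 'Great']

# Hypothetical variant: word boundaries instead of consuming the trailing space
tighter = re.compile(r"\b[A-Z][a-z]*(?:\s+[A-Z][a-z]*)*\b")
print(tighter.findall(text))
# ['John Walker Smith', 'New York And', 'Great']
```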

4 answers

solution1
6 ACCEPTED 2018-03-25 17:42:34

solution2
0 2018-03-25 17:38:31

solution3
0 2018-03-25 17:43:50

solution4
0 2018-03-25 18:19:19
