I am trying to return a list of every run of consecutive words in a string that begin with a capital letter (title case).
For example, given the string "John Walker Smith is currently in New York",
I would like to return the following list:
['John Walker Smith', 'New York']
My code below works only when there are two title-case words in a row. How do I extend it to pick up more than two title-case words in a sequence?
def get_composite_names(s):
    l = [x for x in s.split()]
    nouns = []
    for i in range(0,len(l)):
        if i > len(l)-2:
            break
        if l[i] == l[i].title() and l[i+1] == l[i+1].title():
            temp = l[i]+' '+l[i+1]
            nouns.append(temp)
    return nouns
Here's one way to accomplish this without regex:
from itertools import groupby
string = "John Walker Smith is currently in New York"
groups = []
# groupby collects consecutive words that share the same key (capitalized or not)
for key, group in groupby(string.split(), lambda x: x[0].isupper()):
    if key:
        groups.append(' '.join(group))
print(groups)
# ['John Walker Smith', 'New York']
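If single capitalized words such as a sentence-initial "Even" should be left out, the same groupby idea can be wrapped in a function with a minimum run length. This is only a sketch: the `min_words` parameter is my own addition for illustration and is not part of the original question.

```python
from itertools import groupby

def get_composite_names(s, min_words=1):
    """Return runs of consecutive capitalized words found in s.

    min_words is a hypothetical knob: runs shorter than this are dropped.
    """
    names = []
    # str.split() never yields empty strings, so w[0] is always safe here
    for is_capitalized, group in groupby(s.split(), key=lambda w: w[0].isupper()):
        words = list(group)
        if is_capitalized and len(words) >= min_words:
            names.append(' '.join(words))
    return names

print(get_composite_names("John Walker Smith is currently in New York"))
# ['John Walker Smith', 'New York']
print(get_composite_names("Even old New York was once New Amsterdam", min_words=2))
# ['New York', 'New Amsterdam']
```

With the default `min_words=1` it behaves like the snippet above; with `min_words=2` only multi-word names are returned.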
This uses a while loop: when we see a title-cased word, we append it to the list words.
When we encounter a non-title-cased word, we join the accumulated title-cased words into one entry of the result (if words is not empty) and reset the words list.
import re
s = 'abcd John Walker Smith is currently in New York'
def get_title_case_words(s):
    s = s.split()
    r = re.compile(r"[A-Z][a-z]*")
    def is_title_case(word):
        return r.match(word)
    i = 0
    res = []
    words = []
    while i < len(s):
        if is_title_case(s[i]):
            words.append(s[i])
        else:
            if words:
                res.append(' '.join(words))
                words = []
        i += 1
    if words:
        res.append(' '.join(words))
    return res
print(get_title_case_words(s))
This seems to do roughly what you wanted. It preserves punctuation marks and one-letter words. I'm not sure whether that's what you wanted, but if it isn't, hopefully this code gives you a good starting point.
def get_composite_names(s):
    l = [x for x in s.split()]
    nouns = []
    current_title = None
    for i in range(0, len(l)):
        if l[i][0].isupper():
            if (current_title is not None):
                current_title = " ".join((current_title, l[i]))
            else:
                current_title = l[i]
        else:
            if (current_title is not None):
                nouns.append(current_title)
                current_title = None
    if (current_title is not None):
        nouns.append(current_title)
        current_title = None
    return nouns
print(get_composite_names("Hello World my name is John Doe"))
#returns ['Hello World', 'John Doe']
print(get_composite_names("I live in Halifax."))
#returns ['I', 'Halifax.']
print(get_composite_names("Even old New York was once New Amsterdam"))
#returns ['Even', 'New York', 'New Amsterdam']
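If the preserved punctuation (as in 'Halifax.') is not wanted, one possible variant strips leading and trailing punctuation from each word before testing it. This is a rough sketch under my own assumptions: the name `get_composite_names_clean` is made up for illustration, and punctuation between two capitalized words (for example a comma) does not end the run here, which may or may not be what you need.

```python
import string

def get_composite_names_clean(s):
    """Like get_composite_names, but without surrounding punctuation (hypothetical variant)."""
    names = []
    current = []  # capitalized words collected for the run in progress
    for word in s.split():
        # Drop punctuation on both ends, e.g. 'Halifax.' -> 'Halifax'
        core = word.strip(string.punctuation)
        if core and core[0].isupper():
            current.append(core)
        elif current:
            names.append(' '.join(current))
            current = []
    if current:  # flush a run that reaches the end of the string
        names.append(' '.join(current))
    return names

print(get_composite_names_clean("I live in Halifax."))
# ['I', 'Halifax']
print(get_composite_names_clean("Hello World, my name is John Doe!"))
# ['Hello World', 'John Doe']
```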
It's not perfect (and I'm pretty bad with regex), but I did manage to put together this regex, which seems to match what you are looking for:
(?:(?:[A-Z]{1}[a-z]*)(?:$|\s))+
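Given the string "John Walker Smith is currently in New York And he feels Great", it will match "John Walker Smith ", "New York And " and "Great". Note that each match except the last includes the trailing whitespace, and that the capitalized "And" is picked up as part of the second run.
Someone could probably improve my regex, so feel free to edit this answer with improvements.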
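For completeness, here is roughly how the pattern above could be applied from Python with re.findall. Treat it as a sketch: the function name `find_capitalized_runs` is just something I made up, and the `strip()` call is there only to remove the trailing whitespace that the pattern captures.

```python
import re

# Same pattern as above; every group is non-capturing, so findall
# returns the full text of each match.
PATTERN = re.compile(r"(?:(?:[A-Z]{1}[a-z]*)(?:$|\s))+")

def find_capitalized_runs(text):
    """Return each run of capitalized words, without trailing whitespace."""
    return [match.strip() for match in PATTERN.findall(text)]

print(find_capitalized_runs(
    "John Walker Smith is currently in New York And he feels Great"))
# ['John Walker Smith', 'New York And', 'Great']
print(find_capitalized_runs("John Walker Smith is currently in New York"))
# ['John Walker Smith', 'New York']
```

Because the pattern only allows lowercase letters after the initial capital, words with punctuation attached or with internal capitals will not match cleanly, so it is best suited to plain text like the example.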