
How to find all words in a string that begin with an uppercase letter, for multiple strings in a list

I have a list of strings, each about 10 sentences long. I want to find every word in each string that begins with a capital letter, preferably excluding the first word of each sentence. I am using re.findall to do this. When I set the string manually I have no trouble doing this, but when I use a for loop to go over each entry in my list I get a different output.

for i in list_3:
    string = i
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)

output:

['I', 'I', 'As', 'I', 'University', 'Illinois', 'It', 'To', 'It', 'I', 'One', 'Manu', 'I', 'I', 'Once', 'And', 'Through', 'I', 'I', 'Most', 'Its', 'The', 'I', 'That', 'I', 'I', 'I', 'I', 'I', 'I']

When I manually input the string value

txt = 0
for i in list_3:
    string = list_3[txt]
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)

output:

['Remember', 'The', 'Common', 'App', 'Do', 'Your', 'Often', 'We', 'Monica', 'Lannom', 'Co', 'Founder', 'Campus', 'Ventures', 'One', 'Break', 'Campus', 'Ventures', 'Universities', 'Undermatching', 'Stanford', 'Yale', 'Undermatching', 'What', 'A', 'Yale', 'Lannom', 'There', 'During', 'Some', 'The', 'Lannom', 'That', 'It', 'Lannom', 'Institutions', 'University', 'Chicago', 'Boston', 'College', 'These', 'Students', 'If', 'Lannom', 'Recruiting', 'Elite', 'Campus', 'Ventures', 'Understanding', 'Campus', 'Ventures', 'The', 'For', 'Lannom', 'What', 'I', 'Wish', 'I', 'Knew', 'Before', 'Starting', 'Company', 'I', 'Even', 'I', 'Lannom', 'The', 'There']

But I can't seem to write a for loop that correctly prints the output for each of the 5 items in the list. Any ideas?

The easiest way to do that is to write a for loop that checks whether the first letter of each element of the list is capitalized. If it is, the element is appended to the output list.

output = []
for i in list_3:
    if i[0] == i[0].upper():
        output.append(i)
print(output)

We can also use a list comprehension to do this in one line. It performs the same check: whether the first letter of each element is uppercase.

output = [x for x in list_3 if x[0].upper() == x[0]]
print(output)

EDIT

You want to treat each block of sentences as one element of a list, so here is the solution. We iterate over list_3, then iterate over every word in the element by using the split() function. We then check whether the word is capitalized. If it is, it is added to output.

list_3 = ["Remember your college application process? The tedious Common App applications, hours upon hours of research, ACT/SAT, FAFSA, visiting schools, etc. Do you remember who helped you through this process? Your family and guidance counselors perhaps, maybe your peers or you may have received little to no help"]
output = []
for i in list_3:
    for j in i.split():
        if j[0].isupper():
            output.append(j)
print(output)
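
One caveat with split() is that punctuation stays attached to the words, so a token such as "FAFSA," keeps its trailing comma. A minimal sketch of one way to handle this, assuming it is acceptable to strip punctuation from both ends of each token with string.punctuation (the sample text is a shortened, hypothetical stand-in for the real list_3):

```python
import string

# Hypothetical, shortened stand-in for the asker's list_3.
list_3 = ["The tedious Common App applications, ACT/SAT, FAFSA, etc. Do you remember?"]

output = []
for text in list_3:
    for token in text.split():
        # Remove leading/trailing punctuation such as ',', '.' or '?'.
        word = token.strip(string.punctuation)
        # Skip tokens that were pure punctuation and are now empty.
        if word and word[0].isupper():
            output.append(word)
print(output)
```

Punctuation inside a token (the slash in "ACT/SAT") is left alone, and sentence-initial words such as "The" and "Do" are still included, because this approach does not look at sentence boundaries.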

Assuming sentences are separated by one space, you could use re.findall with the following regular expression.

r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'


Python's regex engine performs the following operations.

(?m)         : set multiline mode so that ^ and $ match the beginning
               and the end of a line
(?<!^)       : negative lookbehind asserts current location is not
               at the beginning of a line
(?<![.?!] )  : negative lookbehind asserts current location is not
               preceded by '.', '?' or '!', followed by a space
[A-Z]        : match an uppercase letter
[A-Za-z]*    : match 0+ letters

If sentences can be separated by one or two spaces, insert a second negative lookbehind with two spaces, (?<![.?!]  ), after (?<![.?!] ).

If the PyPI regex module were used, one could use the variable-length lookbehind (?<![.?!] +) instead.
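
As a rough check of the single-space pattern with the standard re module, the sketch below runs it over a small list. The sample sentences are made up for illustration and stand in for the asker's list_3.

```python
import re

# Pattern from the answer above: a capitalized word that is neither at the
# start of a line nor directly after '.', '?' or '!' plus one space.
PATTERN = re.compile(r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*')

# Hypothetical sample data standing in for the asker's list_3.
samples = [
    "Remember the Common App? Do you remember Yale?",
    "We met Monica at Stanford. It was fun.",
]

# One list of matches per input string.
results = [PATTERN.findall(text) for text in samples]
print(results)
```

The sentence-initial words "Remember", "Do", "We" and "It" are skipped, while "Common", "App", "Yale", "Monica" and "Stanford" are kept.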

As I understand it, you have a list like this:

list_3 = [
  'First sentence. Another Sentence',
  'And yet one another. Sentence',
]

You are iterating over the list, but every iteration overwrites the test variable, and print is called only once, after the loop has finished, so you see only the result for the last string. You either have to accumulate the results in an additional variable or print them right away on every iteration:

acc = []
for item in list_3:
  acc.extend(re.findall(regexp, item))
print(acc)

or

for item in list_3:
  print(re.findall(regexp, item))
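
If the results should stay grouped by source string rather than merged into one flat list, a list of lists is one option. This is a minimal sketch that reuses the pattern from the question; the name per_string is introduced here for illustration only.

```python
import re

# The asker's original pattern: any word that starts with an uppercase letter.
PATTERN = re.compile(r"\b[A-Z][a-z]*\b")

# Small stand-in for the asker's list_3.
list_3 = [
    "First sentence. Another Sentence",
    "And yet one another. Sentence",
]

# per_string[i] holds the matches found in list_3[i].
per_string = [PATTERN.findall(text) for text in list_3]
for index, words in enumerate(per_string):
    print(index, words)
```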

As for a regexp that ignores the first word in each sentence, you can use

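re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', s)
  • (?<!\A) - not the beginning of the string
  • (?<!\.) - not the first word after a dot
  • \s+ - one or more whitespace characters before the word
  • [A-Z]\w+ - an uppercase letter followed by at least one more word character, so one-letter words such as 'I' are not matched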

You'll receive words prefixed by whitespace, so strip them. Here is the final example:

acc = []
for item in list_3:
  words = [w.strip() for w in re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', item)]
  acc.extend(words)
print(acc)
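
A possible variation on the loop above, offered as a sketch rather than as the answer's own method: putting the word in a capture group makes re.findall return it without the leading whitespace, so strip() is not needed. The character class [.?!] is an added assumption so that '?' and '!' are also treated as sentence endings.

```python
import re

# Variation on the pattern above (assumption: '?' and '!' also end a sentence).
# The capture group makes findall return only the word, not the whitespace.
PATTERN = re.compile(r'(?<!\A)(?<![.?!])\s+([A-Z]\w+)')

list_3 = [
    'First sentence. Another Sentence',
    'And yet one another. Sentence',
]

acc = []
for item in list_3:
    acc.extend(PATTERN.findall(item))
print(acc)
```

Only the "Sentence" that follows "Another" is returned here; the other capitalized words either start the string or directly follow a full stop.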

As I really like regexes, try this one:

#!/bin/python3
import re

PATTERN = re.compile(r'[A-Z][A-Za-z0-9]*')

all_sentences = [
    "My House! is small",
    "Does Annie like Cats???"
]

def flat_list(sentences):
    for sentence in sentences:
        yield from PATTERN.findall(sentence)

upper_words = list(flat_list(all_sentences))
print(upper_words)

# Result: ['My', 'House', 'Does', 'Annie', 'Cats']

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 