
How do I remove the substrings starting with capital letters in a Python string?

I have this string, which is a mix of a title and a regular sentence (there is no separator between the two).

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."

The title actually ends at the word Vaccines; "Before the pandemic" begins another sentence that is completely separate from the title.

How do I remove the substring up to and including the word Vaccines? My idea was to remove everything from "Read more:" through the following words that start with a capital letter, stopping one word before the end of that run (that is, before the word Before). But I don't know how to handle a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.

I know there is a title() method that converts a string to title case in Python, but is there any function that can detect whether a substring is a title?

I have tried the following using a regular expression.

import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res

But it just replaced every capital letter (and the whitespace around it) with a space, instead of removing the title.

You can match the title by matching a sequence of capitalized words and words that can stay non-capitalized in titles.

^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])

See the regex demo.

Details:

  • ^ - start of string
  • (?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
  • \s* - zero or more whitespaces
  • (?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of
    • (?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word that may contain any non-whitespace chars, or one of the words that can stay non-capitalized in an English title
    • \s+ - one or more whitespaces
  • (?=[A-Z]) - a positive lookahead requiring that the next char is an uppercase letter, so the match stops right before a capitalized word.
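
As a minimal sketch of how the pattern above could be applied in Python, the snippet below compiles it with the built-in re module and removes the match with re.sub. The names TITLE_PREFIX and strip_title are made up for illustration, and the duplicated from alternative is listed only once, which does not change what the pattern matches.

```python
import re

# The answer's pattern, split across lines for readability.
# TITLE_PREFIX and strip_title are illustrative names, not part of any library.
TITLE_PREFIX = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"                       # optional "Read more:" label
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?"   # capitalized word, or a word
    r"|from|for|and|but|n?or|yet|[st]o|around|by"     # that may stay lowercase in
    r"|after|along|of)\s+)*"                          # an English title
    r"(?=[A-Z])"                                      # stop right before a capital
)

def strip_title(text: str) -> str:
    """Remove the leading label and title-like words, if any."""
    return TITLE_PREFIX.sub("", text, count=1)

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
result = strip_title(text)
print(result)
```

Because of the lookahead, the regex engine backtracks from "Before the " to the last position that is followed by a capital letter, so Before itself is also treated as part of the title if the sentence after it starts with another capitalized word.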

NOTE: You mentioned your language is not English, so

  1. You need to find the list of words in your language that may stay non-capitalized in a title and use them instead of the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of
  2. You might want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match zero or more Unicode letters, BUT make sure you use the PyPI regex library then, as the Python built-in re does not support the Unicode category classes.
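
If installing the regex library is not an option, one possible alternative (not the \p{Lu} approach itself, but a word-by-word scan with the same intent) is to rely on str.isupper(), which is Unicode-aware in Python 3. The sketch below is only an illustration: MINOR_WORDS is a hypothetical word list that you would replace with your own language's words, strip_title_unicode is a made-up name, and the second sample sentence is invented for the demo.

```python
import re

# Hypothetical list of words that may stay lowercase in a title;
# replace it with the list for your own language.
MINOR_WORDS = frozenset(
    {"the", "a", "an", "in", "on", "at", "to", "of", "for", "and", "or", "by", "with", "from"}
)

def strip_title_unicode(text, label="Read more:", minor_words=MINOR_WORDS):
    """Drop a leading label and title-like words using Unicode-aware checks."""
    body = text.lstrip()
    if body.startswith(label):
        body = body[len(label):]
    words = list(re.finditer(r"\S+", body))
    # Count the leading words that look like part of a title.
    k = 0
    while k < len(words) and (
        words[k].group()[0].isupper() or words[k].group() in minor_words
    ):
        k += 1
    # Back up until the remaining text starts with an uppercase letter,
    # which mirrors the (?=[A-Z]) lookahead in the regex.
    j = min(k, len(words) - 1)
    while j >= 0 and not words[j].group()[0].isupper():
        j -= 1
    if j < 0:
        return text  # nothing title-like found; leave the input unchanged
    return body[words[j].start():]

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
print(strip_title_unicode(text))

# Invented sample with a non-ASCII capital letter.
print(strip_title_unicode("Read more: Österreich to Get Moderna Vaccines Before the pandemic began"))
```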

Why don't you just use slicing?

title = text[:44]
print(title)

Read more: Indonesia to Get Moderna Vaccines
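
Slicing only helps when the title, or its length, is already known. A small sketch under that assumption, where the hypothetical helper drop_known_title computes the cut point from the title instead of hard-coding 44 and returns the text after the title:

```python
def drop_known_title(text: str, title: str) -> str:
    """Return text without a known leading title (hypothetical helper)."""
    if text.startswith(title):
        # Slice off the title and the whitespace that follows it.
        return text[len(title):].lstrip()
    return text

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
title = "Read more: Indonesia to Get Moderna Vaccines"  # assumed to be known in advance

rest = drop_known_title(text, title)
print(len(title))
print(rest)
```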

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.

 