Capitalize the first word of each sentence in a paragraph

I want to capitalize the first word after each period in a paragraph (a str) full of sentences. The problem is that all the characters are lowercase.

I tried something like this:

text = "here a long. paragraph full of sentences. what in this case does not work. i am lost" 
re.sub(r'(\b\. )([a-zA-z])', r'\1' (r'\2').upper(), text) 

I expect something like this:

"Here a long. Paragraph full of sentences. What in this case does not work. I am lost."

You can use re.sub with a lambda:

import re
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost" 
result = re.sub(r'(?<=^)\w|(?<=\.\s)\w', lambda x: x.group().upper(), text)

Output:

'Here a long. Paragraph full of sentences. What in this case does not work. I am lost'

Regex Explanation:

(?<=^)\w : matches a word character (letter, digit, or underscore) that sits at the start of the string. Without re.MULTILINE, ^ matches only at the very beginning of the text.

(?<=\.\s)\w : matches a word character preceded by a period and exactly one whitespace character.
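To see what each alternative contributes, the two lookbehinds can be tried separately. This is a small sketch; the variable names are only illustrative and are not part of the answer above. It also shows that the fixed-width lookbehind does not fire when a period is followed by two spaces.

```python
import re

text = "here a long. paragraph full of sentences. i am lost"

# First alternative alone: the word character at the very start of the string.
start_matches = re.findall(r'(?<=^)\w', text)

# Second alternative alone: word characters right after a period and one whitespace.
after_period = re.findall(r'(?<=\.\s)\w', text)

# With two spaces after the period, the two characters before "t" are both
# spaces, so the lookbehind (period, then one whitespace) does not match.
double_space = re.findall(r'(?<=\.\s)\w', "one.  two")

print(start_matches)   # ['h']
print(after_period)    # ['p', 'i']
print(double_space)    # []
```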

You can use the regex ((?:^|\.\s)\s*)([a-z]), which does not depend on lookarounds. Lookarounds are not available in every regex dialect, so this pattern is simpler and more widely supported. For example, lookbehind was added to JavaScript in ECMAScript 2018 but is not yet widely supported. Group 1 captures either zero or more whitespace characters at the start of the string, or a literal dot followed by one or more whitespace characters. Group 2 captures a lowercase letter with ([a-z]). The match is then replaced with the text captured in group 1 followed by the group 2 letter, converted to uppercase by a lambda expression. The code below writes the same pattern in the equivalent form (^\s*|\.\s+)([a-z]). Check this Python code:

import re

arr = ['here a long.   paragraph full of sentences. what in this case does not work. i am lost',
       '   this para contains more than one space after period and also has unneeded space at the start of string.   here a long.   paragraph full of sentences.  what in this case does not work. i am lost']

for s in arr:
    print(re.sub(r'(^\s*|\.\s+)([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))

Output:

Here a long.   Paragraph full of sentences. What in this case does not work. I am lost
   This para contains more than one space after period and also has unneeded space at the start of string.   Here a long.   Paragraph full of sentences.  What in this case does not work. I am lost

If you also want to get rid of the extra whitespace, so that leading whitespace is dropped and only one whitespace character remains after each period, move \s* out of group 1 and use the regex ((?:^|\.\s))\s*([a-z]). The updated Python code:

import re

arr = ['here a long.   paragraph full of sentences. what in this case does not work. i am lost',
       '   this para contains more than one space after period and also has unneeded space at the start of string.   here a long.   paragraph full of sentences.  what in this case does not work. i am lost']

for s in arr:
    print(re.sub(r'((?:^|\.\s))\s*([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))

This gives the following output, where the extra whitespace is reduced to a single space, which is often what you want:

Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
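The two variants above could be folded into one helper. The following is only a sketch: the function name capitalize_sentences and its collapse_whitespace parameter are hypothetical and do not come from the answer, and like the patterns above it only handles lowercase ASCII letters and sentences that end with a period.

```python
import re

def capitalize_sentences(text, collapse_whitespace=False):
    """Uppercase the first lowercase ASCII letter of each sentence.

    Hypothetical helper: sentences are assumed to end with a period
    followed by whitespace, as in the patterns from the answer.
    """
    if collapse_whitespace:
        # \s* sits outside group 1, so the extra whitespace is dropped.
        pattern = r'((?:^|\.\s))\s*([a-z])'
    else:
        # All whitespace is captured in group 1 and written back unchanged.
        pattern = r'(^\s*|\.\s+)([a-z])'
    return re.sub(pattern, lambda m: m.group(1) + m.group(2).upper(), text)

sample = "  here a long.   paragraph.  i am lost"
kept = capitalize_sentences(sample)
collapsed = capitalize_sentences(sample, collapse_whitespace=True)
print(repr(kept))       # '  Here a long.   Paragraph.  I am lost'
print(repr(collapsed))  # 'Here a long. Paragraph. I am lost'
```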

Also, if this were done with a PCRE-based regex engine whose replacement syntax supports \U, you would not need a lambda function at all: you could write the case conversion in the replacement string itself and replace each match with \1\U\2.
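For comparison, here is a short sketch of the same replacement in Python. Python's re module has no \U case-conversion escape in replacement templates; since Python 3.7 an unknown letter escape in the template raises re.error, which is why the answers above pass a callable instead. The variable names are illustrative.

```python
import re

text = "here a long. paragraph full of sentences."
pattern = r'(^\s*|\.\s+)([a-z])'

# Python 3.7+ rejects the unknown escape \U in a replacement template.
try:
    re.sub(pattern, r'\1\U\2', text)
    template_supported = True
except re.error:
    template_supported = False

# The callable does the same job: group 1 unchanged, group 2 uppercased.
result = re.sub(pattern, lambda m: m.group(1) + m.group(2).upper(), text)

print(template_supported)  # False
print(result)              # Here a long. Paragraph full of sentences.
```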

