
Split sentences into separate strings where sentences start with capital letter

Basically, I want to break up the following string into two separate strings, such that:

Input: 'LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second. MOBERG struck out swinging (2-2 BSSFBS).'

Output: ['LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second.', 'MOBERG struck out swinging (2-2 BSSFBS).']

New sentences in my case will always start with a capital letter (i.e. the name of the player). Here is my attempt at code to do this:

import re

string = 'LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second. MOBERG struck out swinging (2-2 BSSFBS).'
x = re.findall("[A-Z].*?[\.!?]", string, re.DOTALL)
print(x)

My code currently outputs the following, and the first string in the list is wrong:

['LIPCIUS, A.', 'FBF); AMMONS advanced to second.', 'MOBERG struck out swinging (2-2 BSSFBS).']
It should be ['LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second.', 'MOBERG struck out swinging (2-2 BSSFBS).']

The regex below should work for you. After the sentence terminator it allows one optional whitespace character and then adds a lookahead assertion requiring either a capital letter or the end of the string ($), which avoids stopping at initials such as A. and B.

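import re

string = 'LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second. MOBERG struck out swinging (2-2 BSSFBS).'
# A terminator only ends a sentence if a capital letter or the end of the string follows
x = re.findall(r"[A-Z].*?[\.!?]\s?(?=[A-Z]|$)", string, re.DOTALL)
print(x)
# ['LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second. ', 'MOBERG struck out swinging (2-2 BSSFBS).']

import re

s = 'LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second. MOBERG struck out swinging (2-2 BSSFBS).'
# Split on '. ' only when an all-caps word follows
l = re.split(r'[.][ ](?=[A-Z]+\b)', s)
print(l)
# ['LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second', 'MOBERG struck out swinging (2-2 BSSFBS).']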

The only difference is that the dot at each split point is consumed, so every item except the last loses its trailing dot, but I guess that won't bother you.
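
If the missing dot or the trailing space does matter, one possible variation is to split on the whitespace itself and move the terminator into a lookbehind, so that nothing but the whitespace is consumed. This is only a sketch under two assumptions that are not stated in the question: every new sentence starts with a name of at least two capital letters, and an initial such as "A." is never immediately followed by such a name. The helper name split_sentences is made up for illustration.

```python
import re

def split_sentences(text):
    # Split on whitespace that follows '.', '!' or '?' (lookbehind, not consumed)
    # and precedes a word of two or more capitals (lookahead, not consumed).
    # Assumption: player names are always written in capitals.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z]{2,}\b)', text)

s = 'LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second. MOBERG struck out swinging (2-2 BSSFBS).'
print(split_sentences(s))
# ['LIPCIUS, A. grounded out to 3b (1-2 FBF); AMMONS advanced to second.', 'MOBERG struck out swinging (2-2 BSSFBS).']
```

Note that a string like 'A. SMITH singled.' would still be split after the initial, so this only works when initials come after the surname as in the sample data.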
