python split string on the all caps words

I have a series of text files formatted as follows:

text = 'COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20'

I eventually need to get these into a pandas dataframe where COMPANY NAME, TYPE OF EVENT, NOTIFIED DATE are the column headers and the text in between fills up the rows. A first step is just to figure out how to split the text wherever there is a ":" preceded by one or more all caps words. So, some output like:

res = ['COMPANY NAME', 'Ruff name of company', 'TYPE OF EVENT', 'PARTY', etc]

I am very new to regex and cannot figure out how to get this match to work. I tried the following:

re.findall('[A-Z]+[A-Z]+[A-Z]', text)

I recognize I'm not even close. I have also looked at lots of other similar questions and failed to adapt them to my use case.

Other posts:

Capture all consecutive all-caps words with regex in python?

Python Regex catch multi caps words and adjacent words

Find the line with all caps in Regex Python

Any help would be appreciated, thanks!

In your data, the value that follows the all-uppercase words and the colon can itself start with another uppercase char or a digit.

One option is to use re.findall and get the values using 2 capturing groups. This will return tuples of the 2 group values.

You might use:

\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))

The pattern will match

  • \b Word boundary
  • ( Capture group 1
    • [A-Z]+ Match 1+ uppercase chars
    • (?:[^\S\r\n]+[A-Z]+)* Optionally repeat 1+ whitespace chars and 1+ uppercase chars
  • ): Close group 1 and match the colon
  • [^\S\r\n]+ Match 1+ whitespace chars without a newline
  • ( Capture group 2
    • [A-Z0-9] Match an uppercase char A-Z or a digit
    • .*? Match any char except a newline, as few times as possible
    • (?= [A-Z]|$) Positive lookahead, assert that what is to the right is either a space followed by an uppercase char A-Z, or the end of the string (use \Z if there can not be a following newline)
  • ) Close group 2
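
As an illustration of the breakdown above, the same pattern can be written in verbose mode with one comment per piece. This is only a sketch: the names pattern and sample are made up for the example, and because re.VERBOSE ignores unescaped whitespace, the literal space in the lookahead is written as [ ] here.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# Unescaped whitespace is ignored in this mode, so the literal space
# inside the lookahead is spelled "[ ]".
pattern = re.compile(r"""
    \b                          # word boundary
    (                           # group 1: the all-caps key
        [A-Z]+                  #   1+ uppercase chars
        (?:[^\S\r\n]+[A-Z]+)*   #   optionally more whitespace-separated uppercase words
    )
    :                           # the colon after the key
    [^\S\r\n]+                  # 1+ whitespace chars, no newline
    (                           # group 2: the value
        [A-Z0-9]                #   starts with an uppercase char or a digit
        .*?                     #   any chars except newline, as few as possible
        (?=[ ][A-Z]|$)          #   stop before " X" (X uppercase) or at the end
    )
""", re.VERBOSE)

# Shortened sample (hypothetical, cut down from the question's text)
sample = "TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20"
matches = pattern.findall(sample)
print(matches)
```

Note that the lookahead stops the value at the first space followed by an uppercase char, so a value that itself contains a capitalized second word would be cut short.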

Regex demo | Python demo

For example:

import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"
print(re.findall(regex, test_str))

Output

[('COMPANY NAME', 'Ruff name of company'), ('TYPE OF EVENT', 'Party'), ('NOTIFIED DATE', '1/27/20  '), ('COMPANY NAME', 'Company2/CPT'), ('TYPE OF EVENT', 'Fire'), ('NOTIFIED DATE', '1/31/20')]
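
Since the end goal in the question is a dataframe, one possible next step is to group these tuples into one dict per record. The sketch below assumes that a repeated key marks the start of a new record, which holds for the sample text but may not for every file; the names rows and current are just for illustration. It also strips the trailing spaces seen in '1/27/20  '.

```python
import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"

rows = []      # one dict per record
current = {}   # the record being filled
for key, value in re.findall(regex, test_str):
    # Assumption: seeing a key again means a new record begins
    if key in current:
        rows.append(current)
        current = {}
    current[key] = value.strip()  # drop trailing whitespace from the value
if current:
    rows.append(current)

print(rows)
# If pandas is installed, pandas.DataFrame(rows) should give one column per key.
```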

To get all items in a list as in your question, you might also use re.finditer and append the group values to a list. See another Python demo
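
A minimal sketch of that re.finditer approach, using a shortened version of the sample text (the variable names are only for illustration, and the .strip() call is an addition to remove trailing spaces from the values):

```python
import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20"

res = []
for m in re.finditer(regex, test_str):
    res.append(m.group(1))          # the all-caps key
    res.append(m.group(2).strip())  # the value, without trailing whitespace

print(res)
```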
