How to make a Python regex match the first occurrence of the expression - it extends to the second now

I have the following text:

Signatures 35 2 Table of Contents Part I. Financial Information Item 1. Financial  
 Statements Noble Midstream Partners LP Consolidated Statements of Operations and    Comprehensive Income (in thousands except per unit amounts unaudited) Three   Months Ended March 31 2018 2017 Revenues Midstream Services - Affiliate 64263
 50314 Midstream Services -  Net Income Attributable
 to Limited Partners Per Limited Partner Unit - Basic and Diluted Common Units
 0.97 0.77 Subordinated Units 0.97 0.77 Weighted Average Limited Partner Units
 Outstanding - Basic Common Units 23683 15903 Subordinated Units 15903 15903
 Weighted Average Limited Partner Units Outstanding - Diluted Common Units 23698
 15909 Subordinated Units 15903 15903 The accompanying notes are an integral part
 of these financial statements. 3 Table of Contents Noble Midstream Partners LP
 758 The accompanying notes are an integral part of these financial statements.
 4 Table of Contents Noble Midstream Partners LP Consolidated Statements of Cash
 Flows (in thousands unaudited) Three Months Ended March 31 2018 2017 Cash Flows
 From Operating Activities Net Income 39136 34520 Adjustments to Reconcile Net
 Income to Net Cash Provided by Operating Activities Depreciation and 
Amortization 11329 2449 Dividends from Equity Method Investee Net of Income 393 0
 Unit-Based Compensation 321 127 Other Adjustments for Noncash Items Included in
 Income 167 95 Changes in Operating Assets and Liabilities Net of Assets Acquired
 and Liabilities Assumed Increase in Accounts Receivable (2520) (3322) Decrease 
in Accounts Payable (836) (2518) Other Operating Assets and Liabilities Net
 (2387) 874 Net Cash Provided by Operating Activities 45603 32225 Cash Flows 
From Investing Activities Additions to Property Plant and Equipment (161509)
 (32298) Black Diamond Acquisition Net of Cash Acquired (650131) 0 Additions to
 Investments 0 (414) Distributions from Cost Method Investee 419 123 Net Cash 
Used in Investing Activities (811221) (32589) Cash Flows From Financing 
Activities Distributions to Noncontrolling Interests (3007) (11267) Contributions
 from Noncontrolling Interests 409865 7087 Borrowings Under Revolving Credit 
Facility 405000 0 Repayment of Revolving Credit Facility (55000) 0 Distributions
 to Unitholders (19860) (13782) Revolving Credit Facility Amendment Fees and 
Other (1987) (236) Net Cash Provided by (Used in) Financing Activities 735011 
(18198) Decrease in Cash Cash Equivalents and Restricted Cash (30607) (18562)
 Cash Cash Equivalents and Restricted Cash at Beginning of Period 55531 57421
 Cash Cash Equivalents and Restricted Cash at End of Period 24924 38859 The
 accompanying notes are an integral part of these financial statements. 5 Table
 of Contents Noble Midstream Partners LP Consolidated Statement of Changes in 
Equity (in thousands unaudited) Partnership Common Units Subordinated Units 
General Partner Noncontrolling Interests

I need to extract the text that comes after the words Subordinated units, together with the four numbers that follow this combination of words, up to the first Cash Flow. I have constructed the following regex:

CONSOLIDATED STATEMENTS? OF OPERATIONS?.+?\sSubordinated units.+?\s(\(?\d*[.]?(\d+)?\)?\s\(?\d*[.]?(\d+)?\)?\s\(?\d*[.]?(\d+)?\)?\s\(?\d*[.]?(\d+)?\)?)

This regex should not find any match, as there are only two numbers after the expression Subordinated units. However, it manages to match up to the end of Noble Midstream Partners LP Consolidated Statements of Cash Flows (in thousands unaudited) Three Months Ended March 31 2018 2017, which has three numbers and is the second occurrence of Cash Flow. How do I make sure that it catches only exactly four numbers and does not extend to the second Cash Flow?

I think this regex might solve your problem. It searches until the first Cash Flows.

It uses the (?s) modifier (the inline form of re.DOTALL) to let the dot . match newlines. Think of s in this case as treating the input as a single string rather than matching line by line.
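A minimal sketch of what (?s) changes, using a made-up two-line string (the `sample` text is an assumption for illustration, not the real filing):

```python
import re

# Hypothetical two-line input: the phrase we want spans a line break.
sample = "Statements of Operations\nSubordinated Units 0.97 0.77"

# Without DOTALL, "." stops at the newline, so the pattern cannot span both lines.
without_dotall = re.search(r"Operations.+?Units", sample)

# With (?s), "." also matches "\n"; passing re.DOTALL as a flag is equivalent.
with_dotall = re.search(r"(?s)Operations.+?Units", sample)

print(without_dotall)       # None
print(with_dotall.group())  # spans the newline
```

The same effect can be had by passing flags=re.DOTALL to re.compile or re.search instead of writing (?s) inside the pattern.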

At first, I was capturing up to the second Cash Flows, but I noticed that the first occurrence had a newline between Cash and Flows. To correct for this, I wrote Cash\s+Flows, so that the two words may be separated by any whitespace (a regular space or a newline, which is also a whitespace character).
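To illustrate the difference, here is a small sketch on a hypothetical string that wraps between the two words the same way the filing text does:

```python
import re

# Hypothetical excerpt: a line break (plus a leading space) sits between the words.
wrapped = "Consolidated Statements of Cash\n Flows (in thousands unaudited)"

# A literal single space does not match "\n ".
literal_space = re.search(r"Cash Flows", wrapped)

# \s+ accepts one or more whitespace characters, including newlines.
any_whitespace = re.search(r"Cash\s+Flows", wrapped)

print(literal_space)                 # None
print(repr(any_whitespace.group()))  # 'Cash\n Flows'
```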

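import re

# Read the whole file into one string so the pattern can span line breaks.
with open('cash_flow.txt', 'r') as fin:
    text = fin.read()

# (?s) lets . match newlines; the lazy .+? stops at the first "Cash Flows".
p = re.compile(r'(?s)(Consolidated Statements of Operations.+?Cash\s+Flows)')

m = p.search(text)

# search() returns None when nothing matches, so check before using the group.
if m:
    print(m.group(1))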

The printout I got was:

Consolidated Statements of Operations and    Comprehensive Income (in thousands except per unit amounts unaudited) Three   Months Ended March 31 2018 2017 Revenues Midstream Services - Affiliate 64263
 50314 Midstream Services - Third Party 11360 0 Crude Oil Sales - Third Party
 22110 0 Total Revenues 97733 50314 Costs and Expenses Cost of Crude Oil Sales
 21439 0 Direct Operating 17148 11401 Depreciation and Amortization 11329 2449
 General and Administrative 10442 2742 Total Operating Expenses 60358 16592
 Operating Income 37375 33722 Other (Income) Expense Interest Expense Net of
 Amount Capitalized 1033 267 Investment Income (2868) (1065) Total Other Income
 (1835) (798) Income Before Income Taxes 39210 34520 Income Tax Provision 74 0
 Net Income 39136 34520 Less: Net (Loss) Income Attributable to Noncontrolling
 Interests (225) 10178 Net Income Attributable to Noble Midstream Partners LP
 39361 24342 Less: Net Income Attributable to Incentive Distribution Rights 819 0
 Net Income Attributable to Limited Partners 38542 24342 Net Income Attributable
 to Limited Partners Per Limited Partner Unit - Basic and Diluted Common Units
 0.97 0.77 Subordinated Units 0.97 0.77 Weighted Average Limited Partner Units
 Outstanding - Basic Common Units 23683 15903 Subordinated Units 15903 15903
 Weighted Average Limited Partner Units Outstanding - Diluted Common Units 23698
 15909 Subordinated Units 15903 15903 The accompanying notes are an integral part
 of these financial statements. 3 Table of Contents Noble Midstream Partners LP
 758 The accompanying notes are an integral part of these financial statements.
 4 Table of Contents Noble Midstream Partners LP Consolidated Statements of Cash
 Flows
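On the "exactly four numbers" part of the question: every piece of the number token \(?\d*[.]?(\d+)?\)? is optional, so the token can match an empty string, and three numbers followed by whitespace can satisfy a pattern that was meant to require four. The following is only a sketch of one way to require at least one digit per number. The names NUMBER, PATTERN and extract, as well as the two short test strings, are assumptions made for illustration and are not part of the original post.

```python
import re

# One number such as 15903, 0.97 or (2520); at least one digit is required.
NUMBER = r"\(?\d+(?:\.\d+)?\)?"

PATTERN = re.compile(
    r"Subordinated\s+Units"          # the anchor phrase
    r"((?:\s+" + NUMBER + r"){4})"   # four whitespace-separated numbers
    r"(.*?)"                         # lazily, whatever follows ...
    r"Cash\s+Flows",                 # ... up to the first "Cash Flows"
    re.DOTALL | re.IGNORECASE,
)


def extract(text):
    """Return (numbers, following_text), or None if no phrase has four numbers."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.group(1).split(), m.group(2).strip()


# Hypothetical inputs: one with two numbers after the phrase, one with four.
two_numbers = "Subordinated Units 0.97 0.77 Weighted Average Statements of Cash\n Flows"
four_numbers = "Subordinated Units 0.97 0.77 15903 15903 The notes Statements of Cash\n Flows"

print(extract(two_numbers))   # None
print(extract(four_numbers))  # (['0.97', '0.77', '15903', '15903'], 'The notes Statements of')
```

Note that this sketch accepts four or more numbers after the phrase; rejecting a fifth number would need an additional check.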
