
How to extract the first paragraph using regex in Python

I need to extract the first paragraph from the text below.

corpus = """CHRISTINE MOORMAN and REBECCA J. SLOTEGRAAF*

 Current interdisciplinary research suggests that organizational capabil-
 ities have a direct, unconditional impact on firm performance. The authors
 extend this literature by developing a framework that proposes a contin-
 gency approach to the value of organizational capabilities. This frame-
 work highlights the effect of information in the external environment in
 stimulating firms to deploy their technology and marketing capabilities to
 influence the level and speed of relevant product development activities.
 

 The Contingency Value of Complementary Capabilities in Product Development."""

I was able to match the uppercase letters in the corpus as a starting point, but I am unable to set the end point.

My regex pattern for finding the uppercase text is:

r'+[A-Z]\W\s'

The text above was taken from a PDF file, so I would really appreciate a general pattern that can fetch similar text from other files.

Please find the sample PDF image: https://ibb.co/FW9V0nk

preferred_output = ['Current interdisciplinary research suggests that organizational capabil-ities have a direct, unconditional impact on firm performance. The authors extend this literature by developing a framework that proposes a contin-gency approach to the value of organizational capabilities. This frame-work highlights the effect of information in the external environment in stimulating firms to deploy their technology and marketing capabilities to influence the level and speed of relevant product development activities.']

Here is a general regex. It is written for the layout of your current string and assumes that the other strings follow the same layout; otherwise, you may need to adjust the pattern a little:

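import re  # the standard-library re module is sufficient for this pattern

# Capture every line between the first blank line and the next empty line, then drop the newlines.
x = [re.search(r'\n\n((.*\n)*?\n)', corpus).group(1).replace('\n', '')]
print(x)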
  • \n\n matches the blank line that precedes the first paragraph.
  • (.*\n) matches one complete line, including its trailing newline.
  • *?\n repeats that line match lazily and stops at the first empty line, which is where the paragraph ends.
  • The outer () collects all of those lines into a single capture group, read with group(1).
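
Text extracted from other PDFs may separate paragraphs with whitespace-only lines or Windows line endings, in which case a literal \n\n may not match. The following is a minimal sketch of a more tolerant variant, not a definitive solution. It assumes paragraphs are separated by at least one blank or whitespace-only line and that the paragraph you want is the second block (the first being the author line). The helper name first_paragraph and its skip parameter are hypothetical.

```python
import re

def first_paragraph(text, skip=1):
    """Return the block that follows the first `skip` blank-line-separated blocks.

    Hypothetical helper: assumes blocks are separated by one or more
    blank or whitespace-only lines. Returns None if there are too few blocks.
    """
    # Normalise Windows and old Mac line endings to \n.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Split wherever a newline is followed by optional whitespace and another newline.
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text.strip())]
    blocks = [b for b in blocks if b]  # drop empty fragments
    return blocks[skip] if len(blocks) > skip else None
```

The returned block still contains its internal line breaks, so it can be joined or cleaned up afterwards. If the layout of a file differs (for example, a title before the author line), the skip value would need to change.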

OUTPUT:

[' Current interdisciplinary research suggests that organizational capabil- ities have a direct, unconditional impact on firm performance. The authors extend this literature by developing a framework that proposes a contin- gency approach to the value of organizational capabilities. This frame- work highlights the effect of information in the external environment in stimulating firms to deploy their technology and marketing capabilities to influence the level and speed of relevant product development activities. ']
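
Note that this output keeps the leading space of every line, so words split across lines come out as "capabil- ities". If you want them joined as in the preferred output, one possible clean-up step is sketched below. The function name join_wrapped_lines and the dehyphenate flag are hypothetical, and the sketch assumes that a line ending in "-" always continues a word on the next line, which will not hold for every real hyphenated compound.

```python
def join_wrapped_lines(paragraph, dehyphenate=False):
    """Join the physical lines of one paragraph into a single string.

    Lines ending in '-' are glued directly to the next line; with
    dehyphenate=True the trailing hyphen is dropped as well.
    """
    # Strip each line and ignore whitespace-only lines.
    lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
    out = ''
    for ln in lines:
        if not out:
            out = ln
        elif out.endswith('-'):
            # Assumed to be a word broken across two lines.
            out = (out[:-1] if dehyphenate else out) + ln
        else:
            out += ' ' + ln
    return out
```

For example, passing re.search(r'\n\n((.*\n)*?\n)', corpus).group(1) to this function instead of calling .replace('\n', '') should give "capabil-ities" rather than "capabil- ities", and "capabilities" with dehyphenate=True.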

