
Regex - Get text between two strings

I have a large text file that contains many abstracts (about 7,000 of them). I want to separate them. They have the following properties:

Each one begins with a number followed by a period:

123.

and it always ends with:

[PubMed - indexed for MEDLINE]

It would be even better if I could get the title and abstract out of each separated string. I am fine with splitting the articles first and then splitting the text of each one.

In the example, the title starts on the third line:

Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.

The abstract starts on the 8th line:

Cardiopulmonary bypass (CPB) causes reperfusion injury...

I have tried the following regex on this text.

Regex:

[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE

Text:

1. Br J Biomed Sci. 2015;72(3):93-101.

Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.

Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.

Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had 
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.

PMID: 26510263  [PubMed - indexed for MEDLINE]

Try this:

"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"

The first group would be the title. The second would be the abstract.
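As a rough illustration of how this pattern could be applied from Java, here is a minimal sketch. The class name `TitleAbstractMatcher`, the method `extract`, and the use of `Pattern.MULTILINE` are assumptions of the sketch, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleAbstractMatcher {
    // The pattern from the answer. MULTILINE is an assumption of this sketch:
    // it lets ^ match at the start of every record, not only at the start of the input.
    private static final Pattern RECORD = Pattern.compile(
            "^[0-9]+\\..*\\s+(.*)\\s+.*\\s+((?:\\s|.)*?)\\[PubMed - indexed for MEDLINE\\]",
            Pattern.MULTILINE);

    // Returns one {group 1, group 2} pair per record found in the text.
    static List<String[]> extract(String text) {
        List<String[]> results = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            results.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up record with a single-line title, for illustration only.
        String text = "1. Some Journal. 2015;1(1):1-2.\n\nA one-line title.\n\n"
                + "Author A, Author B.\n\nThe abstract text.\n\n"
                + "PMID: 1  [PubMed - indexed for MEDLINE]\n";
        for (String[] pair : extract(text)) {
            System.out.println("Title:    " + pair[0]);
            System.out.println("Abstract: " + pair[1]);
        }
    }
}
```

Group 2 runs up to the closing marker, so it still ends with the `PMID: ...` text. The pattern also expects the title to fit on one line; in the sample above the title wraps onto a second line, so group 1 would hold only its first line and group 2 would begin at the author list.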

Two useful solutions have been proposed by Mariano and stribizhev:

Mariano's solution: Use the split method with the typical ending as the delimiter

(?m)\[PubMed - indexed for MEDLINE\]$

DEMO: http://ideone.com/Qw5ss2

Java 4+
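The split approach could look roughly like the following sketch. The class name `AbstractSplitter` and the method `splitRecords` are made up for illustration, and the code uses generics, so it needs a newer Java than the regex itself does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AbstractSplitter {
    // Delimiter from the answer: the MEDLINE marker sitting at the end of a line.
    private static final Pattern END_MARKER =
            Pattern.compile("(?m)\\[PubMed - indexed for MEDLINE\\]$");

    // Splits the whole file content into records, dropping blank leftovers.
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        for (String part : END_MARKER.split(text)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Two made-up records, for illustration only.
        String text = "1. Journal A. 2015;1(1):1-2.\n\nFirst title.\n\nAuthor A.\n\n"
                + "First abstract.\n\nPMID: 1  [PubMed - indexed for MEDLINE]\n\n"
                + "2. Journal B. 2015;2(2):3-4.\n\nSecond title.\n\nAuthor B.\n\n"
                + "Second abstract.\n\nPMID: 2  [PubMed - indexed for MEDLINE]\n";
        for (String record : splitRecords(text)) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```

Because the marker is the delimiter, it is removed from each record; each piece still ends with its `PMID: ...` text. Title and abstract would then have to be pulled out of each record in a second step.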

stribizhev's solution: Fully extract the data from the text

(?m)^\s*\d+\..*\R{2}                 # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*)  # Get title
\R{2}                                # Get to the authors
[^\n]*(?:\n(?!\n)[^\n]*)*            # Consume authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) # Grab abstract

DEMO: https://regex101.com/r/sG2yQ2/2

Java 8+
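A minimal Java sketch of this extraction might look like the following. It assembles the pattern from commented string pieces instead of relying on free-spacing mode, writes the spaces in the marker literally, and uses `[^\n]` for the authors line, since Java rejects `\R` inside a character class. The class name `AbstractExtractor`, the method `extract`, and the assumption of `\n` line endings are choices of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbstractExtractor {
    private static final Pattern RECORD = Pattern.compile(
            "(?m)^\\s*\\d+\\..*\\R{2}"                    // numbered header line, then a blank line
          + "(?<title>[^\\n]*(?:\\n(?!\\n)[^\\n]*)*)"     // title, possibly wrapped over several lines
          + "\\R{2}"                                       // blank line before the authors
          + "[^\\n]*(?:\\n(?!\\n)[^\\n]*)*"               // authors (consumed, not captured)
          + "(?<abstract>[^\\[]*(?:\\[(?!PubMed - indexed for MEDLINE\\])[^\\[]*)*)");

    // Returns one {title, abstract} pair per record found in the text.
    static List<String[]> extract(String text) {
        List<String[]> results = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            // Join a wrapped title into a single line.
            String title = m.group("title").replaceAll("\\s+", " ").trim();
            // The abstract group runs up to the marker, so it keeps the PMID text.
            String abstractText = m.group("abstract").trim();
            results.add(new String[] { title, abstractText });
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up record with a wrapped title, for illustration only.
        String text = "1. Some Journal. 2015;1(1):1-2.\n\nA title that wraps\nonto a second line.\n\n"
                + "Author A, Author B.\n\nThe abstract text\nover two lines.\n\n"
                + "PMID: 1  [PubMed - indexed for MEDLINE]\n";
        for (String[] pair : extract(text)) {
            System.out.println("Title:    " + pair[0]);
            System.out.println("Abstract: " + pair[1]);
        }
    }
}
```

The unrolled `[^\[]*(?:\[(?!...)[^\[]*)*` form lets the abstract contain square brackets of its own while still stopping at the MEDLINE marker.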
