Regex between two specific patterns including newline

I have a text file with the following pattern:

Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

A, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel
Vel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque
Enim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur
tincidunt. sem. vitae,
montes, tellus. amet, venenatis natoque enim. fringilla
quis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,
nisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel
Aenean ultricies nec, eu laoreet.

Dr. Enim. vitae, feugiat in, Aenean
Abstract title: Massa. sociis dis dapibus dolor semper ipsum
jalor

Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
ligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies
imperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,
Phasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,
vulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,
consequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,
nascetur
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla


Dr. Justo. nisi elementum ante, Donec Aenean Nulla
Abstract title:

Aenean consectetuer leo penatibus eget imperdiet nisi. consequat
lorem pretium mus. 

Prof. Dr. Aliquam metus semper
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum
eleifend
More information will be available soon.

I want to extract these parts:

Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor

Abstract title:

and

Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon.

I found some related answers helpful, but (?<=(Abstract title:))(.*)(?=\n{2}) returns only

Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

and

Abstract title:

Also, I am not sure which software tool would be most efficient for this. Please forgive me if this is a noob question; I am open to suggestions.

In R, you can extract your matches and "normalize" all whitespace inside matches to a regular single space using

x <- "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.\nAbstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing\n\nA, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel\n\nVel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque\nEnim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur\ntincidunt. sem. vitae,\nmontes, tellus. amet, venenatis natoque enim. fringilla\nquis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,\nnisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel\nAenean ultricies nec, eu laoreet.\n\nDr. Enim. vitae, feugiat in, Aenean\nAbstract title: Massa. sociis dis dapibus dolor semper ipsum\njalor\n\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\nligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies\nimperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,\nPhasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,\nvulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,\nconsequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,\nnascetur\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\n\n\nDr. Justo. nisi elementum ante, Donec Aenean Nulla\nAbstract title:\n\nAenean consectetuer leo penatibus eget imperdiet nisi. consequat\nlorem pretium mus. \n\nProf. Dr. Aliquam metus semper\nAbstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum\neleifend\nMore information will be available soon.\n"
library(stringr)
pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"
results <- lapply(str_extract_all(x, pattern), function(z) trimws(gsub("\\s+", " ", z)))

The results will look like

[[1]]
[1] "Lorem ipsum dolor sit amet, consectetuer adipiscing"                                                                        
[2] "Massa. sociis dis dapibus dolor semper ipsum jalor"                                                                         
[3] ""                                                                                                                           
[4] "Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."

See the R demo online and the regex demo.

Regex details:

  • (?<=Abstract title:) - a positive lookbehind that matches a position immediately preceded by Abstract title:
  • .* - zero or more characters other than line break characters, as many as possible
  • (?:\n(?!\n).*)* - zero or more sequences of
    • \n(?!\n) - a line feed character not immediately followed by another line feed character
    • .* - zero or more characters other than line break characters, as many as possible

The lapply(..., function(z) trimws(gsub("\\s+", " ", z))) call "shrinks" the whitespace in the resulting list: every run of whitespace, including line breaks, becomes a single space, and leading and trailing whitespace is trimmed.
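To see the pattern at work on something smaller than the full text, here is a minimal sketch. It uses base R (regmatches and gregexpr with perl = TRUE) instead of stringr so that it runs without extra packages, and the sample string txt is a made-up stand-in for the real file.

```r
# Hypothetical sample: the first title wraps onto a second line,
# the second title is empty.
txt <- "Dr. A\nAbstract title: First title\ncontinued\n\nBody text.\n\nDr. B\nAbstract title:\n\nMore body."

# Same pattern as in the answer: start right after the label, then keep
# consuming lines as long as the line break is not followed by a blank line.
pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"

raw    <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
titles <- trimws(gsub("\\s+", " ", raw))  # squash line breaks and extra spaces
print(titles)
```

The wrapped title comes back as one line ("First title continued") and the empty title as an empty string, which mirrors items [2] and [3] in the output shown above.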

Parsing the text file into two columns

You can use

library(readr)
library(stringr)
file <- read_lines(path)
file_string <- paste(file, collapse="\n")
pattern <- "(?m)^(.+)\n(Abstract title:.*(?:\n(?!\n).*)*)"
res <- str_match_all(file_string, pattern)
res <- lapply(res, function(z) trimws(gsub("\\s+", " ", z[,-1])))

The output is

[[1]]
     [,1]                                                                           [,2]                                                                                                                                         
[1,] "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue." "Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing"                                                                        
[2,] "Dr. Enim. vitae, feugiat in, Aenean"                                          "Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor"                                                                         
[3,] "Dr. Justo. nisi elementum ante, Donec Aenean Nulla"                           "Abstract title:"                                                                                                                            
[4,] "Prof. Dr. Aliquam metus semper"                                               "Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."
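If a data frame is more convenient than a character matrix, the same pattern can feed one. The following is only a sketch under assumptions: it uses base R rather than str_match_all, the sample text is invented, and the column names Name and AbstractTitle are illustrative.

```r
# Hypothetical sample with two entries.
txt <- "Dr. A\nAbstract title: First title\ncontinued\n\nBody text.\n\nProf. B\nAbstract title:\n\nMore body."
pattern <- "(?m)^(.+)\n(Abstract title:.*(?:\n(?!\n).*)*)"

# Take the whole matches, then split each one at its first line break:
# the first line is the name, the remainder is the title block.
blocks <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
squash <- function(s) trimws(gsub("\\s+", " ", s))

df <- data.frame(
  Name          = squash(sub("(?s)\n.*", "", blocks, perl = TRUE)),
  AbstractTitle = squash(sub("^[^\n]*\n", "", blocks, perl = TRUE)),
  stringsAsFactors = FALSE
)
print(df)
```

Splitting the whole match afterwards avoids needing capture-group positions from base R, at the cost of a second pass over each block.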

Try this regex:

Abstract title:(?:.|\r?\n\w)*

It captures everything like:

  • Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

  • Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor

  • Abstract title:

  • Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon.

(As you mentioned in your question)
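A quick way to check this regex in R is sketched below, again with base R and an invented sample. One thing the sketch makes visible: because the line break must be followed by \w, a continuation line that starts with a non-word character (here an opening parenthesis) is not picked up.

```r
# In an R string the backslashes are doubled; the regex engine sees \r?\n\w.
pattern <- "Abstract title:(?:.|\\r?\\n\\w)*"

# Hypothetical sample: the second title continues on a line starting with "(".
txt <- "Abstract title: Massa. sociis\njalor\n\nBody text.\n\nAbstract title: Lorem\n(ipsum) dolor\n\nBody."

m       <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
m_clean <- trimws(gsub("\\s+", " ", m))
print(m_clean)
```

The first title is joined across the line break, while the second stops at "Lorem", so whether this regex is sufficient depends on how the wrapped lines in the real file begin.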

Regex101 Demo

Tell me if it works for you.

Firstly, sorry for the late reply; I am not as fast as most of you here XD. As you can see from the discussion under the answer that begins "In R, you can extract your matches", I actually posted only part of the problem, because otherwise it would have been too complicated, IMO. So here I put the pieces together. Everyone who commented or answered has been helpful. I accepted that answer because its author spent a lot of time and guided me in the right direction.

GOAL: Please take a look at the MWE in the question for reference. I need to extract three individual columns from a text file that was produced by running pdftotext on a PDF containing text and images. The end result should be a data frame with three columns, for example:

Column 1.

Dr. Enim. vitae, feugiat in, Aenean

Column 2.

Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor

Column 3.

Semper tincidunt. ullamcorper ... viverra pede elit. eget aliquet eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla

So the newlines need to be kept intact, as Wiktor Stribiżew and I realised, which makes this hard to do with the other tools I had in mind. I took my cue from Wiktor Stribiżew's answer and used R.

Read the text, such as the pattern provided in the question

Reading the text and removing the form feed character (\f, U+000C)

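library(readr)    # read_file()
library(stringr)  # str_* functions and the %>% pipe
wholeFileAsString <- read_file(file = "pattern.txt")
# str_remove() drops only the first form feed it finds
wholeFileAsString <- wholeFileAsString %>% str_remove("\f")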
testString <- "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

A, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel
Vel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque
Enim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur
tincidunt. sem. vitae,
montes, tellus. amet, venenatis natoque enim. fringilla
quis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,
nisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel
Aenean ultricies nec, eu laoreet.

Dr. Enim. vitae, feugiat in, Aenean
Abstract title: Massa. sociis dis dapibus dolor semper ipsum
jalor

Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
ligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies
imperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,
Phasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,
vulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,
consequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,
nascetur
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla


Dr. Justo. nisi elementum ante, Donec Aenean Nulla
Abstract title:

Aenean consectetuer leo penatibus eget imperdiet nisi. consequat
lorem pretium mus. 

Prof. Dr. Aliquam metus semper
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum
eleifend
More information will be available soon."
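Note that str_remove() only removes the first match, whereas str_remove_all() removes every match. The difference can be sketched with the base R counterparts sub() and gsub(); the string page_text is a made-up example with two form feeds.

```r
# Hypothetical text with a form feed (\f) at each of two page breaks.
page_text <- "Page one\n\fPage two\n\fPage three"

first_only <- sub("\f", "", page_text, fixed = TRUE)   # like str_remove()
all_gone   <- gsub("\f", "", page_text, fixed = TRUE)  # like str_remove_all()

print(grepl("\f", first_only, fixed = TRUE))  # TRUE: one form feed is left
print(grepl("\f", all_gone, fixed = TRUE))    # FALSE: none are left
```

If the file spans more than two pages, the all-matches variant is probably what is wanted at this step.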

Extracting the AbstractTitle and Name columns individually.

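# An earlier attempt, kept for reference; the assignment below supersedes it.
# pattern_AbstractTitle <- "(?m)(^Abstract.*\n){1}(.*\n)"
# The "Abstract..." line plus the following line, up to (not including) "More".
pattern_AbstractTitle <- "(?m)(^Abstract.*\n){1}^((?!More).)*"
AbstractTitle <- str_extract_all(wholeFileAsString, pattern_AbstractTitle) %>%
    unlist() %>% as.vector()
# One or two lines that follow a blank line and directly precede an
# "Abstract" (or misspelled "Abtract") line.
pattern_Name <- "(?m)(?<=\n\n)\f?(?:^.*\n)(.*\n)?(?=^Ab(s?)tract.*)"
# The first name is prepended by hand.
Name <- wholeFileAsString %>% str_extract_all(pattern_Name) %>% unlist() %>%
    as.vector() %>% c("Prof. Mauro Alini, AO Research Institute", .)
# The last abstract title is appended by hand.
AbstractTitle <- append(AbstractTitle,
    "Abstract title: more information will follow soon\n",
    after = length(AbstractTitle))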

Extracting the rest (everything except the Name and AbstractTitle lines).

pattern_NameAbstractTitle <- 
"(?m)(?<=\n\n)\f?(?:^.*\n)(.*\n)?(?=^Ab(s?)tract.*)(^Abstract.*\n){1}^((?!More).)*"
wholeFileAsString_wo_NameAbstractTitle <- wholeFileAsString %>% 
    str_remove_all(pattern_NameAbstractTitle) %>% as.vector();

Writing the whole string into a .txt

https://stackoverflow.com/a/21481387/9592557

# write.table(wholeFileAsString_wo_NameAbstractTitle, file = "outfile.txt")
writeLines(wholeFileAsString_wo_NameAbstractTitle, 
           "wholeFileAsString_wo_NameAbstractTitle.txt")

Preparing the columns with short biography

# read_file(file = "wholeFileAsString_wo_NameAbstractTitle.txt")
wholeFileAsString_wo_NameAbstractTitle %>% str()
# Drop form feeds, then delete every newline not followed by another newline.
shortBio <- wholeFileAsString_wo_NameAbstractTitle %>% str_remove_all("\f") %>%
    str_replace_all("(?m)\n(?!\n)", "")
# Collapse runs of newlines and split into one element per line.
shortBio2 <- shortBio %>% str_replace_all("(?m)\n+", "\n") %>%
    str_extract_all(".*\n") %>% unlist()
str(shortBio)
shortBio2
# Drop the first two elements.
shortBio3 <- shortBio2 %>% as.vector()
shortBio3 <- shortBio3[2:length(shortBio3)]
shortBio4 <- shortBio3[-1]
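The unwrapping step can be tried in isolation. The sketch below is a variant, not the code above: it adds a lookbehind so that only isolated line breaks are touched, and it replaces them with a space instead of an empty string so that words at a line end do not run together. The sample text is invented and base R is used.

```r
# Hypothetical hard-wrapped text: two paragraphs separated by a blank line.
wrapped <- "First paragraph, line one\nline two\n\nSecond paragraph\nwraps here too"

# A line break with no line break before or after it is a soft wrap:
# turn it into a space. Blank lines between paragraphs stay untouched.
unwrapped <- gsub("(?<!\n)\n(?!\n)", " ", wrapped, perl = TRUE)

# Split on the remaining blank lines to get one element per paragraph.
paragraphs <- strsplit(unwrapped, "\n{2,}", perl = TRUE)[[1]]
print(paragraphs)
```

With the pattern "\n(?!\n)" alone, the second line break of each blank line also matches (it is followed by text), which is why the code above ends up with single line breaks between paragraphs.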

Writing the character strings into .txt files.

shortBio4 %>% writeLines("shortBio4.txt")

Joining into a three-columned dataframe

dfNameAbstractTitle <- data.frame(Name, AbstractTitle)
# Append an all-NA row, then add the biographies as a third column.
dfNameAbstractTitle <- dfNameAbstractTitle %>% rbind(c(NA, NA))
dfNameAbstractTitleShortBio <- dfNameAbstractTitle %>% data.frame(shortBio4)
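data.frame() stops with an error when its columns have lengths that cannot be recycled to a common size, so the three vectors have to line up. One possible way to make that explicit is sketched below; the helper pad_to and the sample vectors are hypothetical and not part of the code above.

```r
# Hypothetical helper: extend a character vector to length n with NA.
pad_to <- function(v, n) c(v, rep(NA_character_, n - length(v)))

# Made-up vectors of unequal length standing in for the extracted columns.
Name          <- c("Dr. A", "Prof. B")
AbstractTitle <- c("Abstract title: First", "Abstract title:")
shortBio      <- c("Bio of A.", "Bio of B.", "Trailing text")

n  <- max(length(Name), length(AbstractTitle), length(shortBio))
df <- data.frame(
  Name          = pad_to(Name, n),
  AbstractTitle = pad_to(AbstractTitle, n),
  shortBio      = pad_to(shortBio, n),
  stringsAsFactors = FALSE
)
print(dim(df))
```

Padding keeps every extracted piece visible, so a misaligned row shows up as an NA rather than as an error or a silently shifted column.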
