I have a text file with the following pattern:
Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
A, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel
Vel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque
Enim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur
tincidunt. sem. vitae,
montes, tellus. amet, venenatis natoque enim. fringilla
quis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,
nisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel
Aenean ultricies nec, eu laoreet.
Dr. Enim. vitae, feugiat in, Aenean
Abstract title: Massa. sociis dis dapibus dolor semper ipsum
jalor
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
ligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies
imperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,
Phasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,
vulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,
consequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,
nascetur
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
Dr. Justo. nisi elementum ante, Donec Aenean Nulla
Abstract title:
Aenean consectetuer leo penatibus eget imperdiet nisi. consequat
lorem pretium mus.
Prof. Dr. Aliquam metus semper
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum
eleifend
More information will be available soon.
I want to extract these parts:
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor
Abstract title:
and
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon.
I found a few resources helpful, but the pattern
(?<=(Abstract title:))(.*)(?=\n{2})
returns only
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
and
Abstract title:
Also, I am not sure which tool would be most efficient for this: awk, shell, or R? Please forgive me if this is a noob question; I am open to suggestions.
In R, you can extract your matches and "normalize" all whitespace inside matches to a regular single space using
x <- "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.\nAbstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing\n\nA, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel\n\nVel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque\nEnim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur\ntincidunt. sem. vitae,\nmontes, tellus. amet, venenatis natoque enim. fringilla\nquis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,\nnisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel\nAenean ultricies nec, eu laoreet.\n\nDr. Enim. vitae, feugiat in, Aenean\nAbstract title: Massa. sociis dis dapibus dolor semper ipsum\njalor\n\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\nligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies\nimperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,\nPhasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,\nvulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,\nconsequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,\nnascetur\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\n\n\nDr. Justo. nisi elementum ante, Donec Aenean Nulla\nAbstract title:\n\nAenean consectetuer leo penatibus eget imperdiet nisi. consequat\nlorem pretium mus. \n\nProf. Dr. Aliquam metus semper\nAbstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum\neleifend\nMore information will be available soon.\n"
library(stringr)
pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"
results <- lapply(str_extract_all(x, pattern), function(z) trimws(gsub("\\s+", " ", z)))
The results will look like
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetuer adipiscing"
[2] "Massa. sociis dis dapibus dolor semper ipsum jalor"
[3] ""
[4] "Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."
See the R demo online and the regex demo.
Regex details:
(?<=Abstract title:) - a positive lookbehind that matches a position immediately preceded by Abstract title:
.* - any zero or more chars other than line break chars, as many as possible
(?:\n(?!\n).*)* - zero or more sequences of
    \n(?!\n) - a line feed char not immediately followed by another line feed char
    .* - any zero or more chars other than line break chars, as many as possible
The lapply(..., function(z) trimws(gsub("\\s+", " ", z))) call "shrinks" the whitespace in the resulting list.
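To see each part of the pattern at work, here is a minimal sketch on a small made-up string. It uses base R (gregexpr() and regmatches() with perl = TRUE) in place of stringr so that it runs without extra packages; the sample text and the variable names are assumptions for illustration only.

```r
# Made-up sample: two entries; the second has an empty title
txt <- "Dr. A\nAbstract title: First line\nsecond line\n\nBody text\n\nDr. B\nAbstract title:\n\nBody two\n"

# Same pattern as above; perl = TRUE is required for lookarounds in base R
pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"

# gregexpr() locates every match, regmatches() extracts the matched text
matches <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Collapse internal whitespace (including line feeds) and trim both ends
titles <- trimws(gsub("\\s+", " ", matches))
print(titles)
```

The first match continues across the single line feed onto "second line" and stops at the blank line; the second match is an empty string because a blank line follows the colon directly.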
Parsing the text file into two columns
You can use
library(readr)
library(stringr)
file <- read_lines(path)
file_string <- paste(file, collapse="\n")
pattern <- "(?m)^(.+)\n(Abstract title:.*(?:\n(?!\n).*)*)"
res <- str_match_all(file_string, pattern)
res <- lapply(res, function(z) trimws(gsub("\\s+", " ", z[,-1])))
The output is
[[1]]
[,1] [,2]
[1,] "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue." "Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing"
[2,] "Dr. Enim. vitae, feugiat in, Aenean" "Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor"
[3,] "Dr. Justo. nisi elementum ante, Donec Aenean Nulla" "Abstract title:"
[4,] "Prof. Dr. Aliquam metus semper" "Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."
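Since the end goal is a data frame, the two columns can also be given names. The following is only a sketch under assumptions: it uses base R instead of stringr, a made-up sample string, and hypothetical column names (Name, AbstractTitle). Each full match is split at its first line feed rather than read from capture groups.

```r
# Made-up sample in the same layout: a name line followed by an "Abstract title:" line
txt <- "Dr. A\nAbstract title: First line\nsecond line\n\nBody text\n\nProf. B\nAbstract title:\n\nBody two\n"

pattern <- "(?m)^(.+)\n(Abstract title:.*(?:\n(?!\n).*)*)"
full <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Name: everything before the first line feed of each match
name <- sub("(?s)\n.*", "", full, perl = TRUE)
# Title: everything after the first line feed, with whitespace normalized
title <- sub("^[^\n]*\n", "", full, perl = TRUE)
title <- trimws(gsub("\\s+", " ", title))

df <- data.frame(Name = name, AbstractTitle = title, stringsAsFactors = FALSE)
print(df)
```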
Try this regex:
Abstract title:(?:.|\r?\n\w)*
It captures everything like:
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor
Abstract title:
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon.
(As you mentioned in your question.)
Let me know if this works for you.
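A quick way to try this regex in R is sketched below, using base R with perl = TRUE and a made-up sample string (both are assumptions, not part of the answer). Note that the raw matches still contain line feeds, so whitespace has to be normalized afterwards, and that \w requires each continuation line to start with a letter, digit, or underscore.

```r
# Made-up sample with a wrapped title and an empty title
txt <- "Dr. A\nAbstract title: First line\nsecond line\n\nBody text\n\nProf. B\nAbstract title:\n\nBody two\n"

# "\r" and "\n" are literal characters in the R string; "\\w" is the regex class \w
pattern <- "Abstract title:(?:.|\r?\n\\w)*"

raw <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
titles <- trimws(gsub("\\s+", " ", raw))
print(titles)

# Limitation: a continuation line that starts with a non-word character ends the match
txt2 <- "Abstract title: X\n(continued)\n"
cut_short <- regmatches(txt2, gregexpr(pattern, txt2, perl = TRUE))[[1]]
print(cut_short)
```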
Firstly, sorry for the late reply; I am not as fast as most of you here XD. As you can see from the discussion under the answer that begins "In R, you can extract your matches...", I originally posted only part of the problem, because the full version would have been too complicated IMO. Here I put the pieces together. Everyone who commented or answered has been helpful. I accepted that answer because its author spent a lot of time on it and guided me in the right direction.
GOAL: Please take a look at the MWE in the question for reference. I need to extract three individual columns from a text file that was produced by running pdftotext on a PDF containing text and images. The end result should be a data frame with three columns, for example:
Column 1.
Dr. Enim. vitae, feugiat in, Aenean
Column 2.
Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor
Column 3.
Semper tincidunt. ullamcorper ... viverra pede elit. eget aliquet eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
So the newlines need to be kept intact, as Wiktor Stribiżew and I realised, which makes this hard to do with grep, sed, and awk. I took my cue from Wiktor Stribiżew's answer and used R.
Reading the pattern file and removing the form feed character \f (U+000C)
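library(readr)   # read_file()
library(stringr) # str_*() functions; also provides %>%
wholeFileAsString <- read_file(file="pattern.txt")
# str_remove() drops only the first form feed; str_remove_all("\f") is applied further down
wholeFileAsString <- wholeFileAsString %>% str_remove("\f");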
testString <- "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
A, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel
Vel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque
Enim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur
tincidunt. sem. vitae,
montes, tellus. amet, venenatis natoque enim. fringilla
quis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,
nisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel
Aenean ultricies nec, eu laoreet.
Dr. Enim. vitae, feugiat in, Aenean
Abstract title: Massa. sociis dis dapibus dolor semper ipsum
jalor
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
ligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies
imperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,
Phasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,
vulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,
consequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,
nascetur
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
Dr. Justo. nisi elementum ante, Donec Aenean Nulla
Abstract title:
Aenean consectetuer leo penatibus eget imperdiet nisi. consequat
lorem pretium mus.
Prof. Dr. Aliquam metus semper
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum
eleifend
More information will be available soon."
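For reference, \f is the form feed character (U+000C), which pdftotext typically writes at page breaks. The sketch below, on a made-up two-page string, shows the difference between removing the first form feed and removing all of them; sub() and gsub() are used here as the base R counterparts of str_remove() and str_remove_all().

```r
# "\f" is the form feed character, code point 12 (U+000C)
stopifnot(utf8ToInt("\f") == 12L)

# Made-up sample with a form feed at the start of each page
pages <- "\fPage one\n\fPage two\n"

first_only <- sub("\f", "", pages, fixed = TRUE)   # removes the first form feed only
all_gone   <- gsub("\f", "", pages, fixed = TRUE)  # removes every form feed

print(grepl("\f", first_only, fixed = TRUE))  # a form feed is still present
print(grepl("\f", all_gone, fixed = TRUE))    # none left
```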
Extracting the AbstractTitle and Name columns individually.
# Earlier attempt, overwritten by the assignment on the next line
pattern_AbstractTitle <- "(?m)(^Abstract.*\n){1}(.*\n)";
pattern_AbstractTitle <- "(?m)(^Abstract.*\n){1}^((?!More).)*";
AbstractTitle <- str_extract_all(wholeFileAsString, pattern_AbstractTitle) %>%
unlist() %>% as.vector();
pattern_Name <- "(?m)(?<=\n\n)\f?(?:^.*\n)(.*\n)?(?=^Ab(s?)tract.*)";
# (?<=\n\n) cannot match at the very start of the file, so the first name is prepended by hand.
Name <- wholeFileAsString %>% str_extract_all(pattern_Name) %>% unlist() %>%
as.vector() %>% c("Prof. Mauro Alini, AO Research Institute", .);
AbstractTitle <- append(AbstractTitle,
"Abstract title: more information will follow soon\n", after=
length(AbstractTitle));
Removing the AbstractTitle and Name lines only.
pattern_NameAbstractTitle <-
"(?m)(?<=\n\n)\f?(?:^.*\n)(.*\n)?(?=^Ab(s?)tract.*)(^Abstract.*\n){1}^((?!More).)*"
wholeFileAsString_wo_NameAbstractTitle <- wholeFileAsString %>%
str_remove_all(pattern_NameAbstractTitle) %>% as.vector();
Writing to .txt (see https://stackoverflow.com/a/21481387/9592557)
# write.table(wholeFileAsString_wo_NameAbstractTitle, file = "outfile.txt")
writeLines(wholeFileAsString_wo_NameAbstractTitle,
"wholeFileAsString_wo_NameAbstractTitle.txt")
# read_file(file = "wholeFileAsString_wo_NameAbstractTitle.txt")
wholeFileAsString_wo_NameAbstractTitle %>% str()
# Drop form feeds, then delete each line feed not immediately followed by another one
shortBio <- wholeFileAsString_wo_NameAbstractTitle %>%
  str_remove_all("\f") %>%
  str_replace_all("(?m)\n(?!\n)", "")
# Collapse runs of line feeds and split into one element per line
shortBio2 <- shortBio %>%
  str_replace_all("(?m)\n+", "\n") %>%
  str_extract_all(".*\n") %>%
  unlist()
str(shortBio)
shortBio2
# Drop the first two elements
shortBio3 <- as.vector(shortBio2)
shortBio3 <- shortBio3[2:length(shortBio3)]
shortBio4 <- shortBio3[-1]
Writing to .txt files.
shortBio4 %>% writeLines("shortBio4.txt");
dfNameAbstractTitle <- data.frame(Name, AbstractTitle);
dfNameAbstractTitle <- dfNameAbstractTitle %>% rbind(c(NA, NA));
dfNameAbstractTitleShortBio <- dfNameAbstractTitle %>% data.frame(shortBio4)
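As an alternative to aligning three separately extracted vectors, the text can be split into blocks at blank lines and grouped into records. This is only a sketch under stated assumptions, not the method used above: it relies on base R, a made-up sample, hypothetical names (squash, blocks, is_head, record, ShortBio), and on every name sitting in the same block as its "Abstract title:" line.

```r
# Made-up sample in the same layout as the question
txt <- paste(
  "Dr. A", "Abstract title: First line", "second line", "",
  "Bio of A, part one", "wrapped line", "", "Bio of A, part two", "",
  "Prof. B", "Abstract title:", "", "Bio of B",
  sep = "\n"
)

# Hypothetical helper: collapse whitespace runs to one space and trim
squash <- function(s) trimws(gsub("\\s+", " ", s))

# Split into blocks at blank lines
blocks <- strsplit(txt, "\n{2,}", perl = TRUE)[[1]]
# A block containing an "Abstract title:" line opens a new record
is_head <- grepl("(?m)^Abstract title:", blocks, perl = TRUE)
record <- cumsum(is_head)

heads <- blocks[is_head]
# Name: the part of the head block before the "Abstract title:" line
name <- squash(sub("(?s)\nAbstract title:.*", "", heads, perl = TRUE))
# Title: the head block from "Abstract title:" onwards
title <- squash(sub("(?s)^.*?(?=Abstract title:)", "", heads, perl = TRUE))
# Bio: all remaining blocks of the same record, joined together
bio <- vapply(seq_along(heads), function(i) {
  squash(paste(blocks[!is_head & record == i], collapse = " "))
}, character(1))

df <- data.frame(Name = name, AbstractTitle = title, ShortBio = bio,
                 stringsAsFactors = FALSE)
print(df)
```

Because each record is assembled from its own blocks, no padding rows are needed to make the columns the same length. Blocks that appear before the first "Abstract title:" block are ignored.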