Extract specific section from LaTeX file with Python

Question

I have a set of LaTeX files. I would like to extract the "abstract" section for each one:

\begin{abstract}

.....

\end{abstract}

I have tried the suggestion from "How to Parse LaTex file", and tried:

A = re.findall(r'\\begin{abstract}(.*?)\\end{abstract}', data)

Here data contains the text from the LaTeX file, but A is just an empty list. Any help would be greatly appreciated!

Answer 1

By default, . does not match a newline, so .*? cannot span the line breaks between \begin{abstract} and \end{abstract}. Pass the re.S flag to change that:

re.findall(r'\\begin{abstract}(.*?)\\end{abstract}', data, re.S)

Example

Consider this test file:

\documentclass{report}
\usepackage[margin=1in]{geometry}
\usepackage{longtable}

\begin{document}
Title maybe
\begin{abstract}
Good stuff
\end{abstract}
Other stuff
\end{document}

This gets the abstract:

>>> import re
>>> data = open('a.tex').read()
>>> re.findall(r'\\begin{abstract}(.*?)\\end{abstract}', data, re.S)
['\nGood stuff\n']

Documentation

From the re module's documentation:

re.S
re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
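Since the question mentions a set of files, here is a minimal sketch of how the same pattern might be applied to a whole directory. The names extract_abstract and extract_abstracts are hypothetical helpers, and the UTF-8 encoding and the *.tex glob are assumptions about the files rather than anything stated in the question.

```python
import re
from pathlib import Path

# Compile once; re.DOTALL lets '.' span the line breaks inside the abstract.
ABSTRACT_RE = re.compile(r'\\begin\{abstract\}(.*?)\\end\{abstract\}', re.DOTALL)


def extract_abstract(tex_source):
    """Return the first abstract body with outer whitespace stripped, or None."""
    match = ABSTRACT_RE.search(tex_source)
    return match.group(1).strip() if match else None


def extract_abstracts(directory):
    """Map each .tex file name in `directory` to its abstract (or None).

    Assumes the files are UTF-8 encoded; adjust `encoding` if they are not.
    """
    return {
        path.name: extract_abstract(path.read_text(encoding='utf-8'))
        for path in sorted(Path(directory).glob('*.tex'))
    }


# Same content as the test file above, written as a string.
sample = (
    "\\begin{document}\nTitle maybe\n"
    "\\begin{abstract}\nGood stuff\n\\end{abstract}\n"
    "Other stuff\n\\end{document}\n"
)
print(extract_abstract(sample))
```

Using re.search instead of re.findall returns only the first abstract, which is usually what a single LaTeX document has. Note that a plain regex like this will also match an abstract that has been commented out with %, so it is a rough tool rather than a real LaTeX parser.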

Answer 2

By default, . does not match the newline character. However, you can pass a flag to make it match newlines as well.

Example:

import re

s = r"""\begin{abstract}
this is a test of the
linebreak capture.
\end{abstract}"""

pattern = r'\\begin\{abstract\}(.*?)\\end\{abstract\}'

print(re.findall(pattern, s, re.DOTALL))

# Output:
# ['\nthis is a test of the\nlinebreak capture.\n']
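To make the difference concrete, the following sketch runs the same pattern without the flag (reproducing the empty list from the question) and then with two alternatives that should behave like re.DOTALL: the inline (?s) flag and the [\s\S] character class. The variable names are only illustrative.

```python
import re

s = "\\begin{abstract}\nthis is a test of the\nlinebreak capture.\n\\end{abstract}"

# No flag: '.' stops at the first newline, so nothing matches.
without_flag = re.findall(r'\\begin\{abstract\}(.*?)\\end\{abstract\}', s)

# Inline flag: (?s) at the start of the pattern turns on DOTALL.
inline_flag = re.findall(r'(?s)\\begin\{abstract\}(.*?)\\end\{abstract\}', s)

# Character class: [\s\S] matches any character, newline included, with no flag.
char_class = re.findall(r'\\begin\{abstract\}([\s\S]*?)\\end\{abstract\}', s)

print(without_flag)  # []
print(inline_flag)   # ['\nthis is a test of the\nlinebreak capture.\n']
print(char_class)    # ['\nthis is a test of the\nlinebreak capture.\n']
```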
