Extract all text from \section tag in LaTeX

Question

How do I extract only the section text from the following LaTeX?

\section{Introduction}

Direct detection techniques, like those that discovered the prototype T dwarf Gl 229B \citep{Na95,Op99}, have been used for the last 15 years to search for brown dwarfs around nearby stars\footnote{For a review of these companion searches see \citet{Op99}}. Despite the large samples involved, only two {\it bona fide} brown dwarf companions have been directly detected, Gl 229B and the young L-type brown dwarf G 196-3B \citep{Re98}\footnote{The companion object GD 165B \citep{Be89} may also be a brown dwarf, although its status is questionable \citep{Ki99b}.}. Since most of these searches have been confined to a narrow field of view around the primary (typically 10-60$\arcsec$), widely separated companions\footnote{We adopt an observational definition for ``widely separated'' as angular separation greater than 100$\arcsec$; see \citet{Fi92}.} may be missed. Indeed, both G 196-3B and Gl 229B are less than 20$\arcsec$ from their primary. Field surveys, such as the Two Micron All Sky Survey \citep[hereafter 2MASS]{Sk97}, the DEep Near Infrared Survey \citep[hereafter DENIS]{Ep97}, and the Sloan Digital Sky Survey \citep[hereafter SDSS]{Yk99}, overcome this limitation. Indeed, \citet{Ki00} have recently identified two L-type brown dwarf companions at wide separation.

We are currently searching the 2MASS catalogs for field T dwarfs \citep{Bg98}, brown dwarfs spectrally identified by CH$_4$ absorption bands at 1.6 and 2.2 $\micron$ \citep{Ki99a}. One of our discoveries, 2MASSW J1457150-212148 (hereafter Gl 570D), has been confirmed as a widely separated, common proper motion companion to the Gl 570ABC system. This system is comprised of a K4V primary and a M1.5V-M3V close binary \citep{Du88,Mi90,Fo99} at a distance of 5.91$\pm$0.06 pc \citep{Pe97}. In $\S$2 we describe the selection of this object from the 2MASS database, review subsequent observations, and establish its common proper motion with Gl 570ABC. In $\S$3 we estimate L and T$_{eff}$ of Gl 570D based on its distance and brightness, and make T$_{eff}$ and mass estimates using the evolutionary models of \citet{Bu97}.

(I'm mainly concerned with removing the tags that start with \ ; I'm not sure what to do with the $ tags yet.)

I've tried with TexSoup:

from TexSoup import TexSoup

# `file` is the name of a .tex file inside the latex/ directory
with open("latex/" + file) as f:
    try:
        soup = TexSoup(f)
        # Parse article body if .tex is a document, defined by \begin{document}
        if soup.document:
            # True if the previous child was a \section or \subsection tag
            lastChildIsSection = False
            for child in soup.document.contents:
                # If last child was \section or \subsection and current child is text,
                if lastChildIsSection and isinstance(child, str):
                    # Get text
                    print(child)

                # Check if \section or \subsection
                if type(child).__name__ == 'TexNode' and (child.name == 'section' or child.name == 'subsection'):
                    lastChildIsSection = True
                else:
                    lastChildIsSection = False
    except Exception as err:
        # Any errors in LaTeX formatting will result in file discard
        print("Discarding", file, "because of", err)

This looks for \section tags and then gets the next child, which is a str containing some text. Unfortunately, this isn't all of the text in the section, since the string ends as soon as the next tag appears, in this case a \cite tag.

How can I extract all of the text from the section? I'm open to any approach, including regular expressions.

Answer 1

I'm not sure how to do it in one go, so I would first extract everything belonging to a section with a regex like \\section{[\w\s]*}(.*)\\section{ (this may need to be improved if a section title can contain more than word characters or whitespace). Then I would replace all tags in that result with an empty string. I'm not sure about TeX syntax and regexes in Python, but something along the lines of \\\w*{.*} (in non-greedy mode) might do it. Sorry, this is not a complete solution, since I'm too unfamiliar with both TeX and Python, but I hope you'll get some ideas ;-)
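The two steps described above could be sketched in Python roughly as follows. This is only a minimal sketch using the standard re module, not a full TeX parser. The helper names section_bodies and strip_commands are made up for illustration, and the list of old-style font switches (\it, \bf, ...) is an assumption about what the input contains. Commands are removed together with their brace argument, innermost first, so nested cases such as a \citet inside a \footnote are handled by repeating the substitution. Math between $ signs is left alone unless it contains a command with a brace argument.

```python
import re

# A \section{...} or \section*{...} heading whose title contains no braces.
SECTION_RE = re.compile(r"\\section\*?\{[^{}]*\}")
# A command with optional [..] arguments and one brace argument
# that itself contains no nested braces, e.g. \citep[hereafter X]{Sk97}.
COMMAND_RE = re.compile(r"\\[A-Za-z]+\*?(?:\[[^\]]*\])*\{[^{}]*\}")
# An old-style font switch inside a group, e.g. {\it bona fide}
# (assumed list of switches; extend as needed).
FONT_GROUP_RE = re.compile(r"\{\\(?:it|bf|em|rm|tt|sl|sc)\b\s*([^{}]*)\}")


def section_bodies(tex):
    """Return the raw LaTeX that follows each section heading."""
    # Everything before the first heading is dropped.
    return [body.strip() for body in SECTION_RE.split(tex)[1:]]


def strip_commands(text):
    """Remove commands and their arguments, keeping the surrounding prose."""
    previous = None
    # Repeat until nothing changes so that nested commands are peeled
    # off from the inside out.
    while previous != text:
        previous = text
        text = FONT_GROUP_RE.sub(r"\1", text)  # keep the text, drop the switch
        text = COMMAND_RE.sub("", text)        # drop command and argument
    # Collapse the runs of spaces left behind by removed commands.
    return re.sub(r"[ \t]{2,}", " ", text)


if __name__ == "__main__":
    sample = r"""\section{Introduction}

Only two {\it bona fide} companions \citep{Re98} are known\footnote{See \citet{Op99}.}.

\section{Observations}

More text."""
    for body in section_bodies(sample):
        print(strip_commands(body))
```

Splitting on the heading avoids the need for a (.*) group that spans newlines, and it also returns the last section, which has no following \section to terminate it.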

Answer 2

tex2py, which is built on top of TexSoup, will do the job for you: https://github.com/alvinwan/tex2py . Say we have the following constraints:

We only want the text (not the \section{...} or \subsection{...} headings). However, this text includes bulleted lists, bolded words, citations, etc.
Note that there are two \section commands. Say we only want the text from the first one (we could fetch all of them, but this constraint is added just to make it harder).

Say the following is saved in a variable called data:

\section{Chikin Tales}

\subsection{Chikin Fly}

Chickens don't fly. They do only the following:

\begin{itemize}
\item waddle
\item plop
\end{itemize}

\section{Chikin Scream}

\subsection{Plopping}

Plopping involves three steps:

We can process it using the following.

>>> from tex2py import tex2py
>>> toc = tex2py(data)
>>> print([text for text in toc.section.subsection.descendants])
["\n\nChickens don't fly. They do only the following:\n\n", \begin{itemize}

\item waddle
\item plop\end{itemize}]

Here, toc.section grabs the first section, but we can use toc.sections to grab all sections. Admittedly, whitespace isn't preserved at the moment.

solution1
0 2018-03-25 09:39:57

solution2
0 2018-05-03 09:31:47