
Extract cited bibtex keys from tex file using regex in python

I'm trying to extract cited BibTeX keys from a LaTeX document using regex in Python.

I'd like to exclude a citation if it is commented out (% in front), but still include it if it is only preceded by an escaped percent sign (\%).

Here is what I came up with so far:

\\(?:no|)cite\w*\{(.*?)\}

An example to try it out:

blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match

Regex101 testing: https://regex101.com/r/ZaI8kG/2/
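To make the problem concrete, here is a minimal sketch that runs the pattern above on a shortened version of the sample. The names `pattern`, `sample` and `found` are only illustrative. It shows the two unwanted hits: the commented-out key and the `*`.

```python
import re

# Pattern from the question, unchanged.
pattern = re.compile(r'\\(?:no|)cite\w*\{(.*?)\}')

# Shortened sample (illustrative subset of the example above).
sample = r"""
\cite{author92} % should match
%\nocite{author96} % should not match
\nocite{*} % should not match
"""

found = pattern.findall(sample)
print(found)  # ['author92', 'author96', '*']
```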

I appreciate any help.

Use the newer regex module (pip install regex) with the following expression:

(?<!\\)%.+(*SKIP)(*FAIL)|\\(?:no)?citep?\{(?P<author>(?!\*)[^{}]+)\}

See a demo on regex101.com.


More verbose:

    (?<!\\)%.+(*SKIP)(*FAIL)  # % (not preceded by \)
                              # and the whole line shall fail
    |                         # or
    \\(?:no)?citep?           # \nocite, \cite or \citep
    \{                        # { literally
    (?P<author>(?!\*)[^{}]+)  # must not start with a star
    \}                        # } literally


If installing another library is not an option, you need to change the expression to

(?<!\\)%.+|(\\(?:no)?citep?\{((?!\*)[^{}]+)\})

and check programmatically whether the second capture group has been set (that is, it is not empty).
In Python, this could look like:

    import re

    latex = r"""
    blablabla
    Author et. al \cite{author92} bla bla. % should match
    \citep{author93} % should match
    \nocite{author94} % should match
    100\%\nocite{author95} % should match
    100\% \nocite{author95} % should match
    %\nocite{author96} % should not match
    \cite{author97, author98, author99} % should match
    \nocite{*} % should not match
    """

    # Alternative 1 consumes comments; alternative 2 captures the citation.
    rx = re.compile(r'''(?<!\\)%.+|(\\(?:no)?citep?\{((?!\*)[^{}]+)\})''')

    # Group 2 is only set when the citation alternative matched.
    authors = [m.group(2) for m in rx.finditer(latex) if m.group(2)]
    print(authors)

Which yields

 ['author92', 'author93', 'author94', 'author95', 'author95', 'author97, author98, author99'] 
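Note that a multi-key citation such as \cite{author97, author98, author99} comes back as a single string. If individual keys are wanted, one possible follow-up is to split each match on commas. The sketch below assumes that duplicates should be dropped and first-seen order kept, which the question does not specify. The helper name `cited_keys` is hypothetical.

```python
import re

# Same expression as in the answer above: alternative 1 consumes comments,
# alternative 2 captures the citation; group 2 holds the key list.
rx = re.compile(r'(?<!\\)%.+|(\\(?:no)?citep?\{((?!\*)[^{}]+)\})')

def cited_keys(latex):
    """Return individual BibTeX keys in order of first appearance."""
    keys = []
    for m in rx.finditer(latex):
        if not m.group(2):
            continue  # this match was a comment, not a citation
        for key in m.group(2).split(','):
            key = key.strip()
            # Assumption: each key is reported only once.
            if key and key not in keys:
                keys.append(key)
    return keys

sample = r"""
\cite{author97, author98, author99} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\citep{author97}
"""
print(cited_keys(sample))  # ['author97', 'author98', 'author99', 'author95']
```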

I'm not following the logic for the last one. It seems to me that * may not be desired inside {}. If that is the case, maybe you'd like to design an expression similar to:

^(?!(%\\(?:no)?cite\w*\{([^}]*?)\}))[^*\n]*$

I'm not sure, though.
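This expression matches whole lines rather than keys, so it would need a second pass to pull the keys out. A rough sketch of one way to combine it with the pattern from the question follows; the two-step approach and the name `keys_from_kept_lines` are assumptions, not part of the answer.

```python
import re

# Line filter from the answer: with re.MULTILINE it keeps lines that do not
# start with a commented-out citation and that contain no '*'.
line_rx = re.compile(r'^(?!(%\\(?:no)?cite\w*\{([^}]*?)\}))[^*\n]*$',
                     re.MULTILINE)
# Key pattern from the question.
cite_rx = re.compile(r'\\(?:no|)cite\w*\{(.*?)\}')

def keys_from_kept_lines(text):
    """Apply the key pattern only to lines accepted by the line filter."""
    keys = []
    for line in line_rx.finditer(text):
        for cite in cite_rx.finditer(line.group(0)):
            keys.append(cite.group(1))
    return keys

sample = "\\cite{a}\n%\\nocite{b}\n\\nocite{*}\n100\\% \\citep{c, d}\n"
print(keys_from_kept_lines(sample))  # ['a', 'c, d']
```

Be aware of the limits: a comment line is only rejected when the citation follows the % directly, and a citation inside a trailing comment (for example after "% " at the end of a line) would still be picked up.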

