Extracting multiple words from a string using regex

I am trying to extract all the references from part of a paper as a list. For now I've just got a paragraph and set it as a string.

I was wondering if it is possible to do this using regex in Python? I want to be able to extract multiple words from the string, but so far all I've been able to do is extract the years, single words, or characters, not an entire reference at once. There are also quite a lot of conditions, as the references can vary in format, for example:

text="As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) also researched ... Another example is ... (Valdes et al. 2008). Most notably .... Edwards, Bartlett & Stirling (2003)."
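For context, the partial result described above (years only) can be reproduced with a short sketch like the following. It assumes every year is exactly four digits, and the variable name years is only illustrative.

```python
import re

text = ("As shown by Macelroy et al. (1967), bla bla. "
        "Podar & Reysenbach (2006) also researched ... "
        "Another example is ... (Valdes et al. 2008). "
        "Most notably .... Edwards, Bartlett & Stirling (2003).")

# Four consecutive digits with a word boundary on each side; assumed to be a year.
years = re.findall(r"\b\d{4}\b", text)
print(years)
```

This finds each year but none of the author names attached to it, which is the gap the question is about.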

So some have the year within brackets, some are entirely enclosed in brackets, some have multiple capitalised words, some have "et al." and so on. Is it possible to define all of these requirements within one search, and then print them all out together?

I know there are websites or programs I can put the paper into to extract all the references for me, but I would like to know how to do it myself.

Thanks

NB: Edited to clarify how the references would be embedded in the string

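import re

# Sample text; note the line break inside the second reference.
t = """
As shown by Macelroy et al. (1967), bla bla. Podar
 & Reysenbach (2006) also researched ... Another example is ... (Valdes et al. 2008). Most notably .... Edwards, Bartlett & Stirling (2003).
"""

# Three capture groups: a capital letter, the text that follows it, and a four-digit year.
pattern = r"([A-Z])([^A-Z)]+|[^.,]+)([0-9]{4})"

# findall returns one tuple of groups per match; join them and drop the opening bracket.
f = ["".join(result).replace("(", "") for result in re.findall(pattern, t, re.S)]
print(f)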
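  1. ([A-Z]) matches a single capital letter, the start of a reference.
  2. ([^A-Z)]+|[^.,]+) covers two situations:

    • a run of characters that contains no capital letter and no closing bracket;
    • a run of characters that contains no comma or full stop, because allowing those could make the match swallow a whole sentence.
  3. ([0-9]{4}) matches the four digits of the year that ends the reference.

One consequence of excluding commas: a reference such as "Edwards, Bartlett & Stirling (2003)" is matched only from "Bartlett" onward.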
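For author lists separated by commas, one possible alternative is to describe the shape of a citation explicitly rather than excluding characters. The sketch below is not the method from the answer above. It assumes surnames are single capitalised words, that authors are joined by commas, "&" or "et al.", and that the year has four digits. The names CITATION and extract_citations are hypothetical.

```python
import re

# Assumed citation shape: Surname[, Surname]*[ & Surname][ et al.] [(]YYYY
CITATION = re.compile(r"""
    (?P<authors>
        \b[A-Z][A-Za-z'-]+               # first surname
        (?:,\s+[A-Z][A-Za-z'-]+)*        # further surnames after commas
        (?:\s+&\s+[A-Z][A-Za-z'-]+)?     # optional final surname after an ampersand
        (?:\s+et\s+al\.)?                # optional et al.
    )
    \s*\(?                               # optional opening bracket before the year
    (?P<year>\d{4})                      # four-digit year
""", re.VERBOSE)


def extract_citations(text):
    """Return each citation found in text as 'Authors (Year)'."""
    citations = []
    for m in CITATION.finditer(text):
        # Collapse line breaks and repeated spaces inside the author list.
        authors = " ".join(m.group("authors").split())
        citations.append(f"{authors} ({m.group('year')})")
    return citations


text = ("As shown by Macelroy et al. (1967), bla bla. "
        "Podar & Reysenbach (2006) also researched ... "
        "Another example is ... (Valdes et al. 2008). "
        "Most notably .... Edwards, Bartlett & Stirling (2003).")

print(extract_citations(text))
```

Capitalised words such as "As" or "Most" are tried as surnames but rejected because no year follows them. Formats outside the stated assumptions (multi-word surnames, "and" instead of "&", years with a letter suffix) would need extra alternatives in the pattern.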
