
Extract DOI (digital object identifier) from text

I have a block of text, and thousands more like it, containing references to various studies. One example looks like this:

txt = '<div>1. <em>Nationella riktlinjer för rörelseorganens sjukdomar</em> (Swedish National Guidelines). 2012, The National Board of Health and Welfare. doi:10.1097/BRS.0b013e31829ff095 https://www.socialstyrelsen.se/publikationer2012/2012-5-1</a></div><div>2. Jevsevar, D.S., et al., <em>The American Academy of Orthopaedic Surgeons evidence-based guideline on: treatment of osteoarthritis of the knee, 2nd edition.</em> J Bone Joint Surg Am, 2013. <strong>95</strong>(20): p. 1885-6. <a href="http://www.ncbi.nlm.nih.gov/pubmed/24288804" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/24288804</a></div><div>3. Namba, R.S., et al., <em>Obesity and perioperative morbidity in total hip and total knee arthroplasty patients.</em> J Arthroplasty, 2005. <strong>20</strong>(7 Suppl 3): p. 46-50. <a href="https://dx.doi.org/10.1016/j.arth.2005.04.023" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1016/j.arth.2005.04.023</a></div><div>4. Peter, W.F., et al., <em>Physiotherapy in hip and knee osteoarthritis: development of a practice guideline concerning initial assessment, treatment and evaluation.</em> Acta Reumatol Port, 2011. <strong>36</strong>(3): p. 268-81. <a href="http://www.ncbi.nlm.nih.gov/pubmed/22113602" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/22113602</a></div><div>5. Santoso, M.B. and L. Wu, <em>Unicompartmental knee arthroplasty, is it superior to high tibial osteotomy in treating unicompartmental osteoarthritis? A meta-analysis and systemic review.</em>&nbsp;J Orthop Surg Res, 2017. <strong>12</strong>(1): p. 50.&nbsp;<a href="https://dx.doi.org/10.1186/s13018-017-0552-9" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1186/s13018-017-0552-9</a></div><div>6. Management of osteoarthritis. NICE guidelines. NICE Pathway last updated: 22 January 2019. <a href="https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf</a></div><div>&nbsp;</div>'

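The text contains several DOIs, some as links and some introduced by the keyword doi:. How can I get all of them, perhaps in a list, for example: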

['doi:10.1097/BRS.0b013e31829ff095',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1186/s13018-017-0552-9',
]

I have looked up several regular expressions for this, but to no avail. For example:

import re
exp = "10.\\d{4,9}/[-._;()/:a-z0-9A-Z]+"
pattern = re.compile(exp)

pattern.findall(txt)

This returns an empty list.

Thanks to @wiktor-stribiżew, I got it working:

# Raw string with the dot escaped, so "10." matches a literal dot rather than any character.
exp = r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+"
pattern = re.compile(exp)

print(pattern.findall(txt))

['10.1097/BRS.0b013e31829ff095', '10.1016/j.arth.2005.04.023', '10.1016/j.arth.2005.04.023', '10.1186/s13018-017-0552-9', '10.1186/s13018-017-0552-9']
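The working pattern returns bare DOIs and repeats any DOI that appears both in an href and in the link text. If you want the list from the question instead, with the doi: or https://dx.doi.org/ prefix kept and duplicates dropped, something like the following might do. This is only a sketch: the names DOI_RE, extract_dois and keep_prefix are made up for illustration, and the set of accepted prefixes is an assumption based on the sample text.

```python
import re

# Optional prefix ("doi:" or a doi.org resolver URL), then the DOI itself.
# The DOI part is captured in group 1 so it can be returned without the prefix.
DOI_RE = re.compile(
    r"(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?"
    r"(10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+)"
)


def extract_dois(text, keep_prefix=True):
    """Return the DOIs found in text, in order of first appearance, without duplicates."""
    found = (
        m.group(0) if keep_prefix else m.group(1)
        for m in DOI_RE.finditer(text)
    )
    # dict.fromkeys keeps the first occurrence of each item and preserves order.
    return list(dict.fromkeys(found))


sample = (
    'doi:10.1097/BRS.0b013e31829ff095 '
    '<a href="https://dx.doi.org/10.1016/j.arth.2005.04.023">'
    'https://dx.doi.org/10.1016/j.arth.2005.04.023</a>'
)

print(extract_dois(sample))
# ['doi:10.1097/BRS.0b013e31829ff095', 'https://dx.doi.org/10.1016/j.arth.2005.04.023']
print(extract_dois(sample, keep_prefix=False))
# ['10.1097/BRS.0b013e31829ff095', '10.1016/j.arth.2005.04.023']
```

One caveat: the character class allows "." and ")" inside a DOI, so a DOI immediately followed by sentence punctuation would pick up that trailing character. It does not happen in the sample above, but it may be worth stripping trailing punctuation for free-form text.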

