
Extracting PDF links from a given list of links using regular expressions

I have a list of links stored as a list, but I need to extract only the PDF links.

    links = [ '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-b4df-16t9g8p93808.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>', '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-link-4ea4-8f1c-dd36a1f55d6f.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>']

So I need to extract only the links starting with 'https' and ending with '.pdf', as shown below:

    https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-b4df-16t9g8p93808.pdf

And store these links in a list. There are many PDF links in the variable 'links'; all of them need to be stored in a variable named 'pdf_links'.

Can anyone suggest a regular expression to extract these PDF links? I have used the regular expression below, but it's not working.

    pdf_regex = r""" (^<a\sclass="tablebluelink"\shref="(.)+.pdf"$)"""

Everybody will tell you that it's wrong to process HTML using regex. Instead of showing you how it can be done that way anyway, I would like to show you how easy it actually is to parse HTML with a library, e.g. the often-recommended BeautifulSoup 4.

To keep it simple and close to your sample code, I just flatten your input list. Usually, you would feed the raw HTML directly to the parser (e.g. see here).

    from bs4 import BeautifulSoup

    links = [ '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-b4df-16t9g8p93808.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>', '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-link-4ea4-8f1c-dd36a1f55d6f.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>']

    soup = BeautifulSoup(''.join(links), 'lxml')
    for link in soup.find_all('a', href=True):
        if link['href'].lower().endswith(".pdf"):
            print(link['href'])

Easy and straightforward, isn't it?
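
If you need the matching hrefs collected in a list named pdf_links, as the question asks, here is a minimal sketch of the same BeautifulSoup approach (assuming the same links variable as above):

    from bs4 import BeautifulSoup

    # Sketch: gather the matching hrefs into a list instead of printing them
    soup = BeautifulSoup(''.join(links), 'lxml')
    pdf_links = [a['href'] for a in soup.find_all('a', href=True)
                 if a['href'].lower().endswith('.pdf')]
    print(pdf_links)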

As Daniel Lee pointed out, regular expressions are not suitable for parsing HTML. However, as long as your HTML follows certain patterns for all cases, something like this should do the trick (obviously, just in a sandbox environment):

    import re

    pdf_links = map(lambda extracted_link: extracted_link.group(1),
                    filter(lambda extracted_link: extracted_link is not None,
                           map(lambda link: re.search(r'.*href=\"([^\"]+\.pdf)\".*',
                                                      link, re.IGNORECASE),
                               links)))
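
For readability, the same idea can also be written as a list comprehension. This is just a sketch assuming the links list from the question; note that in Python 3 the map/filter version above returns a lazy iterator rather than a list:

    import re

    # Sketch: search each link for an href ending in .pdf and keep the
    # captured URL whenever a match is found
    matches = (re.search(r'href="([^"]+\.pdf)"', link, re.IGNORECASE)
               for link in links)
    pdf_links = [m.group(1) for m in matches if m is not None]
    print(pdf_links)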

Firstly, you should NEVER parse HTML with regex.

"Parsing html with regex is like asking a beginner to write an operating system" “使用正则表达式解析html就像要求初学者编写操作系统一样”

This answer is famous and forever relevant: RegEx match open tags except XHTML self-contained tags

It's probably worthwhile to take an hour and learn how matching groups work in regex. But this may help:

Firstly, links is a list, which means you either need to loop through it or (in this case) take the first element (a loop over the whole list is sketched below).

Try:

    import re

    # Example pattern (an assumption based on the question's HTML):
    # capture the href value that ends in .pdf
    regex = r'<a class="tablebluelink" href="([^"]+\.pdf)"'
    r = re.match(regex, links[0])
    if r:
        print(r.group(1))
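
And since the question needs all of the PDF links collected into pdf_links, a minimal loop over the whole list (a sketch reusing the same example pattern, and assuming the links list from the question) could look like this:

    import re

    # Example pattern (an assumption based on the question's HTML)
    regex = r'<a class="tablebluelink" href="([^"]+\.pdf)"'

    pdf_links = []
    for link in links:
        r = re.match(regex, link)
        if r:
            pdf_links.append(r.group(1))
    print(pdf_links)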
