
Using startswith function to filter a list of urls

I have the following piece of code which extracts all links from a page and puts them in a list (links = []), which is then passed to the function filter_links(). I wish to filter out any links that are not from the same domain as the starting link, i.e. the first link in the list. This is what I have:

import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])


def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links


print(filter_links(links))

I have used the built-in startswith function, but it's filtering out everything except the starting url. Eventually I want to pass several different start urls through this program, so I need a generic way of filtering urls that are within the same domain as the starting url. I think I could use regex, but shouldn't this function work too?

Try this:

import requests
from bs4 import BeautifulSoup
import re
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links


print(filter_links(links))

Note:

  1. You need to move the return statement out of the for loop. As written, it returns after iterating over just one element, so only the first item in the list gets returned.
  2. Use the tldextract module to extract the domain name from the urls more reliably. If you want to explicitly check whether each link starts with links[0], that's up to you.

Output:

['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 
'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
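If pulling in tldextract is not an option, a similar check can be sketched with only the standard library's urllib.parse. This is a hedged alternative, not the answer above: it resolves relative links against the start url with urljoin, then compares hostnames, treating only a leading "www." as equivalent (str.removeprefix needs Python 3.9+):

```python
from urllib.parse import urljoin, urlparse


def same_domain(link, start_url):
    """Return True if link points at the same host as start_url.

    Relative links (e.g. "/contact-us/") are resolved against start_url
    first, so they count as same-domain. Only a bare "www." prefix is
    normalised away; subdomain handling is what tldextract does better.
    """
    start_host = urlparse(start_url).netloc.removeprefix("www.")
    link_host = urlparse(urljoin(start_url, link)).netloc.removeprefix("www.")
    return link_host == start_host


links = [
    "http://enzymebiosystems.org/contact-us/",
    "/leadership/about/",
    "https://twitter.com/example",
]
print([l for l in links if same_domain(l, "http://www.enzymebiosystems.org/")])
```

Here the first two links survive and the twitter link is dropped.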

Okay, so you made an indentation error in filter_links(links). The function should look like this:

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
    return filtered_links

Notice that in your code, you kept the return statement inside the for loop, so the loop executes once and then returns the list.

Hope this helps :)
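The difference the indentation makes is easy to demonstrate with a toy function (the data below is made up, not the question's links):

```python
def broken(items):
    out = []
    for item in items:
        out.append(item)
        return out  # inside the loop: returns during the first iteration


def fixed(items):
    out = []
    for item in items:
        out.append(item)
    return out  # after the loop: runs once all items are processed


print(broken([1, 2, 3]))  # [1]
print(fixed([1, 2, 3]))   # [1, 2, 3]
```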

Possible Solution

What if you kept all the links which 'contain' the domain?

For example:

import pandas as pd

# soup comes from the question's code above
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

all_links = pd.DataFrame(links, columns=["Links"])
enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]

# results in a dataframe with links containing "enzymebiosystems".

If you want to search multiple domains, see this answer.
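For the multiple-domain case, one hedged sketch is to pass a regex alternation to str.contains, since it accepts a regular expression by default. The domain list and links below are made up for illustration:

```python
import re

import pandas as pd

# Hypothetical links and the domains we want to keep.
links = pd.DataFrame({"Links": [
    "http://enzymebiosystems.org/contact-us/",
    "https://example.com/page",
    "https://twitter.com/share",
]})
domains = ["enzymebiosystems", "example.com"]

# Escape each domain (so "." matches literally), then join with "|".
pattern = "|".join(map(re.escape, domains))
kept = links[links.Links.str.contains(pattern)]
print(kept)
```

With the sample data, the enzymebiosystems and example.com rows are kept and the twitter row is dropped.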
