
Remove from list in Python 3

I have a list that I scraped from a website. I want to remove the links that are anchors for the various pages of the site, for example '/about/'. There are a number of them. Rather than writing separate loops that each remove items from the list, is there a way to write code that looks at each entry and adds it to a new list only if it contains "http"? (I mean "http" rather than just "https" as in the data below, in case the "s" is missing.) My list data is this:

['mailto:info@yourdomain.com', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', '/events/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'mailto:Info@yourdomain.com', '/about/', '/events/', '/news/', '/contact/', 'https://youtechassociates.com/', '/privacy-policy', '/terms-of-use', '/disclosure/']

You can use a list comprehension with a regex to filter out links that do not start with the protocol:

import re
[link for link in links if re.match(r'https?://', link)]

giving:

['https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://youtechassociates.com/']
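For a self-contained version of the regex approach, here is a minimal sketch run on a shortened sample of the list. The names `sample_links`, `HTTP_PREFIX` and `keep_http_links` are illustrative and not part of the original answer. Note that `re.match` only matches at the start of the string, so no `^` anchor is needed.

```python
import re

# Shortened sample of the scraped list (illustrative subset of the data above).
sample_links = [
    'mailto:info@yourdomain.com',
    'https://www.demodms.com/annuity/',
    '/events/',
    'https://youtechassociates.com/',
    '/privacy-policy',
]

# Matches "http://" or "https://"; re.match anchors at the start of the string.
HTTP_PREFIX = re.compile(r'https?://')


def keep_http_links(links):
    """Return only the entries that start with http:// or https://."""
    return [link for link in links if HTTP_PREFIX.match(link)]


print(keep_http_links(sample_links))
```

Compiling the pattern once is a small convenience when the same filter is applied to many lists; the inline `re.match(...)` form behaves the same way.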

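You can use filter for this. As written, the predicate below selects the entries that do not contain 'http' (that is, the ones to be dropped); remove the `not` to keep the http links instead.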

a = ['mailto:info@yourdomain.com', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', '/events/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'mailto:Info@yourdomain.com', '/about/', '/events/', '/news/', '/contact/', 'https://youtechassociates.com/', '/privacy-policy', '/terms-of-use', '/disclosure/']

b = filter(lambda x: 'http' not in x, a)
print(list(b))

Output:

['mailto:info@yourdomain.com', '/events/', 'mailto:Info@yourdomain.com', '/about/', '/events/', '/news/', '/contact/', '/privacy-policy', '/terms-of-use', '/disclosure/']
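To make the difference between the two predicates concrete, the sketch below splits a shortened sample of the list both ways. The variable names `sample`, `kept` and `dropped` are illustrative only. In Python 3, `filter` returns a lazy iterator, hence the `list(...)` calls.

```python
# Shortened sample of the scraped list (illustrative subset of the data above).
sample = [
    'mailto:info@yourdomain.com',
    'https://www.demodms.com/annuity/',
    '/events/',
    'https://youtechassociates.com/',
]

# Entries that contain 'http': the links the question wants to keep.
kept = list(filter(lambda x: 'http' in x, sample))

# Entries that do not contain 'http': the ones to drop, as in the answer above.
dropped = list(filter(lambda x: 'http' not in x, sample))

print(kept)
print(dropped)
```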

Here is a simple way to do this:

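# Shortened sample; use your full list as specified above
mlist = ['mailto:info@yourdomain.com', 'https://www.demodms.com/annuity/', '/about/']

newlist = []
for m in mlist:
    # Keep only entries that start with 'http' (this also covers 'https')
    if m.startswith('http'):
        newlist.append(m)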

I would go with a list comprehension and startswith():

full_links = [link for link in links if link.startswith('http://') or link.startswith('https://')]

I think this is clearer than a regex for such a simple task. Also, IMO you should check for http:// and https:// explicitly, because checking only for http might give you false positives if you meet relative links like http_stuff/foo.html.
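The false-positive case can be seen in a small sketch. The list `candidates` is made up for illustration (the `http://example.com/` entry and the relative link are not from the question's data), and `strict` and `loose` are hypothetical names. `str.startswith` also accepts a tuple of prefixes, which shortens the explicit check.

```python
# Made-up inputs for illustration; only the second URL comes from the question.
candidates = [
    'http://example.com/',
    'https://www.demodms.com/annuity/',
    'http_stuff/foo.html',  # relative link that merely begins with "http"
    '/about/',
]

# Explicit check: str.startswith accepts a tuple of prefixes.
strict = [c for c in candidates if c.startswith(('http://', 'https://'))]

# Loose check: also lets the relative link through.
loose = [c for c in candidates if c.startswith('http')]

print(strict)
print(loose)
```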
