
Remove from list in Python 3

I have a list of links that I scraped from a website. I want to remove the links that are just anchors for the various pages of the site, for example '/about/'. There are a number of them. Rather than writing a separate loop to remove each one, is there a way to check the text of each entry and add it to a new list only if it contains "http"? (I want to match "http" rather than "https", which is what the data below has, in case the "s" is missing.) My list data is this:

['mailto:info@yourdomain.com', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', '/events/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'mailto:Info@yourdomain.com', '/about/', '/events/', '/news/', '/contact/', 'https://youtechassociates.com/', '/privacy-policy', '/terms-of-use', '/disclosure/']

You can use a list comprehension with a regex to filter out links that do not start with the protocol:

import re

[link for link in links if re.match(r'https?://', link)]

giving:

['https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://youtechassociates.com/']
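For reference, here is a self-contained sketch of the regex approach. The short `links` sample is a subset of the list in the question, plus one invented `http://example.com/` entry (an assumption, added only to show that the "s" is optional). The names `pattern` and `full_links` are illustrative.

```python
import re

# Subset of the list from the question; 'http://example.com/' is an
# invented entry, added to show that the "s" in "https" is optional.
links = [
    'mailto:info@yourdomain.com',
    'https://www.demodms.com/annuity/',
    '/about/',
    'http://example.com/',
    '/privacy-policy',
]

# re.match only matches at the start of the string, so no '^' is needed.
# The raw string avoids invalid-escape warnings in Python 3.
pattern = re.compile(r'https?://')

full_links = [link for link in links if pattern.match(link)]
print(full_links)
```

Compiling the pattern once is optional here; it mainly keeps the comprehension readable.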

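You can use filter for this. Note that the version below keeps the entries that do not contain 'http' (that is, the ones being removed); drop the `not` to keep the full links instead: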

a = ['mailto:info@yourdomain.com', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', '/events/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'mailto:Info@yourdomain.com', '/about/', '/events/', '/news/', '/contact/', 'https://youtechassociates.com/', '/privacy-policy', '/terms-of-use', '/disclosure/']

b = filter(lambda x: 'http' not in x, a)
print(list(b))

Output:

['mailto:info@yourdomain.com', '/events/', 'mailto:Info@yourdomain.com', '/about/', '/events/', '/news/', '/contact/', '/privacy-policy', '/terms-of-use', '/disclosure/']
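If the goal is the list of full links rather than the entries that were dropped, the test can be inverted. A minimal sketch on a shortened copy of the list (the names `kept` and `removed` are illustrative, not from the answer):

```python
# Shortened copy of the list from the question.
a = [
    'mailto:info@yourdomain.com',
    'https://www.demodms.com/annuity/',
    '/events/',
    'https://youtechassociates.com/',
    '/terms-of-use',
]

# Entries containing 'http' anywhere in the string (the full links).
kept = list(filter(lambda x: 'http' in x, a))

# Entries without 'http' (mailto: links and site-relative paths).
removed = list(filter(lambda x: 'http' not in x, a))

print(kept)
print(removed)
```

Because `'http' in x` is a substring test, it would also match a relative path that merely contains "http" somewhere in it.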

Here is a simple way to do this:

# mlist is your list, as specified in the question

newlist = []
for m in mlist:
    if m.startswith('http'):
        newlist.append(m)
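The same loop can be wrapped in a function and tried on a small sample. This is only a sketch: the function name `keep_http_links` and the `http://example.com/` entry are assumptions made for illustration; the other entries come from the question's list.

```python
def keep_http_links(mlist):
    """Return the entries of mlist that start with 'http'."""
    newlist = []
    for m in mlist:
        # 'http' is a prefix of both 'http://' and 'https://',
        # so this check does not depend on the "s" being present.
        if m.startswith('http'):
            newlist.append(m)
    return newlist


sample = [
    'http://example.com/',               # invented entry without the "s"
    'https://www.demodms.com/annuity/',
    '/news/',
    'mailto:Info@yourdomain.com',
]

result = keep_http_links(sample)
print(result)
```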

I would go with a list comprehension and startswith():

full_links = [link for link in links if link.startswith('http://') or link.startswith('https://')]

I think this is clearer than a regex for such a simple task. Also, IMO you should test for http:// and https:// explicitly, because checking only for http might give you false positives if you meet relative links like http_stuff/foo.html.
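To make the false-positive case concrete, here is a small sketch comparing the two checks. The sample list is made up for illustration (only the first entry comes from the question), and it uses the fact that `str.startswith` accepts a tuple of prefixes, which is equivalent to chaining the two calls with `or`.

```python
# Illustrative sample; 'http_stuff/foo.html' is a relative link
# that happens to begin with the letters "http".
links = [
    'https://youtechassociates.com/',
    'http://example.com/',
    'http_stuff/foo.html',
    '/contact/',
]

# Loose check: any string beginning with 'http' passes.
loose = [link for link in links if link.startswith('http')]

# Strict check: the full scheme prefix is required.
strict = [link for link in links
          if link.startswith(('http://', 'https://'))]

print(loose)
print(strict)
```

The loose check lets the relative link through, while the strict check keeps only the two absolute URLs.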

