
removing first four and last four characters of strings in list, OR removing specific character patterns

I am brand new to Python and have been working with it for a few weeks. I have a list of strings and want to remove the first four and last four characters of each string. Alternatively, I would like to remove specific character patterns (not just individual characters).

I have been looking through the archives here but don't seem to find a question that matches this one. Most of the solutions I have found are better suited to removing specific characters.

Here's the strings list I'm working with:

sites=['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']

What I am trying to do is to isolate the domain names and get

[hattrick, google, wampum, newcom]

This question is NOT about isolating domain names from URLs (I have seen the questions about that), but rather about editing specific characters in strings in lists based upon location or pattern.

So far, I've tried .split, .translate, .strip but these don't seem appropriate for what I am trying to do because they either remove too many characters that match the search, aren't good for recognizing a specific pattern/grouping of characters, or cannot work with the location of characters within a string.

Any questions and suggestions are greatly appreciated, and I apologize if I'm asking this question the wrong way etc.

def remove_cruft(s):
    return s[4:-4]

sites=['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
[remove_cruft(s) for s in sites]

result:

['hattrick', 'google', 'wampum', 'newcom']

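If you know all of the substrings you want to strip out, you can use replace to get rid of them. This is useful if you're not sure that all of your URLs will start with "www.", or if the TLD isn't three characters long. Note that replace removes every occurrence of a substring, wherever it appears in the string, not only at the start or end.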

def remove_bad_substrings(s):
    badSubstrings = ["www.", ".com", ".net", ".museum"]
    for badSubstring in badSubstrings:
        s = s.replace(badSubstring, "")
    return s

sites=['www.hattrick.com', 'www.google.com', 
'www.wampum.net', 'www.newcom.com', 'smithsonian.museum']
[remove_bad_substrings(s) for s in sites]

result:

['hattrick', 'google', 'wampum', 'newcom', 'smithsonian']
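Because replace removes a substring wherever it occurs, a name such as 'shop.company.com' would come out as 'shoppany'. A minimal sketch of a variant that only touches the start and end of each string is shown here; the function name strip_affixes and its default prefix and suffix lists are made up for illustration and should be adjusted to your data.

```python
def strip_affixes(s, prefixes=("www.",), suffixes=(".com", ".net", ".museum")):
    """Remove at most one known prefix and one known suffix from s."""
    for prefix in prefixes:
        if s.startswith(prefix):
            s = s[len(prefix):]   # cut the prefix off the front only
            break
    for suffix in suffixes:
        if s.endswith(suffix):
            s = s[:-len(suffix)]  # cut the suffix off the end only
            break
    return s

sites = ['www.hattrick.com', 'www.google.com', 'www.wampum.net',
         'www.newcom.com', 'smithsonian.museum', 'shop.company.com']
print([strip_affixes(s) for s in sites])
# ['hattrick', 'google', 'wampum', 'newcom', 'smithsonian', 'shop.company']
```

On Python 3.9 or later, str.removeprefix and str.removesuffix do the same job for a single prefix or suffix.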

You could use the tldextract module, which is much more robust than parsing the strings yourself:

>>> sites=['www.hattrick.com', 'google.co.uk',
           'apps.s3.stackoverflow.com', 'whitehouse.gov']
>>> import tldextract
>>> [tldextract.extract(s).domain for s in sites]
['hattrick', 'google', 'stackoverflow', 'whitehouse']

Is this what you mean:

>>> sites=['nosubdomain.net', 'ohcanada.ca', 'www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
>>> print([x.split('.')[-2] for x in sites])
['nosubdomain', 'ohcanada', 'hattrick', 'google', 'wampum', 'newcom']
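The split approach takes the second-to-last dot-separated label, so it has two edge cases worth knowing about: a string with no dot raises IndexError, and a two-part suffix such as 'co.uk' yields 'co'. A small sketch that guards the first case follows; the helper name second_to_last_label is hypothetical.

```python
def second_to_last_label(host):
    """Return the label before the final dot-separated label, or host itself if there is no dot."""
    labels = host.split('.')
    return labels[-2] if len(labels) >= 2 else host

print(second_to_last_label('www.hattrick.com'))  # hattrick
print(second_to_last_label('localhost'))         # localhost (no IndexError)
print(second_to_last_label('google.co.uk'))      # co (not 'google')
```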

I'm not clear about your requirements for removing specific characters, but if all you want to do is remove the first and last four characters, you can use Python's built-in slicing:

s = s[4:-4]

This will give you the substring starting at index 4, up to but not including the 4th-last index of the string.

EDIT: here is a good question that provides lots of info about python's slice notation.
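To make the slice bounds concrete, here is a short sketch applied to the list from the question. It only uses standard slicing; the variable name trimmed is just for illustration.

```python
sites = ['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']

s = sites[0]
print(s[4:])    # drop the first four characters: hattrick.com
print(s[:-4])   # drop the last four characters:  www.hattrick
print(s[4:-4])  # drop both:                      hattrick

# Apply the same slice to every string in the list.
trimmed = [s[4:-4] for s in sites]
print(trimmed)  # ['hattrick', 'google', 'wampum', 'newcom']

# Slicing does not raise on strings that are too short; it returns what is left.
print(repr('abc'[4:-4]))  # ''
```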

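Going by the title of your question, this shows how to pick out the first four and last four characters of each string, though it may not be what you are looking for.

for site in sites:
    print(site[:4])   # first four characters, e.g. 'www.'
    print(site[-4:])  # last four characters, e.g. '.com' or '.net'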

You could also use regex:

import re
re.sub(r'^www\.', '', sites[0])   # removes a leading 'www.' if present
re.sub(r'\.\w+$', '', sites[0])   # removes the last dot and everything after it
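Each re.sub call returns a new string rather than changing sites[0], so the result has to be kept. As a sketch, the two patterns can also be combined with an alternation and applied to the whole list; the names AFFIXES and domains are made up for illustration.

```python
import re

# Match either a leading 'www.' or the final dot plus the word characters after it.
AFFIXES = re.compile(r'^www\.|\.\w+$')

sites = ['www.hattrick.com', 'www.google.com', 'www.wampum.net', 'www.newcom.com']
domains = [AFFIXES.sub('', s) for s in sites]
print(domains)  # ['hattrick', 'google', 'wampum', 'newcom']
```

Unlike a fixed slice, this also works for strings without the 'www.' prefix or with a suffix that is not three letters long, although it still removes only the last dot-separated part.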
