Slice a string after a certain phrase?

I've got a batch of strings that I need to cut down. They're basically a descriptor followed by codes. I only want to keep the descriptor.

'a descriptor dps 23 fd'
'another 23 fd'
'and another fd'
'and one without a code'

The codes above are dps, 23 and fd. They can come in any order, are unrelated to each other and might not exist at all (as in the last case).

The list of codes is fixed (or can be predicted, at least), so assuming a code is never used within a legitimate descriptor, how can I strip off everything from the first instance of a code onward?

I'm using Python.

The short answer, as @THC4K points out in a comment:

string.split(pattern, 1)[0]

where string is your original string, pattern is your "break" pattern, 1 indicates to split no more than 1 time, and [0] means take the first element returned by split.

In action:

>>> s = "a descriptor 23 fd"
>>> s.split("23", 1)[0]
'a descriptor '
>>> s.split("fdasfdsafdsa", 1)[0]
'a descriptor 23 fd'

This is a much shorter way of expressing what I had written earlier, which I will keep here anyway.

And if you need to remove multiple patterns, this is a great candidate for reduce (a builtin in Python 2; in Python 3 it lives in functools):

>>> from functools import reduce
>>> string = "a descriptor dps foo 23 bar fd quux"
>>> patterns = ["dps", "23", "fd"]
>>> reduce(lambda s, pat: s.split(pat, 1)[0], patterns, string)
'a descriptor '
>>> reduce(lambda s, pat: s.split(pat, 1)[0], patterns, "uiopuiopuiopuipouiop")
'uiopuiopuiopuipouiop'

This basically says: for each pat in patterns, take string and apply string.split(pat, 1)[0] (as explained above), operating on the result of the previous step each time. As you can see, if none of the patterns is in the string, the original string is returned unchanged.
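
For readers who find reduce hard to follow, the same idea can be sketched as a plain loop. The function name strip_codes is made up for this illustration, and the code is written for Python 3:

```python
def strip_codes(text, codes):
    """Plain-loop form of the reduce call: cut text at each code in turn."""
    for code in codes:
        # Split at most once and keep what precedes the code.
        # If the code is absent, split returns [text] and nothing changes.
        text = text.split(code, 1)[0]
    return text


print(repr(strip_codes("a descriptor dps foo 23 bar fd quux", ["dps", "23", "fd"])))
```

Note that this matches substrings, not whole words, so a code that happens to appear inside a longer word (for example "fd" inside "halfday") would also trigger a cut.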


The simplest answer is a string slice combined with str.find:

>>> s = "a descriptor 23 fd"
>>> s[:s.find("fd")]
'a descriptor 23 '
>>> s[:s.find("23")]  
'a descriptor '
>>> s[:s.find("gggfdf")] # <-- look out! last character got cut off
'a descriptor 23 f'

A better approach (to avoid cutting off the last character in a missing pattern when s.find returns -1) might be to wrap in a simple function:

>>> def cutoff(string, pattern):
...     idx = string.find(pattern)
...     return string[:idx if idx != -1 else len(string)]
... 
>>> cutoff(s, "23")
'a descriptor '
>>> cutoff(s, "asdfdsafdsa")
'a descriptor 23 fd'

The [:s.find(x)] syntax means take the part of the string from index 0 up to (but not including) the index on the right-hand side of the colon; in this case, the RHS is the result of s.find, which returns the index of the first occurrence of the substring you passed, or -1 if it is not found.
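
As an aside, str.partition may be another way to sidestep the -1 problem, since it returns the whole string as its first element when the separator is missing. A small sketch, written for Python 3 and reusing the s from the examples above:

```python
s = "a descriptor 23 fd"

# partition returns (head, separator, tail); head is everything before the first match
head, sep, tail = s.partition("23")
print(repr(head))

# when the separator is missing, head is the whole string and sep is empty
head_missing, sep_missing, _ = s.partition("gggfdf")
print(repr(head_missing))
```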

You seem to be describing something like this:

def get_descriptor(text):
    codes = ('12', 'dps', '23')
    for c in codes:
        try:
            return text[:text.index(c)].rstrip()
        except ValueError:
            continue

    raise ValueError("No descriptor found in `%s'" % (text))

Eg,

>>> get_descriptor('a descriptor dps 23 fd')
'a descriptor'

codes = ('12', 'dps', '23')

def get_descriptor(text):
    words = text.split()
    for c in codes:
        if c in words:
            i = words.index(c)
            return " ".join(words[:i])
    raise ValueError("No code found in `%s'" % (text))
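
Both functions above walk the codes in tuple order, so they cut at the first code listed rather than the first code that appears in the text, and they raise ValueError when no code is present. If the goal is to cut at the earliest code and keep code-free strings as they are (as in the question's last example), one possible variation is to walk the words instead. This is only a sketch for Python 3: the name descriptor_before_first_code is hypothetical, the code list is taken from the question, and codes are assumed to be whitespace-separated words:

```python
CODES = {"dps", "23", "fd"}  # codes from the question; assumed to be whole words


def descriptor_before_first_code(text, codes=CODES):
    words = text.split()
    for i, word in enumerate(words):
        if word in codes:
            # Stop at the first word that is a code and keep what came before it.
            return " ".join(words[:i])
    # No code present: return the text unchanged.
    return text


for example in ("a descriptor dps 23 fd", "another 23 fd",
                "and another fd", "and one without a code"):
    print(repr(descriptor_before_first_code(example)))
```

Rejoining the words with single spaces normalizes any repeated whitespace inside the descriptor, which may or may not be what you want.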

I'd probably use a regular expression to do this:

>>> import re
>>> descriptors = ('foo x', 'foo y', 'bar $', 'baz', 'bat')
>>> data = ['foo x 123', 'foo y 123', 'bar $123', 'baz 123', 'bat 123', 'nothing']
>>> p = re.compile("(" + "|".join(map(re.escape, descriptors)) + ")")
>>> for s in data:
...     m = re.match(p, s)
...     if m: print(m.groups()[0])
...
foo x
foo y
bar $
baz
bat

It wasn't entirely clear to me whether you want what you're extracting to include text that precedes the descriptors, or if you expect each line of text to start with a descriptor; the above deals with the latter. For the former, just change the pattern slightly to make it capture all characters before the first occurrence of a descriptor:

>>> p = re.compile("(.*(" + "|".join(map(re.escape, descriptors)) + "))")
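
If the fixed list is the list of codes rather than of descriptors, as the question describes, a similar pattern can be built from the codes and used to split the string. The following is a sketch for Python 3 under that reading; the helper name strip_codes_re is made up, and the \b word boundaries are an added assumption that codes always stand alone as words:

```python
import re

codes = ("dps", "23", "fd")  # codes from the question

# \b keeps a code from matching inside a longer word
code_re = re.compile(r"\b(?:" + "|".join(map(re.escape, codes)) + r")\b")


def strip_codes_re(text):
    # Split once at the first code; element 0 is everything before it.
    return code_re.split(text, maxsplit=1)[0].rstrip()


for example in ("a descriptor dps 23 fd", "and one without a code"):
    print(repr(strip_codes_re(example)))
```

Because the alternation is scanned left to right over the text, the split happens at whichever code occurs first in the string, regardless of the order of the codes tuple.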

Here's an answer that works for all codes rather than forcing you to call the function for each code, and is a bit simpler than some of the answers above. It also works for all of your examples.

strings = ('a descriptor dps 23 fd', 'another 23 fd', 'and another fd',
           'and one without a code')
codes = ('dps', '23', 'fd')

def strip(s):
    try:
        return s[:min(s.find(c) for c in codes if c in s)]
    except ValueError:
        return s

print(list(map(strip, strings)))

Output:

['a descriptor ', 'another ', 'and another ', 'and one without a code']

I believe this satisfies all of your criteria.

Edit: I quickly realized you could remove the try/except if you'd rather not rely on the exception:

def strip(s):
    if not any(c in s for c in codes):
        return s
    return s[:min(s.find(c) for c in codes if c in s)]
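
def crop_string(string, pattern):
    # Assumes pattern holds exactly two delimiters, [opening, closing],
    # and that both occur in the string.
    del_items = []
    for indx, val in enumerate(pattern):
        a = string.split(val, 1)
        # indx 0 keeps the text before the opening delimiter,
        # indx 1 keeps the text after the closing delimiter
        del_items.append(a[indx])

    # remove those leading and trailing pieces from the string
    for del_item in del_items:
        string = string.replace(del_item, "")
    return string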

example:

I want to crop the string and get only the array out of it.

strin = "crop the array [1,2,3,4,5]"
pattern = ["[", "]"]

usage:

a = crop_string(strin, pattern)
print(a)

# --- Prints "[1,2,3,4,5]"
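
For comparison, the same extraction could probably be done more directly with str.find and str.rfind. This is only a sketch for Python 3; the helper name between is made up, and it assumes you want everything from the first opening delimiter to the last closing one, delimiters included:

```python
def between(text, start, end):
    """Return the part of text from the first start to the last end, inclusive."""
    i = text.find(start)
    j = text.rfind(end)
    if i == -1 or j == -1 or j < i:
        # One of the delimiters is missing, or they appear in the wrong order.
        return ""
    return text[i:j + len(end)]


print(between("crop the array [1,2,3,4,5]", "[", "]"))
```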
