
Extract a number from a string, after a certain character

Ayyyy, I need some help. I have the following strings, always in "char,num" format:

s = "abcdef,12"
v = "gbhjjj,699"

I want to get just the digits after the comma. How do I do that without splitting the string with the comma as a delimiter?

I tried s[-2:] and v[-3:], which work, but how do I make it work without knowing the number of digits?

Assuming:

  • You know there is a comma in the string, so you don't have to search the entire string to find out if there is or not.
  • You know the pattern is 'many_not_digits,few_digits' so there is a big imbalance between the size of the left/right parts either side of the comma.
  • You can get to the end of the string without walking it, which you can in Python because string indexing is constant time.

Then you could start from the end and walk backwards looking for the comma, which would be less overall work for your examples than walking from the left looking for the comma.

Doing work in Python code is way slower than using Python engine code written in C, right? So would it really be faster?

  1. Make a string "aaaaa....,12".
  2. Use the timeit module to compare each approach: split, or right-walk.
  3. By default, timeit runs the code a million times.
  4. Extend the length of "aaaaaaaaaaaaaaaa....,12" to make it extreme.

How do they compare?

  • String split: 1400 "a"'s run a million times took 1 second.
  • String split: 4000 "a"'s run a million times took 2 seconds.
  • Right walk: 1400 "a"'s run a million times took 0.4 seconds.
  • Right walk: 999,999 "a"'s run a million times took ... 0.4 seconds.

!

from timeit import timeit

_split = """num = x.split(',')[-1]"""

_rwalk = """
i=-1
while x[i] != ',':
    i-=1
num = x[i+1:]
"""

print(timeit(_split, setup='x="a"*1400 + ",12"'))
print(timeit(_rwalk, setup='x="a"*999999 + ",12"'))

e.g.

1.0063155219977489     # "aaa...,12" for 1400 chars, string split
0.4027107510046335     # "aaa...,12" for 999999 chars, rwalked. Faster.

Try it online at repl.it
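If you want to repeat the comparison at several lengths, something like the following rough sketch could work. The helper name `compare` and the reduced `number=1000` are my own choices for illustration, not part of the measurements above, so the absolute timings will differ from the ones quoted.

```python
from timeit import timeit

SPLIT = "num = x.split(',')[-1]"

RWALK = """
i = -1
while x[i] != ',':
    i -= 1
num = x[i+1:]
"""


def compare(lengths, number=1000):
    # Hypothetical helper: returns {length: (split_seconds, rwalk_seconds)}
    # for inputs of the form "aaa...,12" with `length` leading "a" characters.
    results = {}
    for n in lengths:
        setup = 'x = "a" * %d + ",12"' % n
        results[n] = (
            timeit(SPLIT, setup=setup, number=number),
            timeit(RWALK, setup=setup, number=number),
        )
    return results


for n, (t_split, t_rwalk) in compare([1400, 4000, 999999]).items():
    print(n, t_split, t_rwalk)
```

The numbers depend on the machine and Python version, so treat them as relative rather than absolute.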

I don't think this is algorithmically better than O(n), but under the assumptions above you know more than str.split() does, and can use that to skip walking most of the string and beat it in practice. The longer the text part and the shorter the digit part, the more you benefit.
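For what it's worth, the same right-to-left search can also be handed to a built-in string method instead of a Python loop. A minimal sketch, assuming the last comma is the one you care about; `digits_after_last_comma` is a made-up name, and I have not timed it against the loop:

```python
def digits_after_last_comma(text):
    # str.rfind searches from the right and returns the index of the
    # last comma, or -1 if the string contains no comma at all.
    i = text.rfind(',')
    if i == -1:
        raise ValueError("no comma in %r" % text)
    return text[i + 1:]


print(digits_after_last_comma("abcdef,12"))    # 12
print(digits_after_last_comma("gbhjjj,699"))   # 699
```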

If you are worried about using split from the left because of lots of unwanted characters in the beginning, use rsplit.

s = "abcdef,12"
s.rsplit(",", 1)[-1]

Here, rsplit starts splitting the string from the right, and the optional second argument (maxsplit=1) makes it stop after the first comma it encounters.

For example:
s = "abc,def,12"
s.rsplit(",", 1)[-1]
# Outputs 12
s = "abcdef12"
s.rsplit(",", 1)[-1]
# Outputs abcdef12
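To make the effect of that second argument easier to see, here is a small sketch that prints the whole list each call returns rather than just the last element:

```python
s = "abc,def,12"

print(s.split(","))                 # ['abc', 'def', '12']
print(s.rsplit(",", 1))             # ['abc,def', '12']   one split, from the right
print(s.split(",", 1))              # ['abc', 'def,12']   one split, from the left
print("abcdef12".rsplit(",", 1))    # ['abcdef12']        no comma, single item
```

In every case `[-1]` picks the last item, which is why a string without a comma comes back unchanged.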

This is a lot simpler and cleaner way to get the string of digits at the end than doing anything manually.

Not to mention, it makes it a lot easier to check whether we got only digits, even for a list of strings.

def get_numbers(string_list, skip_on_error=True):
    numbers_list = []
    for input_string in string_list:
        the_number = input_string.rsplit(",", 1)[-1]
        if the_number.isdigit():
            numbers_list.append(the_number)
        elif skip_on_error:
            numbers_list.append("")
        else:
            raise Exception("Wrong Format occurred: %s" % (input_string))
    return numbers_list

And if you want to optimize further and are sure that most (if not all) strings will be in the correct format, you can use try/except instead, provided you want a list of integers rather than a list of strings. Like this:

# Instead of the if.. elif.. else construct
try:
    numbers_list.append(int(the_number))
except ValueError:
    if skip_on_error:
        numbers_list.append(0)
    else:
        raise Exception("Wrong Format occurred: %s" % (input_string))
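Put together, the try/except variant might look like the sketch below. The function name `get_int_numbers` is only illustrative; the body is the earlier loop with the fragment above dropped in.

```python
def get_int_numbers(string_list, skip_on_error=True):
    numbers_list = []
    for input_string in string_list:
        # Take whatever follows the last comma (or the whole string
        # if there is no comma).
        the_number = input_string.rsplit(",", 1)[-1]
        try:
            numbers_list.append(int(the_number))
        except ValueError:
            if skip_on_error:
                # Placeholder for malformed input, as in the fragment above.
                numbers_list.append(0)
            else:
                raise Exception("Wrong Format occurred: %s" % (input_string))
    return numbers_list


print(get_int_numbers(["abcdef,12", "gbhjjj,699", "oops"]))   # [12, 699, 0]
```

One difference to keep in mind: int() accepts some strings that isdigit() rejects, such as "-5" or " 12", so the two versions are not strictly equivalent.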

But always remember the Zen of Python; using split/rsplit follows these principles:

  1. Beautiful is better than ugly
  2. Explicit is better than implicit
  3. Simple is better than complex
  4. Readability counts
  5. There should be one-- and preferably only one --obvious way to do it

And also remember Donald Knuth:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Using split is superior because it is very clear and fast:

>>> s = "abcdef,12"
>>> s.split(',')[1]
'12'

Another way is with index or find:

>>> s = "abcdef,12"
>>> s[s.find(',')+1:]
'12'
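The two methods behave differently when there is no comma, which may matter if the input is not guaranteed to be well formed. A short sketch:

```python
s = "abcdef12"   # no comma in this one

# find() returns -1 when nothing is found, so the slice starts at 0
# and silently returns the whole string.
print(s.find(','))             # -1
print(s[s.find(',') + 1:])     # abcdef12

# index() raises ValueError instead, so the problem is not hidden.
try:
    s[s.index(',') + 1:]
except ValueError as exc:
    print("index() raised:", exc)
```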

And another way with re:

>>> import re
>>> s = "abcdef,12"
>>> re.search(r',(.*)', s).group(1)
'12'
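Note that `,(.*)` captures everything after the first comma, digits or not. If you only want trailing digits, a stricter pattern is one option. This is a sketch; the pattern is my own suggestion rather than part of the answer above:

```python
import re

# A comma, then one or more digits, then the end of the string.
pattern = re.compile(r',(\d+)$')

for s in ("abcdef,12", "abc,def,12", "abcdef,12x"):
    m = pattern.search(s)
    print(s, '->', m.group(1) if m else None)

# The looser pattern keeps everything after the first comma:
print(re.search(r',(.*)', "abc,def,12").group(1))   # def,12
```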

And with csv (and io so I don't have to write a file to the hard drive):

>>> import csv
>>> import io
>>> s = "abcdef,12"
>>> i = io.StringIO(s)
>>> r = csv.reader(i)
>>> for line in r:
...     print(line[1])
...
12
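The csv route is probably overkill for a single string, but it may pay off when there are many lines or when the text part can itself contain a quoted comma. A small sketch with made-up sample data:

```python
import csv
import io

# Three "text,number" lines; the last one has a quoted comma in the text part.
data = io.StringIO('abcdef,12\ngbhjjj,699\n"a,b",7\n')

for row in csv.reader(data):
    # row[-1] is the last field on the line.
    print(row[-1])
```

This prints 12, 699 and 7, one per line.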

I'm sure there are other ways to accomplish this task. This is just a small sample.

Maybe you can try with a regular expression:

import re

input_strings = ["abcdef,12", "gbhjjj,699"]

matcher = re.compile(r"\d+$")  # raw string, so \d is not treated as a string escape

for input_string in input_strings:
    is_matched = matcher.search(input_string)
    if is_matched:
        print(is_matched.group())

I like .partition() for this kind of thing:

for text in ('gbhjjj,699', 'abcdef,12'):
    x, y, z = text.partition(',')
    number = int(z)
    print(number)

Unlike .split() it will always return three values.

I'll sometimes do this to emphasize that I don't care about certain values:

_, _, z = text.partition(',')
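Because the separator comes back as the middle value, it also tells you whether a comma was found at all. Here is a sketch using rpartition, which splits at the last comma instead of the first; `number_after_comma` is a hypothetical helper name:

```python
def number_after_comma(text):
    # rpartition returns (head, sep, tail); sep is '' when no comma exists.
    _, sep, tail = text.rpartition(',')
    if not sep:
        return None
    # int() still raises ValueError if the tail is not a number.
    return int(tail)


print(number_after_comma('gbhjjj,699'))   # 699
print(number_after_comma('a,b,12'))       # 12
print(number_after_comma('abcdef12'))     # None
```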

