Extract first digit sequence from string containing digits, non-digits and then digits

Question

I have a column in a Pandas dataframe that contains values as follows:

111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA

I need to extract only the first sequence of digits in each row, not all of the digits in the row. So the output would be like this:

111042345
111042345
110374217
109202817

I thought the best way to achieve that would be to split the strings by digits and return the result, but that would also give me the unwanted digits that come after the non-digit characters.

Answer 1

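Use str.extract with the regex \d to extract digits; {,5} matches at most the first 5 digits and + matches the whole run of digits:

# raw strings keep \d from being treated as a string escape;
# expand=False makes str.extract return a Series
df['first_5_digits'] = df['Col'].str.extract(r'(\d{,5})', expand=False)
df['all_digits'] = df['Col'].str.extract(r'(\d+)', expand=False)
print(df)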
                       Col first_5_digits all_digits
0              111042345--          11104  111042345
1                111042345          11104  111042345
2    110374217dclid=CA-R3K          11037  110374217
3  109202817lciz@MM10082IA          10920  109202817

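As @Jon Clements pointed out, it is also possible to extract the first N digits by slicing the result:

# expand=False returns a Series, which has the .str accessor needed for slicing
df['first_5_digits'] = df['Col'].str.extract(r'(\d+)', expand=False).str[:5]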
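
If the digits must sit at the very start of the string, the pattern can also be anchored with ^. The following is a minimal sketch, assuming a column named Col as in this answer; the last two rows and the two new column names are hypothetical additions, used only to show how the anchored and unanchored patterns differ.

```python
import pandas as pd

# Sample values from the question, plus two hypothetical rows (not in the
# original data) that do not start with a digit.
df = pd.DataFrame({"Col": [
    "111042345--",
    "111042345",
    "110374217dclid=CA-R3K",
    "109202817lciz@MM10082IA",
    "abc123",      # hypothetical: digits appear only after letters
    "no digits",   # hypothetical: no digits at all
]})

# ^ anchors the match at the start of the string; rows that do not
# begin with a digit become NaN.
df["leading_digits"] = df["Col"].str.extract(r"^(\d+)", expand=False)

# Without the anchor, the first run of digits anywhere in the string is used.
df["first_digits_anywhere"] = df["Col"].str.extract(r"(\d+)", expand=False)

print(df)
```

For the four rows from the question both columns agree; they differ only for values such as "abc123", where the anchored pattern yields NaN.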

Answer 2

You can solve this with itertools.takewhile:

In pandas:

data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()

import pandas as pd
from itertools import takewhile

df = pd.DataFrame(data)

df["numbers"] = df[0].apply(lambda x: ''.join(takewhile(str.isdigit, x)))
print(df)

Output (Pandas):

                         0    numbers
0              111042345--  111042345
1                111042345  111042345
2    110374217dclid=CA-R3K  110374217
3  109202817lciz@MM10082IA  109202817

For normal lists:

data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()

print(data)

from itertools import takewhile

d = [''.join(takewhile(str.isdigit, text)) for text in data]

print(d)

Output (simple lists):

# split data
['111042345--', '111042345', '110374217dclid=CA-R3K', '109202817lciz@MM10082IA']

# itertools.takewhile
['111042345', '111042345', '110374217', '109202817']

Edge case:

If you need negative numbers or decimals, you have to replace str.isdigit with another (possibly self-written) function that also accepts signs and decimal points; see, for example, "What's the difference between str.isdigit, isnumeric and isdecimal in python?"
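
A per-character predicate is awkward for signs and decimal points, because takewhile cannot tell where a sign or a dot is allowed. One possible alternative, shown here as a sketch that swaps in a regular expression instead of takewhile, is to match an optional sign, digits, and an optional fractional part at the start of the string. The helper name leading_number, the pattern, and the sample strings are assumptions for illustration, not part of the original answer.

```python
import re

# Optional sign, one or more digits, optional fractional part.
# Pattern.match only matches at the beginning of the string.
LEADING_NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")

def leading_number(text):
    """Return the signed or decimal number at the start of text, or '' if none."""
    match = LEADING_NUMBER.match(text)
    return match.group(0) if match else ""

# Hypothetical sample strings
samples = ["-12.5abc34", "+7x", "3.14.15", "42--", "abc"]
numbers = [leading_number(s) for s in samples]
print(numbers)  # ['-12.5', '+7', '3.14', '42', '']
```

The pattern deliberately ignores forms such as ".5" or "1e3"; it would need to be extended if those should count as numbers.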

Suggested by Scott Boston as a more efficient alternative:

df["faster numbers"] = [''.join(takewhile(str.isdigit, i)) for i in df[0]]

(Similar output, different column header.)

Answer 3

It can be solved using regex:

import re
data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()
# raw string so that \d is passed to the regex engine unchanged
output = "\n".join([re.findall(r'\d+', str(d))[0] for d in data])
print(output)

Output:

111042345
111042345
110374217
109202817
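
Note that re.findall(...)[0] raises an IndexError for a string that contains no digits. A minimal sketch of a more defensive variant uses re.search and falls back to a default value; the helper name first_digits, its default parameter, and the extra "no digits here" row are assumptions added for illustration.

```python
import re

def first_digits(text, default=""):
    """Return the first run of digits in text, or default if there is none."""
    match = re.search(r"\d+", str(text))
    return match.group(0) if match else default

data = [
    "111042345--",
    "111042345",
    "110374217dclid=CA-R3K",
    "109202817lciz@MM10082IA",
    "no digits here",  # hypothetical row without any digits
]

result = [first_digits(d) for d in data]
print(result)
```

re.search stops at the first match, so it also avoids collecting the later digit runs that re.findall would build into a list.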

3 answers

solution1 3 ACCEPTED 2019-02-08 15:03:28
solution2 1 2019-02-08 15:04:02
solution3 0 2019-02-08 15:25:12
