
Extract first digit sequence from string containing digits, non-digits and then digits

I have a column in a Pandas dataframe that contains values as follows:

111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA

I need to extract just the first sequence of digits in each row - not all of the digits in the row. So the output would be like this:

111042345
111042345 
110374217 
109202817

My first idea was to split each string on its digits and return the result, but that would also give me the unwanted digits that come after the non-digit characters.

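Use str.extract with a regex: \d matches a single digit, {,5} limits the match to at most 5 digits, and + matches one or more digits, so (\d+) captures the whole first run of digits:

# raw strings keep the backslash in \d literal; expand=False returns a Series
df['first_5_digits'] = df['Col'].str.extract(r'(\d{,5})', expand=False)
df['all_digits'] = df['Col'].str.extract(r'(\d+)', expand=False)
print(df)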
                       Col first_5_digits all_digits
0              111042345--          11104  111042345
1                111042345          11104  111042345
2    110374217dclid=CA-R3K          11037  110374217
3  109202817lciz@MM10082IA          10920  109202817

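As @Jon Clements pointed out, it is also possible to extract the first N digits by slicing the extracted strings:

# expand=False returns a Series, which is what the .str accessor needs
df['first_5_digits'] = df['Col'].str.extract(r'(\d+)', expand=False).str[:5]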
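
The patterns in this answer behave the same on the sample data, but they differ on strings that do not start with a digit. The sketch below uses the standard re module rather than pandas to show what each pattern captures; the helper first_match and the sample "id=42abc7" are made up for illustration and are not part of the question. The ^(\d+) variant is an extra assumption for the case where only leading digits should count.

```python
import re

# The third sample is hypothetical: it does not start with a digit.
samples = ["111042345--", "110374217dclid=CA-R3K", "id=42abc7"]

def first_match(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# {,5} allows zero digits, so it matches an empty string at position 0
up_to_five = [first_match(r"(\d{,5})", s) for s in samples]
# + needs at least one digit, so the search moves on to the first run
first_run = [first_match(r"(\d+)", s) for s in samples]
# ^ pins the match to the start of the string
leading_run = [first_match(r"^(\d+)", s) for s in samples]

print(up_to_five)   # ['11104', '11037', '']
print(first_run)    # ['111042345', '110374217', '42']
print(leading_run)  # ['111042345', '110374217', None]
```

If a value without leading digits should come back as missing rather than as digits found later in the string, the anchored pattern is probably the one to pass to str.extract.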

You can also solve this with itertools.takewhile, which yields characters for as long as they are digits and stops at the first non-digit:

In pandas:

data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()

import pandas as pd
from itertools import takewhile

df = pd.DataFrame(data)

# keep the leading characters of each value while they are digits
df["numbers"] = df[0].apply(lambda x: ''.join(takewhile(str.isdigit, x)))
print(df)

Output (Pandas):

                         0    numbers
0              111042345--  111042345
1                111042345  111042345
2    110374217dclid=CA-R3K  110374217
3  109202817lciz@MM10082IA  109202817

For normal lists:

data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()

print(data)

from itertools import takewhile

d = [''.join(takewhile(str.isdigit, text)) for text in data]

print(d)

Output (simple lists):

# split data
['111042345--', '111042345', '110374217dclid=CA-R3K', '109202817lciz@MM10082IA']

# itertools.takewhile
['111042345', '111042345', '110374217', '109202817']

Edge case:


Suggested by Scott Boston as a more efficient alternative (a list comprehension instead of apply):

df["faster numbers"] = [''.join(takewhile(str.isdigit, i)) for i in df[0]]

(Same output, under a different column header.)
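
One thing to keep in mind with takewhile: it only looks at the start of each value, and it iterates over the value, so a missing entry such as None or NaN would raise a TypeError. The following is a minimal sketch of a guarded helper; the name leading_digits and the extra sample values are assumptions for illustration, not part of the original answer.

```python
from itertools import takewhile

def leading_digits(value):
    """Return the digits at the start of value, or '' if there are none."""
    # Hypothetical guard: a real column may hold None or NaN instead of text
    if not isinstance(value, str):
        return ""
    return "".join(takewhile(str.isdigit, value))

values = ["109202817lciz@MM10082IA", "abc123", "", None, float("nan")]
result = [leading_digits(v) for v in values]
print(result)  # ['109202817', '', '', '', '']
```

Note that "abc123" gives an empty string here, because takewhile stops at the first non-digit character, whereas a \d+ search would return 123.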

It can be solved using regex:

import re
data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()
# findall returns every digit run; [0] keeps the first one in each string
output = "\n".join([re.findall(r'\d+', str(d))[0] for d in data])
print(output)

Output:

111042345
111042345
110374217
109202817
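
The findall(...)[0] form raises an IndexError for any string that contains no digits at all. A possible variant, sketched under the assumption that an empty string is an acceptable fallback, uses re.search, which also stops at the first match instead of collecting every run. The names FIRST_DIGITS and first_digit_run and the third sample are hypothetical.

```python
import re

FIRST_DIGITS = re.compile(r"\d+")

def first_digit_run(text, default=""):
    """Return the first run of digits in text, or default if there is none."""
    match = FIRST_DIGITS.search(str(text))
    return match.group(0) if match else default

# The last sample is made up to show the no-digits case.
data = ["111042345--", "110374217dclid=CA-R3K", "no digits here"]
output = [first_digit_run(d) for d in data]
print(output)  # ['111042345', '110374217', '']
```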
