I have a column in a Pandas dataframe that contains values as follows:
111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA
I need to extract just the first sequence of digits in each row - not all of the digits in the row. So the output would be like this:
111042345
111042345
110374217
109202817
I thought the best way to achieve that would be to split the strings on digits and return the result, but that would also give me the unwanted digits after the non-digit characters.
Use str.extract with a regex: \d matches a digit, {,5} limits the match to at most the first 5 digits, and + matches the whole first run of digits:
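# raw strings keep \d from being treated as a string escape;
# expand=False returns a Series instead of a one-column DataFrame
df['first_5_digits'] = df['Col'].str.extract(r'(\d{,5})', expand=False)
df['all_digits'] = df['Col'].str.extract(r'(\d+)', expand=False)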
print (df)
                       Col first_5_digits all_digits
0              111042345--          11104  111042345
1                111042345          11104  111042345
2    110374217dclid=CA-R3K          11037  110374217
3  109202817lciz@MM10082IA          10920  109202817
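As @Jon Clements pointed out, it is also possible to extract the first N characters by slicing the result:
# expand=False is needed here: .str is only available on a Series, not on a DataFrame
df['first_5_digits'] = df['Col'].str.extract(r'(\d+)', expand=False).str[:5]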
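To make the snippets above easier to try out, here is a minimal self-contained sketch. The column name Col comes from the answer; the extra row 'abc123' and the `^` anchor are my own additions, included only to show what happens when a value does not start with digits.

```python
import pandas as pd

# Sample values from the question, plus one hypothetical row without leading digits
df = pd.DataFrame({"Col": [
    "111042345--",
    "111042345",
    "110374217dclid=CA-R3K",
    "109202817lciz@MM10082IA",
    "abc123",  # hypothetical extra row, not part of the original data
]})

# ^ anchors the match at the start of the string, so only leading digits count;
# expand=False makes str.extract return a Series rather than a DataFrame
df["leading_digits"] = df["Col"].str.extract(r"^(\d+)", expand=False)

# Slicing the extracted strings keeps the first five characters;
# rows with no match stay missing (NaN)
df["first_5_digits"] = df["leading_digits"].str[:5]

print(df)
```

Without the `^` anchor, the pattern would pick up the first run of digits anywhere in the string, so 'abc123' would yield '123' instead of a missing value.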
You can solve this with itertools.takewhile, which keeps characters from the start of each string for as long as they are digits.
In pandas:
data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()
import pandas as pd
from itertools import takewhile
df = pd.DataFrame(data)
# keep characters from the start of each string until the first non-digit
df["numbers"] = df[0].apply(lambda x: ''.join(takewhile(str.isdigit, x)))
print(df)
Output (Pandas):
                         0    numbers
0              111042345--  111042345
1                111042345  111042345
2    110374217dclid=CA-R3K  110374217
3  109202817lciz@MM10082IA  109202817
For normal lists:
data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()
print(data)
from itertools import takewhile
d = [ ''.join(takewhile(str.isdigit,text)) for text in data]
print(d)
Output (simple lists):
# split data
['111042345--', '111042345', '110374217dclid=CA-R3K', '109202817lciz@MM10082IA']
# itertools.takewhile
['111042345', '111042345', '110374217', '109202817']
Edge case:
Suggested by Scott Boston because it is more efficient: use a list comprehension instead of apply:
df["faster numbers"] = [''.join(takewhile(str.isdigit, i)) for i in df[0]]
(Same output, under a different column header.)
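One thing worth keeping in mind: takewhile stops at the first non-digit, so it only returns digits at the very start of the string, whereas a regex search for \d+ finds the first run of digits wherever it begins. For the data in the question both give the same result. The sketch below illustrates the difference; the helper names and the extra sample strings are made up for this example.

```python
import re
from itertools import takewhile

def leading_digits(text):
    """Digits at the very start of text (hypothetical helper)."""
    return "".join(takewhile(str.isdigit, text))

def first_digit_run(text):
    """First run of digits anywhere in text (hypothetical helper)."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else ""

# The first value is from the question; the other two are made-up samples
samples = ["111042345--", "abc123def", "12ab34"]

leading = [leading_digits(s) for s in samples]
first_run = [first_digit_run(s) for s in samples]

print(leading)    # ['111042345', '', '12']
print(first_run)  # ['111042345', '123', '12']
```

Which behaviour is right depends on whether a value like 'abc123def' should count as having a number or not.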
It can be solved using regex:
import re
data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()
# raw string so \d is passed to the regex engine unchanged; [0] takes the first run of digits
output = "\n".join([re.findall(r'\d+', str(d))[0] for d in data])
print(output)
Output:
111042345
111042345
110374217
109202817
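Note that indexing the findall result with [0] raises an IndexError if a string contains no digits at all. A possible way around that, sketched here with a hypothetical helper name and a made-up extra sample, is to use re.search and fall back to a default value:

```python
import re

def first_digits(text, default=""):
    """Return the first run of digits in text, or default if there is none.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"\d+", str(text))
    return match.group(0) if match else default

data = """111042345--
111042345
110374217dclid=CA-R3K
109202817lciz@MM10082IA""".split()

# Made-up extra value to show the fallback for input without digits
data.append("no-digits-here")

result = [first_digits(d) for d in data]
print(result)
```

re.search also stops at the first match, so it avoids collecting every digit run in the string when only the first one is needed.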