Parse substring in email data

Question

data = " #33986=\r\n6 for User ID: 125091. "

Above is what I have split out of an email. The only data I need from it is this: 339866

The data is dynamic: the email is written by a human, so we have to parse it and pick out the value. What distinguishes the value from the rest of the data:

It starts with a 3 and is always 6 characters long. How can I turn this into code that parses the text and finds it?

What is the best way to clean the substring of the HTML and random letters so that I get only this number and ignore the second set of numbers?

I have done the following:

data = re.findall(r"\d+", data)

Response is:

['33986', '6', '125091']

It's a very ugly result. Is there a cleaner method?

Answer 1

import re
data = " #33986=\r\n6 for User ID: 125091. "
# Five digits starting with 3, then "=", CR, LF, then the sixth digit.
x = re.search(r"(3\d\d\d\d)\S\s\s(\d)", data)
data = x.group(1) + x.group(2)
print(data)

This prints the data you need. Use int(data) if you need the final value to be an integer.
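The pattern above expects the line break to fall after exactly five digits. A minimal sketch of a more tolerant variant follows. It assumes the `=\r\n` in the sample is a soft line break added by the mail encoding (the question does not confirm this), and `extract_code` is a hypothetical helper name. The break is removed first, then a six-digit run starting with 3 is searched for.

```python
import re

def extract_code(text):
    """Return the first 6-digit run starting with 3, or None.

    Hypothetical helper: assumes "=" followed by a line break is a
    soft line break that can be dropped before matching.
    """
    cleaned = re.sub(r"=\r?\n", "", text)
    # Lookarounds keep the match from being part of a longer number.
    match = re.search(r"(?<!\d)3\d{5}(?!\d)", cleaned)
    return match.group(0) if match else None

data = " #33986=\r\n6 for User ID: 125091. "
print(extract_code(data))  # 339866
```

Returning None instead of raising lets the caller decide what to do with emails that contain no matching number.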

Answer 2

Welcome to StackOverflow.

You could filter the remaining data for the conditions you need.

Mind that the matching string in your list is only 5 digits long.

data = ['33986', '6', '125091']

for s in data:
    if len(s) == 5 and s[0] == "3":
        print("This is the solution: " + s)

Answer 3

The simplest way to get what you want is to join the matches and trim the string to the first 6 digits only. You can do it as follows.

import re
data = " #33986=\r\n6 for User ID: 125091. "
d = ''.join(re.findall(r"\d+", data))[:6]
print(d)

You can get the first sequence of 6 digits that starts with 3 using the code below, where s is the string to search. Note that the 6 digits must be consecutive in s.

x = re.findall(r"\D(3\d{5})\D", " "+s+" ")[0]

If you want to get all of them, skip the [0]. It will give you a list of values. Remember that it still picks only the numbers that start with 3 and have 6 digits. If you want all the numbers, use the code below.

x = re.findall(r"\d+", s)

If you want to concatenate all of them into a single number, you can do the following.

''.join(re.findall(r"\d+", s))

If you want to concatenate only the first 2 elements of the regex result, you can use

''.join(re.findall(r"\d+", s)[:2])

Here's what I got with the code below:

data = " #33986=\r\n6 for User ID: 125091. "

# to join the first 2 groups of digits, use this
x = ''.join(re.findall(r"\d+", data)[:2])
print(x)

# if you want all the numbers, you can use this
y = ''.join(re.findall(r"\d+", data))
print(y)

Output:

339866
339866125091
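As a side note on the padded `\D(3\d{5})\D` pattern above: here is a sketch of the same check written with lookarounds, so the string needs no padding and two codes separated by a single character are both found. The sample string `s` is made up for illustration.

```python
import re

# Made-up sample: two candidate codes separated by one comma,
# plus a 6-digit number that does not start with 3.
s = "339866,312345 and 125091"

# (?<!\d) and (?!\d) assert "no digit here" without consuming characters.
codes = re.findall(r"(?<!\d)3\d{5}(?!\d)", s)
print(codes)   # ['339866', '312345']

# The padded pattern consumes the comma, so the second code is missed.
padded = re.findall(r"\D(3\d{5})\D", " " + s + " ")
print(padded)  # ['339866']
```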

Answer 4

This might not be the most optimized solution, but it will solve your case.

import re
data = " #33986=\r\n6 for User ID: 125091. "
digit_runs = re.findall(r"\d+", data)
final_result = ""
fixed_length = 6
for element in digit_runs:
    # Skip digit runs until one starts with 3.
    if not final_result and not element.startswith("3"):
        continue
    # Append only as many digits as are still missing.
    final_result += element[:fixed_length - len(final_result)]
    if len(final_result) == fixed_length:
        break
print(final_result)

Output: 339866

Answer 5

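import re
# Raw string: \r and \n stay as literal backslash sequences here.
data = r'#33986=\r\n6 for User ID: 125091.'
# Five digits starting with 3, any five characters, then one digit.
data1 = re.search(r"(3\d{4}).....(\d)", data)
print(data1.group(1) + data1.group(2))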
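One thing worth checking with this answer: the data is written as a raw string, so `\r\n` is four literal characters rather than a carriage return and a line feed. The sketch below compares both forms. The variable names `raw` and `real` are only for illustration, and the second pattern is one possible way to match the real line break, not the only one.

```python
import re

raw = r'#33986=\r\n6 for User ID: 125091.'   # backslashes kept literally
real = '#33986=\r\n6 for User ID: 125091.'   # real CR and LF characters

pattern = r"(3\d{4}).....(\d)"
print(bool(re.search(pattern, raw)))  # True
# "." does not match "\n" by default, so the five dots fail here.
print(re.search(pattern, real))       # None

# Matching the break explicitly works on the real characters.
m = re.search(r"(3\d{4})=\r?\n(\d)", real)
print(m.group(1) + m.group(2))        # 339866
```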

Answer 6

Try this:

import re
codes = [c[1:-1] for c in re.findall(r'[^0-9]3[0-9]{5}[^0-9]', data)]
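This pattern needs six consecutive digits, which the sample string only has once the `=\r\n` is removed. If that sequence is a quoted-printable soft line break (an assumption, the question does not state the encoding), the standard library's `quopri.decodestring` removes it. A rough sketch:

```python
import quopri
import re

data = " #33986=\r\n6 for User ID: 125091. "

# Assumption: the body is quoted-printable, so "=" + CRLF is a soft line break.
decoded = quopri.decodestring(data.encode("ascii")).decode("ascii")

# Same pattern as above, applied to the decoded text.
codes = [c[1:-1] for c in re.findall(r"[^0-9]3[0-9]{5}[^0-9]", decoded)]
print(codes)  # ['339866']
```

If the text comes from a full message parsed with the `email` package, letting that package decode the body is likely the cleaner route.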

6 answers

solution1
1 2020-08-15 04:56:50

solution2
0 2020-08-15 04:56:17

solution3
0 2020-08-15 05:23:43

solution4
0 2020-08-15 05:34:07

solution5
0 2020-08-15 05:34:43

solution6
0 2020-08-15 06:52:47
