简体   繁体   中英

Regex to match multiple numbers within string

I have a regex that looks like this to extract order numbers from columns:

df["Orders"].str.extract('([0-9]{9,10}[/+ #_;.-]?)')

The orders column can look like this:

12
123456789
1234567890
123456789/1234567890
123456789/1/123456789
123456789+1234567890

The resulting new column in the dataframe after the regex should look like this:

NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789+1234567890

However, with my current regex I'm getting the following result:

NaN
123456789
1234567890
123456789/
123456789/
123456789+

How can I get the result that I'm looking for?

You can use

import pandas as pd
df = pd.DataFrame({'Orders':['12','123456789','1234567890','123456789/1234567890','123456789/1/123456789','123456789+1234567890', 'Order number: 6508955960_000010_1005500']})
df["Result"] = df["Orders"].str.findall(r'[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])').str.join('').str.lstrip('/+ #_;.-')
df.loc[df['Result'] == '', 'Result'] = np.nan

See the regex demo . Details

  • [/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9]) - matches an optional / , + , space, # , _ , ; , . or - char, and then none or ten digit number not enclosed with other digits
  • Series.str.findall extracts all occurrences
  • .str.join('') concatenates the matches into a single string
  • .str.lstrip('/+ #_;.-') - removes the special chars that were matched with the number at the beginning of the string
  • df.loc[df['Result'] == '', 'Result'] = np.nan - if needed - replaces empty strings with np.nan values in the Result column.

Output:

>>> df
                  Orders                Result
0                    NaN                   NaN
1              123456789             123456789
2             1234567890            1234567890
3   123456789/1234567890  123456789/1234567890
4  123456789/1/123456789   123456789/123456789
5   123456789+1234567890  123456789+1234567890
>>> 

You can adapt the next code to work with data frames,

RegExp: (?:^|([/+ #_;.-]))(?:\d{1,8})(?!\d)

  • (?:\d{1,8})(?!\d) - find a number (<9 digits)
  • ([/+ #_;.-]) - preceded by one/none of the possible delimiters (group #1)

Conditionally replace with NaN or empty string - subst uses match.group(1) to differ between the two options:

  • standalone-invalid - 12
  • invalid-with-delimiter - /1
import re

regex = r"(?:^|([/+ #_;.-]))(?:\d{1,8})(?!\d)"

test_str = ("12\n"
            "123456789\n"
            "1234567890\n"
            "123456789/1234567890\n"
            "123456789/1/123456789\n"
            "123456789+1234567890")

def subst(match):
    m = match.group(1)
    return "" if m else "NaN"

result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

if result:
    print(result)

Output:

NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789+1234567890

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM