
pyspark regex to match domain\username pattern

I have a string of the form domain\username in an array. I want to match it and replace it.

The string has following pattern:

[, DESKTOP-XXQYY56\Adminaccount, ] [, MB4345XX\adminaccount, ]

The code I am using is as follows:

from pyspark.sql.functions import regexp_replace

df2 = df1.withColumn(
    'str1',
    regexp_replace(
        'str',
        r'^([A-Za-z0-9]+(-[A-Za-z0-9]+)*)+(\\?([A-Za-z0-9])+)*',
        'AB22'
    )
)

I am not able to match the pattern correctly. How should I change the regex so that it matches the string and I can replace it?

If you want to match that format and replace the domain\user part with XXXX, you might use 2 capturing groups: one for the opening [, and one for the closing , ]

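You could omit the anchor ^, because the string starts with [ and not with the domain name. In the part ([A-Za-z0-9])+ move the quantifier + inside the group, right after the character class, as in ([A-Za-z0-9]+). Otherwise you repeat a group that matches a single char, and the group only keeps the last char it matched.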
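Both points can be checked outside Spark with Python's re module, which treats these constructs the same way as the Java regex engine that Spark uses. This is only a small sketch; the sample string is the one from the question.

```python
import re

sample = r"[, DESKTOP-XXQYY56\Adminaccount, ]"

# With the ^ anchor the match must begin at position 0, but the string
# starts with "[, ", so the anchored pattern finds nothing.
anchored = re.search(r"^[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sample)

# Without the anchor the domain part is found inside the string.
unanchored = re.search(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sample)

# A quantifier on the group repeats the group one char at a time,
# so the group only keeps the last character it matched.
repeated_group = re.search(r"([A-Za-z0-9])+", "Adminaccount")

# A quantifier inside the group captures the whole run.
repeated_class = re.search(r"([A-Za-z0-9]+)", "Adminaccount")

print(anchored)                  # None
print(unanchored.group(0))       # DESKTOP-XXQYY56
print(repeated_group.group(1))   # t
print(repeated_class.group(1))   # Adminaccount
```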

If you are not using the capturing groups separately for further processing, you could turn them into non capturing groups by starting them with (?:

The pattern might look like

(\[, )[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\\?[A-Za-z0-9]+)*(, \])

In parts

  • (\[, ) Capture group 1, match [ followed by a comma and a space
  • [A-Za-z0-9]+ Match 1+ times any of the listed in the character class
  • (?: Non capturing group
    • -[A-Za-z0-9]+ Match - and match 1+ times any of the listed
  • )* Close non capturing group and repeat 0+ times
  • (?: Non capturing group
    • \\?[A-Za-z0-9]+ Match an optional \ and 1+ times any of the listed
  • )* Close non capturing group and repeat 0+ times
  • (, \]) Capture group 2, match a comma and a space followed by ]
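The breakdown above can be tried against both sample strings from the question. The sketch below uses Python's re module rather than Spark, on the assumption that the two engines agree for this pattern; the variable names are only illustrative.

```python
import re

# Pattern from the answer; a raw string keeps the backslashes literal.
pattern = re.compile(
    r"(\[, )[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\\?[A-Za-z0-9]+)*(, \])"
)

samples = [
    r"[, DESKTOP-XXQYY56\Adminaccount, ]",
    r"[, MB4345XX\adminaccount, ]",
]

results = []
for s in samples:
    m = pattern.search(s)
    # group(1) is the opening "[, ", group(2) is the closing ", ]",
    # and group(0) is the whole match, which here spans the full string.
    results.append((m.group(1), m.group(2), m.group(0) == s))

for r in results:
    print(r)
```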

In the replacement use the 2 capturing groups

$1XXXX$2
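Put together in PySpark, this might look like the sketch below. The column names str and str1 come from the question; the helper name mask_user is hypothetical. Spark's regexp_replace takes Java-style group references ($1, $2) in the replacement, while Python's own re module writes them as \g<1> and \g<2>, so the plain-Python check at the end uses a different replacement string for the same pattern.

```python
import re

# Pattern and replacement from the answer. Spark uses Java regex syntax,
# where $1 and $2 in the replacement refer to the capture groups.
PATTERN = r"(\[, )[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\\?[A-Za-z0-9]+)*(, \])"
REPLACEMENT = "$1XXXX$2"


def mask_user(df, src="str", dst="str1"):
    """Return df with a new column dst: src with the domain\\user part masked.

    Column names default to those in the question; the function name
    is hypothetical.
    """
    # Imported here so the constants above can be tried without Spark.
    from pyspark.sql.functions import regexp_replace

    return df.withColumn(dst, regexp_replace(src, PATTERN, REPLACEMENT))


# Plain-Python check of the same pattern; Python's re writes group
# references as \g<1> rather than $1.
py_replacement = r"\g<1>XXXX\g<2>"
masked = re.sub(PATTERN, py_replacement, r"[, MB4345XX\adminaccount, ]")
print(masked)  # [, XXXX, ]
```

With a DataFrame at hand, the call would be df2 = mask_user(df1).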

