
Regex to extract unique string to new column, getting error "look-behind requires fixed-width pattern"

I need help extracting unique strings into a separate column.

import pandas as pd

df = pd.DataFrame({'File Name':['90.12.21 / 02.05 / XO3 File Name Type',
                                '10.22.43 / X.89 / XO20G9992 Document Internal Only',
                                'Phase 3',
                                '22.32.42.12 / 99.23 / XO2 Location Site 3: Park Triangle',
                                '38.23.99.22 / X.23 / XO28W9998 Block 4 Beach/Dock Camp',
                                '39.24.32.49 / 37.29 / Blue-print/Register Info Site (RISs)',
                                '23.21.53.32 / Q.21 / XO R9924 Location Place 5: Drive Place (Active)',
                                '   33.51.63.33 / X.21 / XO20W8812 Area Place 1: Beach Drive']}) 

Here's what the dataframe currently looks like:

| File Name                                                            |
|----------------------------------------------------------------------|
| 90.12.21 / 02.05 / XO3 File Name Type                                |
| 10.22.43 / X.89 / XO20G9992 Document Internal Only                   |
| Phase 3                                                              |
| 22.32.42.12 / 99.23 / XO2 Location Site 3: Park Triangle             |
| 38.23.99.22 / X.23 / XO28W9998 Block 4 Beach/Dock Camp               |
| 39.24.32.49 / 37.29 / Blue-print/Register Info Site (RISs)           |
| 23.21.53.32 / Q.21 / XO R9924 Location Place 5: Drive Place (Active) |
| 33.51.63.33 / X.21 / XO20W8812 Area Place 1: Beach Drive             |

Here's what I need it to look like:

| File Name                              |
|----------------------------------------|
| File Name Type                         |
| Document Internal Only                 |
|                                        |
| Location Site 3: Park Triangle         |
| Block 4 Beach/Dock Camp                |
| Blue-print/Register Info Site (RISs)   |
| Location Place 5: Drive Place (Active) |
| Area Place 1: Beach Drive              |

Here's my attempted solution:

I know that str.extract(r'') extracts a regex match into a new column. I also know that a positive lookbehind lets me select everything that follows a given prefix, which is the text I want from the end of each string. So I wrote a positive-lookbehind regex that captures most of the strings I want: https://regexr.com/4t4ll. It's still not a perfect solution.

But even when I try extracting my selections using this line of code: df['File Name'].str.extract(r'((?<=\/ XO\d |XO\d[0-9]\w\d\d\d\d | XO \w\d\d\d\d ).*)'), I get an error message: "look-behind requires fixed-width pattern."

I need help figuring out how to make my regex work in str.extract(r''). How can I make it capture the string that appears at the end of each entry?

You may use

.*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)

See the regex demo.

Details

  • .* - 0+ chars other than line break chars, as many as possible
  • \s - a whitespace
  • / - a / char
  • (?:\s+XO[A-Z0-9\s]*\b)? - an optional pattern:
    • \s+ - 1+ whitespaces
    • XO - XO
    • [A-Z0-9\s]* - 0+ uppercase letters, digits, or whitespaces, followed by
    • \b - a word boundary
  • \s+ - 1+ whitespaces
  • (.+) - Group 1 (what str.extract will return): any 1+ chars other than line break chars, as many as possible
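
The breakdown above can be checked without pandas. This is a minimal sketch using Python's built-in re module; extract_tail is a hypothetical helper name introduced only for illustration, and the sample strings are copied from the question.

```python
import re

# Pattern from the answer; group 1 holds the trailing description.
PATTERN = re.compile(r'.*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)')

def extract_tail(name):
    """Hypothetical helper: return group 1, or '' when nothing matches."""
    match = PATTERN.search(name)
    return match.group(1) if match else ''

# A few of the file names from the question.
samples = [
    '90.12.21 / 02.05 / XO3 File Name Type',
    'Phase 3',
    '38.23.99.22 / X.23 / XO28W9998 Block 4 Beach/Dock Camp',
    '23.21.53.32 / Q.21 / XO R9924 Location Place 5: Drive Place (Active)',
]

for s in samples:
    print(repr(extract_tail(s)))
```

'Phase 3' contains no " /" separator, so the pattern does not match it and the helper falls back to an empty string. The slash in "Beach/Dock" is not treated as a separator because it has no whitespace before it.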

In Pandas, use

df['Result'] = df['File Name'].str.extract(r'.*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)', expand=False).fillna('')

Result:

                                   Result  
0  File Name Type                          
1  Document Internal Only                  
2                                          
3  Location Site 3: Park Triangle          
4  Block 4 Beach/Dock Camp                 
5  Blue-print/Register Info Site (RISs)    
6  Location Place 5: Drive Place (Active)  
7  Area Place 1: Beach Drive
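
As for the error in the question: Python's re module only accepts a look-behind whose alternatives all match the same number of characters, and the alternatives in the question's look-behind are 6 and 10 characters wide. The sketch below reproduces the error with plain re and shows that moving the same alternatives into a non-capturing group avoids it. The variable names are illustrative only, and the second pattern is just a demonstration of the workaround: it still does not cover every row the way the pattern above does.

```python
import re

# Look-behind from the question: its alternatives are 6 and 10 characters
# wide, so the look-behind has no single fixed width.
lookbehind = r'((?<=\/ XO\d |XO\d[0-9]\w\d\d\d\d | XO \w\d\d\d\d ).*)'

try:
    re.compile(lookbehind)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Same alternatives in a non-capturing group, which has no width restriction;
# the capturing group after it holds the text that follows the prefix.
consuming = re.compile(r'(?:\/ XO\d |XO\d[0-9]\w\d\d\d\d | XO \w\d\d\d\d )(.*)')

match = consuming.search('90.12.21 / 02.05 / XO3 File Name Type')
tail = match.group(1) if match else ''
print(repr(tail))
```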


 