How to preserve the matched pattern when splitting with stringr (str_split) or strsplit in R

I have some filthy data that needs tidying. Here's an example:

x <- "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 

FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX

FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"
    

I'm trying to get a single-column table with one row per record, splitting the data at each first-space-last name (all caps, e.g. FIRST LAST, FIRSTA LASTA, FIRSTB LASTB) while preserving that name. I started with base strsplit but gave up. Here are my stringr attempts:

str_split(x, "[A-Z]+ (?=[A-Z]+)")

This is pretty close, but I lose the names.

str_split(x, "(?<=[A-Z]+) (?=[A-Z]+)")

This throws an error because the lookbehind has no bounded maximum length.

Expected output:

[1] FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 

[2] FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 

[3] FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553

I guess you want to get each record. If so, you need to split on the newline that precedes a first name and last name:

str_split(x, "\\n(?=[A-Za-z]+ [A-Za-z]+)")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 \n"
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX\n"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"  

Lookbehinds of unbounded length aren't supported by the underlying regex library (ICU) that {stringr} uses.

Following the discussion with @Onyambu: if your dataset doesn't have line feeds (i.e. newlines), you can use the following:

str_split(x, " +(?=[A-Za-z]+ [A-Za-z]+ [A-Z][a-z]+ \\d{1,2}, \\d{4} +\\d{7})")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555" 
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"   

If the names are always uppercase, as @Onyambu suggested, then the regex can be simplified:

str_split(x, " +(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555" 
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"  
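
Since the question also mentions base strsplit, here is a minimal sketch of the same lookahead idea without {stringr}, using perl = TRUE. The lookahead keeps the name because only the whitespace in front of it is consumed. The sample string is a shortened stand-in for the question's data, and the names pattern, records and out are only illustrative. Splitting on \\s+ rather than a literal space is an assumption that lets one pattern cover both the newline-separated and the space-separated case.

```r
# Shortened stand-in for the question's data: three records separated by blank lines.
x <- paste(
  "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009",
  "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Smith REX",
  "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere",
  sep = "\n\n"
)

# Consume only the whitespace; the lookahead checks for
# "UPPERCASE UPPERCASE Mon <digit>" without removing it from the output.
pattern <- "\\s+(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)"
records <- strsplit(x, pattern, perl = TRUE)[[1]]

# One record per row in a single-column table, as asked in the question.
out <- data.frame(record = trimws(records), stringsAsFactors = FALSE)
```

Inspecting out$record should show one full record per row, each still starting with its name. All-caps text inside a record, such as "DR. JOHN SMITH", is not split in this sample because it is not followed by a month and a digit.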
