I have some filthy data that needs tidying. Here's an example:
x <- "FIRST LAST Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009 Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555
FIRSTA LASTA Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 Smith REX
FIRSTB LASTB Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009 Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 Somewhere"
I'm trying to get a table with a single column, with the data split into rows at each first -space- last name (all caps, e.g. FIRST LAST, FIRSTA LASTA, FIRSTB LASTB), while preserving said name. I started with base strsplit but gave up. Here are my stringr attempts:
str_split(x, "[A-Z]+ (?=[A-Z]+)")
This is pretty close, but I lose the names.
str_split(x, "(?<=[A-Z]+) (?=[A-Z]+)")
This throws an error because the look-behind has no bounded maximum length.
Expected output:
[1] FIRST LAST Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009 Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555
[2] FIRSTA LASTA Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 1002 Main St. Somewhere, CA 90009 Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552
[3] FIRSTB LASTB Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009 Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553
I guess you want one element per record. If so, you need to split on the newline that precedes a first name and last name:
str_split(x, "\\n(?=[A-Za-z]+ [A-Za-z]+)")
[[1]]
[1] "FIRST LAST Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009 Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555 \n"
[2] "FIRSTA LASTA Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 Smith REX\n"
[3] "FIRSTB LASTB Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009 Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 Somewhere"
Dynamic-length look-behinds aren't supported by the underlying regex library (ICU, via stringi) that {stringr} uses, which is why your second attempt fails.
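To illustrate the limitation, here is a minimal sketch on a made-up string `s` (not the asker's data). It assumes {stringr} is installed; the `{1,10}` bound is an arbitrary choice for the demo. The unbounded look-behind is wrapped in `tryCatch()` so the script keeps running whether or not the installed ICU version rejects it, while the bounded variant is accepted.

```r
library(stringr)

# Made-up toy input, only for demonstrating the look-behind rule.
s <- "AB CD ef"

# Unbounded look-behind ([A-Z]+): per the question this raises an error.
# tryCatch() captures the message instead of stopping the script.
unbounded <- tryCatch(
  str_split(s, "(?<=[A-Z]+) (?=[A-Z]+)"),
  error = function(e) conditionMessage(e)
)

# Bounded look-behind ([A-Z]{1,10}): the maximum length is known, so it is accepted.
bounded <- str_split(s, "(?<=[A-Z]{1,10}) (?=[A-Z]+)")[[1]]
print(bounded)
```

Note that a bounded look-behind like this still splits between any two adjacent upper-case words (for example between FIRST and LAST), so it only removes the error; it does not by itself produce one element per record.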
Following the discussion with @Onyambu: if your dataset doesn't have line feeds (i.e. newlines), you can use the following:
str_split(x, " +(?=[A-Za-z]+ [A-Za-z]+ [A-Z][a-z]+ \\d{2}, \\d{4} +\\d{7})")
[[1]]
[1] "FIRST LAST Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009 Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555"
[2] "FIRSTA LASTA Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009 Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 Somewhere"
If the name is always uppercase, as @Onyambu suggested, then the regex can be simplified:
str_split(x, " +(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)")
[[1]]
[1] "FIRST LAST Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009 Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555"
[2] "FIRSTA LASTA Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020 2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009 Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 Somewhere"
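Since the question asks for a single-column table, the split result can be wrapped in a data frame. The following is only a sketch: `x` here is an abbreviated, hypothetical stand-in for the real data (all records on one line), and the names `records`, `records_df` and the column name `record` are my own choices, not something from the question.

```r
library(stringr)

# Abbreviated, hypothetical stand-in for the real data: records on one line,
# each starting with an all-caps first and last name followed by a date.
x <- paste(
  "FIRST LAST Sep 1, 2020 1234567 Jan 1, 1985 555-555-5555 Diflucan",
  "FIRSTA LASTA Sep 12, 2020 2234567 Jan 12, 1985 DR. JOHN SMITH Flonase Smith REX",
  "FIRSTB LASTB Sep 13, 2020 3234567 Jan 13, 1985 555-555-5553 Somewhere"
)

# Split on the spaces that precede "UPPER UPPER Mon <digit>", i.e. a new record.
# The names stay in the output because they are only matched by the look-ahead.
records <- str_split(x, " +(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)")[[1]]

# One row per record, one column; str_trim() drops stray leading/trailing whitespace.
records_df <- data.frame(record = str_trim(records), stringsAsFactors = FALSE)
print(nrow(records_df))
```

If the real data does contain newlines between records, the newline-based pattern from the first snippet can be swapped in without changing the rest.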