
How to preserve the matched pattern when splitting with stringr (str_split) or strsplit in R

I have some filthy data that needs tidying. Here's an example:

x <- "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 

FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX

FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"
    

I'm trying to get a table with a single column, where each row is one record. The data should be split at each first name, space, last name pair (all caps, e.g. FIRST LAST, FIRSTA LASTA, FIRSTB LASTB), while preserving said name. I started with base strsplit but gave up. Here are my stringr attempts:

str_split(x, "[A-Z]+ (?=[A-Z]+)")

This is pretty close, but I lose the names.

str_split(x, "(?<=[A-Z]+) (?=[A-Z]+)")

This throws an error because the lookbehind lacks a bounded maximum length.

Expected output:

[1] FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 

[2] FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 

[3] FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553

I guess what you want is to get each record. If so, you need to split on the newline that precedes a first name and last name:

str_split(x, "\\n(?=[A-Za-z]+ [A-Za-z]+)")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 \n"
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX\n"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"  
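Each piece above still carries the leftover "\n" and trailing spaces. A minimal sketch of cleaning that up, assuming a shortened stand-in string (`x_small` is hypothetical, not the data from the question) and using `str_trim()` from {stringr}:

```r
library(stringr)

# Hypothetical, shortened stand-in for the data in the question:
# records are separated by a blank line, as in the original x
x_small <- "FIRST LAST Sep 1, 2020 1234567 \n\nFIRSTA LASTA Sep 12, 2020 2234567 Smith REX\n\nFIRSTB LASTB Sep 13, 2020 3234567 Somewhere"

# Split on the newline that sits directly before "word space word".
# The lookahead (?=...) is zero-width, so the name itself is not consumed.
parts <- str_split(x_small, "\\n(?=[A-Za-z]+ [A-Za-z]+)")[[1]]

# Strip the leftover newline and any trailing spaces from each piece
records <- str_trim(parts)
records
```

Because the name sits inside the lookahead rather than the matched text, only the newline is removed by the split.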

Dynamic-length lookbehinds aren't supported by the underlying regex library (ICU, via {stringi}) that {stringr} uses.
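To illustrate the point about lookbehinds, here is a small sketch on a toy string (`s` is made up for this example). The unbounded version is wrapped in `tryCatch()` because the question reports that it errors. The bounded version uses `{1,20}`, an arbitrary upper limit chosen only for illustration:

```r
library(stringr)

# Toy input, not the data from the question
s <- "AB CD ef"

# Unbounded lookbehind ([A-Z]+): reported in the question to throw an error,
# so capture the message instead of letting the script stop
unbounded <- tryCatch(
  str_split(s, "(?<=[A-Z]+) (?=[A-Z]+)"),
  error = function(e) conditionMessage(e)
)

# Bounded lookbehind ([A-Z]{1,20}): the maximum length is known up front
bounded <- str_split(s, "(?<=[A-Z]{1,20}) (?=[A-Z]+)")[[1]]
bounded
```

Note that bounding the lookbehind only removes the error. This pattern still splits at every space between two capitalised tokens, so on the real data it would also cut between the first and last name. That is why the approaches here rely on a lookahead alone.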

Following the discussion with @Onyambu: if your dataset doesn't have line feeds (i.e. newlines), you can use the following:

str_split(x, " +(?=[A-Za-z]+ [A-Za-z]+ [A-Z][a-z]+ \\d{2}, \\d{4} +\\d{7})")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555" 
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"   
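One caveat with the pattern above: `\\d{2}` requires a two-digit day, so a record starting with a date like "Sep 1, 2020" would not trigger a split unless it happens to be the first record. A hedged variant using `\\d{1,2}` is sketched below on a made-up string (`x_flat` is hypothetical):

```r
library(stringr)

# Hypothetical flat input: the second record starts with a single-digit day
x_flat <- "FIRSTA LASTA Sep 12, 2020   2234567 Smith REX FIRST LAST Sep 1, 2020   1234567 Somewhere"

# Pattern from the answer: the day must have exactly two digits
strict <- str_split(
  x_flat,
  " +(?=[A-Za-z]+ [A-Za-z]+ [A-Z][a-z]+ \\d{2}, \\d{4} +\\d{7})"
)[[1]]

# Variant: the day may have one or two digits
flexible <- str_split(
  x_flat,
  " +(?=[A-Za-z]+ [A-Za-z]+ [A-Z][a-z]+ \\d{1,2}, \\d{4} +\\d{7})"
)[[1]]

length(strict)    # no split happens before "FIRST LAST Sep 1, 2020"
length(flexible)  # the single-digit day is now recognised
```

Whether this matters depends on the real data. If every day is zero-padded or two digits long, the original pattern is enough.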

If the name is uppercase, as @Onyambu suggested, then the regex can be simplified:

str_split(x, " +(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555" 
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"  
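Since the goal in the question is a single-column table, the list returned by `str_split()` can be unwrapped and put into a data frame. A minimal sketch, assuming a shortened flat string (`x_flat` and the column name `record` are made up for illustration):

```r
library(stringr)

# Hypothetical, shortened stand-in without newlines
x_flat <- "FIRST LAST Sep 1, 2020 1234567 DR. JOHN SMITH Cetirizine HCl Oct 1, 2020 FIRSTA LASTA Sep 12, 2020 2234567 Smith REX FIRSTB LASTB Sep 13, 2020 3234567 Somewhere"

# Split before "UPPER UPPER Month digit"; the lookahead keeps the name in place
parts <- str_split(x_flat, " +(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)")[[1]]

# One row per record, in a single column
records <- data.frame(record = str_trim(parts), stringsAsFactors = FALSE)
records
```

`str_split()` returns a list with one element per input string, hence the `[[1]]` before building the data frame.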
