How to split with stringr (str_split) or strsplit in R while retaining the regex match

Question

I have some filthy data that needs tidying. Here's an example:

x <- "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 

FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX

FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"

I'm trying to get a table with a single column, where each row holds one record. The data should be split at each first name, space, last name pattern (all caps, e.g. FIRST LAST, FIRSTA LASTA, FIRSTB LASTB) while preserving the name itself. I started with base strsplit but gave up. Here are my stringr attempts:

str_split(x, "[A-Z]+ (?=[A-Z]+)")

This is pretty close, but I lose the names.

str_split(x, "(?<=[A-Z]+) (?=[A-Z]+)")

This throws an error because the look-behind has no bounded maximum length.

Expected output:

[1] FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 

[2] FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 

[3] FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553

Answer 1

I guess you want one element per record. If so, you need to split on the newline that directly precedes a first name and last name:

str_split(x, "\\n(?=[A-Za-z]+ [A-Za-z]+)")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 \n"
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX\n"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"
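The pieces returned above still carry the trailing space and newline of each record. A minimal sketch of cleaning them up with str_trim, using a shortened, hypothetical two-record sample (x_small is not part of the original data):

```r
library(stringr)

# Hypothetical two-record sample; records are separated by a blank line,
# as in the question's data.
x_small <- "FIRST LAST Sep 1, 2020 1234567 \n\nFIRSTA LASTA Sep 12, 2020 2234567"

# Split on the newline that directly precedes a "word word" name.
pieces <- str_split(x_small, "\\n(?=[A-Za-z]+ [A-Za-z]+)")[[1]]

# Remove the leftover whitespace (spaces and newlines) at both ends.
records <- str_trim(pieces)
```

str_trim only touches the ends of each string, so the runs of spaces inside a record are left as they are.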

Variable-length look-behinds aren't supported by the underlying regex library (ICU, via stringi) that {stringr} uses, which is why the second attempt in the question throws an error.
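To illustrate the restriction only, here is a small sketch on a toy string. It assumes that giving the look-behind an explicit upper bound ({1,20} is an arbitrary choice) is enough for ICU to accept it. Note that this pattern splits between the first and last name, so it demonstrates the syntax rather than solving the record-splitting problem:

```r
library(stringr)

# Unbounded look-behind: the question reports that this form throws an error,
# so the call is wrapped in tryCatch() to keep the script running.
unbounded <- tryCatch(
  str_split("AB CD", "(?<=[A-Z]+) (?=[A-Z]+)"),
  error = function(e) "error"
)

# Bounded look-behind: {1,20} gives the engine a maximum length to check.
bounded <- str_split("AB CD", "(?<=[A-Z]{1,20}) (?=[A-Z]+)")[[1]]
```

In practice a lookahead-only pattern avoids the issue altogether, since the text that precedes the split point does not need to be inspected.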

Following the discussion with @Onyambu: if your dataset doesn't contain newlines (line feeds), you can split on the spaces that precede a name followed by a date and a 7-digit number:

# \\d{1,2} also matches single-digit days such as "Sep 1, 2020"
str_split(x, " +(?=[A-Za-z]+ [A-Za-z]+ [A-Z][a-z]+ \\d{1,2}, \\d{4} +\\d{7})")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555" 
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"

If the names are always uppercase, as @Onyambu suggested, the regex can be simplified:

str_split(x, " +(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)")
[[1]]
[1] "FIRST LAST Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555 100 Main St. Somewhere, CA 90009  Atorvastatin Calcium Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 1, 2020 Sep 1, 2020   1234567 Jan 1, 1985 555-555-5555" 
[2] "FIRSTA LASTA Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 1002 MAIN AVE, CA 90009 DR. JOHN SMITH Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 12, 2020 Sep 12, 2020   2234567 Jan 12, 1985 555-555-5552 Smith REX"
[3] "FIRSTB LASTB Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 1003 Main St. Somewhere, CA 90009  Cetirizine HCl Diflucan Flonase Allergy Relief Hydrochlorothiazide HydrOXYzine Pamoate Oct 13, 2020 Sep 13, 2020   3234567 Jan 13, 1985 555-555-5553 Somewhere"
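Since the question also mentions base strsplit and a single-column table, here is a rough sketch of the same lookahead idea without stringr. It assumes perl = TRUE (PCRE) is acceptable. The sample string x_flat and the column name record are made up for illustration:

```r
# Hypothetical sample without newlines; records are separated by spaces only.
x_flat <- paste(
  "FIRST LAST Sep 1, 2020 1234567 Main St.",
  "FIRSTA LASTA Sep 12, 2020 2234567 DR. JOHN SMITH Smith REX",
  "FIRSTB LASTB Sep 13, 2020 3234567 Somewhere"
)

# perl = TRUE switches strsplit() to PCRE, which supports lookahead.
# The spaces are consumed by the split; the name in the lookahead is kept.
records <- strsplit(x_flat, " +(?=[A-Z]+ [A-Z]+ [A-Z][a-z]+ \\d)", perl = TRUE)[[1]]

# One record per row in a single-column table.
tbl <- data.frame(record = records, stringsAsFactors = FALSE)
```

Because the pattern consumes at least one space, it should not run into the quirks strsplit has with purely zero-length matches.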

Answer 1
2 2020-11-18 20:30:42
