Conditionally collapse selected elements of an R character vector on a string match

I have a character vector in which I'd like to match a specific string, collapse the element containing that match with only the next element, and continue in this way until the end of the vector. Here is one example:

'"FundSponsor:Blackrock Advisors" "Category:"  "Tax-Free Income-Pennsylvania"  "Ticker:"  "MPA" "NAV Ticker:" "XMPAX"                          "Average Daily Volume (shares):" "26,000"                         "Average Daily Volume (USD):"    "$0.335M"                        "Inception Date:"  "10/30/1992" "Inception Share Price:" "$15.00"                         "Inception NAV:" "$14.18" "Tender Offer:" "No"                             "Term:" "No"'   

It would be great to combine each element containing a : with only the element that follows it, but I've struggled with the paste function: it collapses the entire vector into one element, which is not the targeted solution I'm looking for.

Here's an example of what I'd like a portion of the revised output to look like:

"Inception Share Price:$15.00"

I am not sure whether you want each result as a single key:value element, or whether you just want to clean up that long string so that it reads key1: value1 key2: value2 key3: value3. In the latter case, you can do it with the following code:

char = '"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'

char_tidy = gsub('\\" \\"', " ", char)

# output is below
> char_tidy
[1] "\"FundSponsor:Blackrock Advisors Category: Tax-Free Income-Pennsylvania Ticker: MPA NAV Ticker: XMPAX Average Daily Volume (shares): 26,000 Average Daily Volume (USD): $0.335M Inception Date: 10/30/1992 Inception Share Price: $15.00 Inception NAV: $14.18 Tender Offer: No Term: No\""
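
Note that the output above still carries the outer quotes, and the pattern only matches separators with exactly one space, whereas the data in the question has runs of spaces between some of the quotes. Here is a minimal sketch of a variant, assuming the quotes are only delimiters and never part of a value (the shortened sample and the name char_clean are made up for illustration):

```r
# Shortened sample with two spaces between the quoted pieces
char <- '"Category:"  "Tax-Free Income-Pennsylvania"  "Ticker:"  "MPA"'

# Replace quote, one or more whitespace characters, quote with a single space
char_tidy <- gsub('"\\s+"', " ", char)

# Assumption: any quote still present is an outer delimiter, so drop it
char_clean <- gsub('"', "", char_tidy, fixed = TRUE)

char_clean
```

This still returns one long string, so it only helps if a cleaned-up string is what you are after.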

Here is something that might help:

First split the string using strsplit, then paste together the elements that belong together.

# split the string
vec <- unlist(strsplit(string, '(?=\")(?=\")', perl = TRUE))
vec <- vec[! vec %in% c(' ', '\"')]
# this is what vec looks like now
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:"                      "Tax-Free Income-Pennsylvania"   "Ticker:"                        "MPA"                           
# [6] "NAV Ticker:"    
#
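# now paste each key to the value that follows it
ind <- grepl(':.+',vec)  # TRUE for elements that already hold key and value
tmp <- vec[!ind]         # the rest alternates: key, value, key, value, ...
# build a new vector (complete elements first); assigning the pairs into vec[!ind]
# would silently recycle them and leave duplicated pairs at the end of vec
vec <- c(vec[ind], paste0(tmp[seq(1,length(tmp),2)], tmp[seq(2,length(tmp),2)]))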
head(vec)
# [1] "FundSponsor:Blackrock Advisors"        "Category:Tax-Free Income-Pennsylvania" "Ticker:MPA"                            "NAV Ticker:XMPAX"                     
# [5] "Average Daily Volume (shares):26,000"  "Average Daily Volume (USD):$0.335M" 

using the following data for string (define it before running the code above):

string = "\"FundSponsor:Blackrock Advisors\" \"Category:\" \"Tax-Free Income-Pennsylvania\" \"Ticker:\" \"MPA\" \"NAV Ticker:\" \"XMPAX\" \"Average Daily Volume (shares):\" \"26,000\" \"Average Daily Volume (USD):\" \"$0.335M\" \"Inception Date:\" \"10/30/1992\" \"Inception Share Price:\" \"$15.00\" \"Inception NAV:\" \"$14.18\" \"Tender Offer:\" \"No\" \"Term:\" \"No\""
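
The question describes the input as a character vector rather than a single quoted string. In that case, one possible sketch is to paste each key onto the element after it and then drop that element. It assumes every element ending in : is a key and is followed by its value; x and collapse_keys are hypothetical names, not taken from the answers above:

```r
# Hypothetical input: part of the question's data as a real character vector
x <- c("FundSponsor:Blackrock Advisors", "Category:", "Tax-Free Income-Pennsylvania",
       "Ticker:", "MPA", "Inception Share Price:", "$15.00")

# Illustrative helper (not part of base R): merge each "key:" with the next element
collapse_keys <- function(x) {
  key_pos <- which(grepl(":$", x))          # positions of elements ending in ":"
  key_pos <- key_pos[key_pos < length(x)]   # a key in last position has no successor
  if (length(key_pos) == 0L) return(x)      # nothing to merge
  x[key_pos] <- paste0(x[key_pos], x[key_pos + 1L])
  x[-(key_pos + 1L)]                        # drop the values that were absorbed
}

collapse_keys(x)
```

Unlike pasting odd and even elements, this keeps the original order even when elements that already hold key and value sit in the middle of the vector.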

Explanation

  • The regex (?=\")(?=\") consists of two identical lookaheads. The syntax (?=something) is a zero-width positive lookahead: it matches a position that is immediately followed by something, without consuming any characters. So the pattern simply reads: split the string at every position that precedes a \". Because of the way strsplit handles such zero-length matches, each \" ends up as an element of its own.
  • The strsplit(...) call above therefore also creates elements of the form \" and ' ' ( '\"Category:\" \"...' becomes the vector '\"';'Category:';'\"';' ';'...' ). So by using ! vec %in% c(...) we remove those unwanted elements.
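
To see this behaviour in isolation, here is a small sketch on a made-up two-element string (s and parts are illustrative names). It relies on strsplit stepping one character forward when a zero-length match sits at the start of the remaining text, which is what turns each quote into an element of its own:

```r
s <- '"a:" "b"'

# A single lookahead is enough; the doubled one in the answer matches the same positions
parts <- strsplit(s, '(?=")', perl = TRUE)[[1]]
parts

# Drop the quote and space elements, keeping only the content
parts[!parts %in% c(" ", '"')]
```

The content elements sit at positions 2, 6, 10, ... of parts, which is the pattern the addendum uses.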

Addendum

If the data contains elements of the form "string:" followed by a blank value " ", remove the line vec <- vec[! vec %in% c(' ', '\"')] from the code above and add these lines instead:

vec <- vec[seq(2L, length(vec), 4L)]
vec[vec == ' '] <- NA_character_
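
As a worked sketch of this variant, here is a made-up string in which the value for Term: is blank. It assumes that keys and values strictly alternate in the sample; keys, values and pairs are illustrative names:

```r
string <- '"Ticker:" "MPA" "Term:" " " "Tender Offer:" "No"'

vec <- unlist(strsplit(string, '(?=")', perl = TRUE))
vec <- vec[seq(2L, length(vec), 4L)]   # content sits at positions 2, 6, 10, ...
vec[vec == ' '] <- NA_character_       # mark blank values as missing

# Assumption: elements now alternate key, value, key, value, ...
keys   <- vec[seq(1L, length(vec), 2L)]
values <- vec[seq(2L, length(vec), 2L)]

# Keep the bare key when its value is missing instead of pasting the text "NA"
pairs <- ifelse(is.na(values), keys, paste0(keys, values))
pairs
```

Selecting by position keeps the blank value in its slot, so the pairing does not shift by one for the keys that follow.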
