
How to iterate through an R list of character vectors to modify each element by keeping all characters up to and including one character past comma

I have an R list of approx. 90 character vectors (representing 90 documents), each containing several author names. As a means to stem (or normalize) the names, I'd like to keep everything up to the comma, the white-space, and the first character after it, and drop all characters after that in each element. So, for example, "Smith, Joe" would become "Smith, J" (or "Smith J" would be fine).

1) I've tried using lapply with str_sub, but I can't seem to specify keeping one character past the comma (each element has different character length). 2) I also tried using lapply to split on the comma and make the last and first names separate elements, then using modify_depth to apply str_sub, but I can't figure out how to specifically use the str_sub only on the second element.

Fake sample to replicate issue.

doc1 = c("King, Stephen", "Martin, George")

doc2 = c("Clancy, Tom", "Patterson, James", "Stine, R.L.")

author = list(doc1,doc2)

What I've tried:

library(stringr)  # provides str_split() and str_sub()

library(purrr)    # provides modify_depth()

myfun1 = function(x,arg1){str_split(x, ", ")}

author = lapply(author, myfun1)

myfun2 = function(x,arg1){str_sub(x, end = 1L)}

f2 = modify_depth(author, myfun2, .depth = 2)

f2

[[1]]
[[1]][[1]]
[1] "K" "S"

[[1]][[2]]
[1] "M" "G"

Ultimately, I'm hoping after applying a solution, including maybe using unite(), the result will be as follows:

[[1]]
[[1]][[1]]
[1] "King S"

[[1]][[2]]
[1] "Martin G"
lapply( author, function(x) gsub( "(^.*, [A-Z]).*$", "\\1", x))

# [[1]]
# [1] "King, S"   "Martin, G"
# 
# [[2]]
# [1] "Clancy, T"    "Patterson, J" "Stine, R"    

What it does:
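lapply loops over the list of author vectors.
gsub replaces the part of each element matched by the regex "(^.*, [A-Z]).*$" with the first capture group (the part between the round brackets).
The regex "(^.*, [A-Z]).*$" captures everything from the start of the string (^.*) up to and including 'comma, space, capital letter' (, [A-Z]) in a group; the trailing .*$ matches the rest of the string, which is discarded. Because .* is greedy, the group ends at the last such sequence in the string, which for these names is the only one.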
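As a variation on the regex above, here is a minimal base R sketch that anchors on the first comma by using [^,]* (which cannot cross a comma) and captures the surname and the initial as two separate groups, so the comma can be kept or dropped. The helper names shorten_keep_comma and shorten_drop_comma are made up for illustration, and the sketch assumes every name has the form "Last, First"; elements without a comma are returned unchanged.

```r
# Sample data from the question
doc1 <- c("King, Stephen", "Martin, George")
doc2 <- c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author <- list(doc1, doc2)

# Group 1: everything before the first comma (the surname).
# Group 2: the single character that follows "comma, space".
pattern <- "^([^,]*), (.).*$"

# Hypothetical helpers: one keeps the comma, one drops it.
shorten_keep_comma <- function(x) sub(pattern, "\\1, \\2", x)
shorten_drop_comma <- function(x) sub(pattern, "\\1 \\2", x)

# Apply to every character vector in the list
with_comma <- lapply(author, shorten_keep_comma)
no_comma <- lapply(author, shorten_drop_comma)

print(with_comma)
print(no_comma)
```

sub is used instead of gsub because the pattern is anchored at both ends and can match at most once per element.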
