
R - partial string matching

I have a problem doing partial matches between a list of strings and a data frame.

My df has this structure:

> df
    mrun                                        address stat
 8988741 cerro pedregal 8536 , Antofagasta, Antofagasta   OK
17625851              rancagua 2777 , Iquique, Tarapacá   OK
 9423953              picarte 4100 , Valdivia, Los Ríos   OK
 3459140           balmaceda 935 , Temuco, La Araucanía   OK
24507700             rancagua 1940, La Serena, Coquimbo   OK

and I have a list of strings with these values:

> address_list
c("balmaceda", "rancagua", "bombero garrido")

How can I select the rows that match any of the elements in the list?


This is my desired output:

> df_solution
    mrun                                        address stat
17625851              rancagua 2777 , Iquique, Tarapacá   OK
 3459140           balmaceda 935 , Temuco, La Araucanía   OK
24507700             rancagua 1940, La Serena, Coquimbo   OK 

Edit: The solution given by saurav shekhar works for an address_list with a few elements. In my case, the real address_list has more than 5,000 elements and df has 200,000 rows, and grep throws this error:

> df$flag[grep(address_list,df$address)]<- 1
Error in grep(address_list,df$address) : 
  invalid regular expression, reason 'Out of memory'

I have a lot of RAM, so I don't think that is the cause. I looked for a solution but couldn't find one. The only closely related thread on SO is this link, but I didn't know how to apply it to my case.

My session info:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Spanish_Latin America.1252  LC_CTYPE=Spanish_Latin America.1252   
[3] LC_MONETARY=Spanish_Latin America.1252 LC_NUMERIC=C                          
[5] LC_TIME=Spanish_Latin America.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gridExtra_2.2.1 ggplot2_2.2.0   plyr_1.8.4      reshape_0.8.6  

The first thing you need to do is build the matching pattern in the following format:

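# Collapse the search terms into one regular expression joined by "|" (alternation).
# Multi-word terms such as "bombero garrido" stay intact.
address_list <- paste(address_list, collapse = "|")
# address_list is now "balmaceda|rancagua|bombero garrido"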
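As a quick check of what such a pattern matches, here is a minimal sketch. The names `terms`, `pattern`, `samples` and `hits` are introduced only for illustration, and the third sample string is made up to show that a partial phrase does not match:

```r
# Search terms from the question
terms <- c("balmaceda", "rancagua", "bombero garrido")

# Join the terms with "|" so the regex matches any one of them
pattern <- paste(terms, collapse = "|")

# Hypothetical sample strings: the last one contains only part of "bombero garrido"
samples <- c("rancagua 2777 , Iquique", "picarte 4100 , Valdivia", "bombero 12")

# grepl returns one logical value per input string
hits <- grepl(pattern, samples)
print(pattern)
print(hits)
```

Note that this treats every term as a regular expression, so a term containing characters such as `.` or `(` would be interpreted as regex syntax rather than literally.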

Then, using grep, you can do partial matching on your data and create a flag for the rows to keep.

# grep(address_list,df$address) Try this and note the output for your understanding of `grep`

df$flag<- NA
df$flag[grep(address_list,df$address)]<- 1 #flag rows with matching values
df_new<- df[which(df$flag==1),]
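For the larger case described in the question's edit (more than 5,000 terms against 200,000 rows), one possible workaround is to avoid a single huge regular expression and test each term separately as a literal string. The following is a minimal sketch rather than a benchmarked solution. It assumes the terms should be matched literally (hence `fixed = TRUE`), rebuilds the sample data from the question, and uses `keep` as an illustrative name:

```r
# Sample data from the question
df <- data.frame(
  mrun = c(8988741, 17625851, 9423953, 3459140, 24507700),
  address = c("cerro pedregal 8536 , Antofagasta, Antofagasta",
              "rancagua 2777 , Iquique, Tarapacá",
              "picarte 4100 , Valdivia, Los Ríos",
              "balmaceda 935 , Temuco, La Araucanía",
              "rancagua 1940, La Serena, Coquimbo"),
  stat = "OK",
  stringsAsFactors = FALSE
)
address_list <- c("balmaceda", "rancagua", "bombero garrido")

# Start with no rows selected, then OR in the matches for each term.
# fixed = TRUE matches the term as a plain substring, so no large
# regular expression is ever compiled.
keep <- rep(FALSE, nrow(df))
for (term in address_list) {
  keep <- keep | grepl(term, df$address, fixed = TRUE)
}

# Subset with the logical vector directly; no flag column is needed
df_solution <- df[keep, ]
print(df_solution)
```

The loop runs one pass over `df$address` per term, so the cost grows with the number of terms times the number of rows. If regex features are needed, a similar loop over chunks of terms joined with `|` is another option.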
