
Convert string into binary vector in R

I'm trying to cluster a set of journals by their descriptors. So far I have been using string distances, but I'm thinking of turning the descriptors into a binary vector instead, to avoid issues such as matching "Catalysis" with "Analysis", or long strings matching on (undesired) partial matches.

To implement this idea, I've split every descriptor the journals may have into a set of 266 strings (isolated_cat), in alphabetical order.

dput(head(isolated_cat))
c("Accounting", "AcousticsUltrasonics", "AdvancedSpecializedNursing", 
"AerospaceEngineering", "Aging", "AgriculturalBiologicalSciences"
)

For each journal in my dataframe, I have a column with a set of descriptors, e.g.

journals_STEM$Categories4dist[1]
[1] "Biomaterials ElectronicOpticalMagneticMaterials Energy MaterialsChemistry SurfacesCoatingsFilms"

The output I'm expecting is a vector of length 266, with a 0 or 1 for each category in isolated_cat indicating whether the journal's descriptors include that category (afterwards I was thinking of trying PCA and different clustering methods to separate the journals into groups).

First, I tried

as.numeric(isolated_cat %in% aux$Categories4dist[i])

which obviously (as I noticed later) only works for journals defined by a single category. I've been trying different combinations of grep, but without luck. Is there a straightforward way of achieving this? The only solutions I have found so far are far too convoluted, and I think I'm missing something obvious.

Something like:

library(stringr)

isolatedcat <- c("Accounting", "AcousticsUltrasonics", "AdvancedSpecializedNursing", "AerospaceEngineering", "Aging", "AgriculturalBiologicalSciences", 'Biomaterials')


Categories4dist <- str_split('Biomaterials ElectronicOpticalMagneticMaterials Energy MaterialsChemistry SurfacesCoatingsFilms', ' ', simplify = TRUE)

as.data.frame(sapply(isolatedcat, function(x) as.numeric(str_detect(x, Categories4dist))))

which gives:

  Accounting AcousticsUltrasonics AdvancedSpecializedNursing
1          0                    0                          0
2          0                    0                          0
3          0                    0                          0
4          0                    0                          0
5          0                    0                          0
  AerospaceEngineering Aging AgriculturalBiologicalSciences Biomaterials
1                    0     0                              0            1
2                    0     0                              0            0
3                    0     0                              0            0
4                    0     0                              0            0
5                    0     0                              0            0
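Note that the result above has one row per descriptor of the journal rather than a single row per journal. If one 0/1 vector per journal is wanted, as the question describes, a possible approach is to split each descriptor string into whole tokens and test exact membership with %in%. This is only a minimal sketch: it assumes descriptors are separated by single spaces, and the second journal string is made up for illustration.

```r
# Category list from the answer above.
isolatedcat <- c("Accounting", "AcousticsUltrasonics", "AdvancedSpecializedNursing",
                 "AerospaceEngineering", "Aging", "AgriculturalBiologicalSciences",
                 "Biomaterials")

# Two example journals; the second string is hypothetical.
descriptors <- c("Biomaterials ElectronicOpticalMagneticMaterials Energy",
                 "Aging Accounting")

# Split every descriptor string on single spaces (assumed separator).
tokens <- strsplit(descriptors, " ", fixed = TRUE)

# For each journal, flag the categories that appear as whole tokens.
# vapply returns one column per journal, so transpose to one row per journal.
binary <- t(vapply(tokens,
                   function(tok) as.integer(isolatedcat %in% tok),
                   integer(length(isolatedcat))))
colnames(binary) <- isolatedcat
binary
```

Because %in% compares whole tokens, no partial matches can occur, and the resulting matrix can be passed on to PCA or a clustering function directly.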

Here's a base R option with lapply and grepl:

journals_STEM[isolated_cat] <- lapply(isolated_cat, function(x) 
            +(grepl(x, journals_STEM$Categories4dist, ignore.case = TRUE)))

The above also matches substrings, meaning "at" would match "cat". If you need an exact match, use word boundaries (\\b).

journals_STEM[isolated_cat] <- lapply(paste0('\\b', isolated_cat, '\\b'), 
      function(x) +(grepl(x, journals_STEM$Categories4dist, ignore.case = TRUE)))
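To see the difference between the two variants, here is a small sketch on a single descriptor string. The category "Chemistry" is hypothetical and chosen only because it is a substring of "MaterialsChemistry". The sketch also assumes that category names contain only letters, as in the examples shown, so no regex metacharacters need escaping.

```r
# One descriptor string, shortened from the example in the question.
desc <- "Biomaterials Energy MaterialsChemistry"

# "Chemistry" is a hypothetical category used to provoke a partial match.
cats <- c("Chemistry", "Energy")

# Plain pattern: matches anywhere inside the string, including inside
# a longer descriptor such as "MaterialsChemistry".
plain <- vapply(cats,
                function(x) +grepl(x, desc, ignore.case = TRUE),
                integer(1))

# Word-boundary pattern: matches only when the category is a whole word.
bounded <- vapply(paste0("\\b", cats, "\\b"),
                  function(x) +grepl(x, desc, ignore.case = TRUE),
                  integer(1))
names(bounded) <- cats

plain    # Chemistry = 1, Energy = 1
bounded  # Chemistry = 0, Energy = 1
```

With the plain pattern "Chemistry" is flagged even though the journal only lists "MaterialsChemistry"; with \\b on both sides it is not.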
