简体   繁体   中英

searching a string of text in R producing crossable

I have a data frame where columns have strings of text. I want to search within the text of multiple columns and count the occurrence of specific words, and then make a table that reports the co-occurence (by multiple columns).

Here's a small data frame to make this reproducible:

#create dataframe


TI <- c('Studies of pancreatic cancer',
        'colon, breast, and pancreatic cancer',
        'cancer and CVD paper', 
        'CVD paper'
        )
AB <- c('Autoimmune pancreatitis (AIP) is now considered a disease to be treated by diet.',
        'dietary patterns, positive associations were found between pancreatic cancer risk ',
        'Cardiovascular diseases (CVD) is linked to Diet',
        'making this match with pancreas'
        )

df <- data.frame(TI, AB)

#TI = title (of a paper)
#AB = abstract of the same paper

I then want to see how many titles have either the word 'cancer' or 'CVD' in the title (TI column), and the co-occurrence with the words 'pancreatic (or some variation)' or 'diet' in the abstract (AB column)

I can count the single occurence of the words I want with grep

pancreatic.ab <- length(grep("pancreat*", df$AB, ignore.case = TRUE, perl = FALSE))
pancreatic.ab

diet.ab <- length(grep("diet*", df$AB, ignore.case = TRUE, perl = FALSE))
diet.ab

cancer.ti <- length(grep("cancer*", df$TI, ignore.case = TRUE, perl = FALSE))
cancer.ti

CVD.ti <- length(grep("CVD", df$TI, ignore.case = TRUE, perl = FALSE))
CVD.ti

but not sure how to do this for a complicated cross tabs table.

any suggestions?

An example of desired output would be something like this示例表

You can bind the rows that have your targets in the TI columns, providing a label for the columns belonging to each (in my example ti ). Then, grouping by that label (ie group_by(ti) ), you can create the columns with the counts, using summarize()

bind_rows(
  df %>% filter(grepl("cancer*", TI, ignore.case=T)) %>% mutate(ti="CANCER"),
  df %>% filter(grepl("CVD*", TI, ignore.case=T)) %>% mutate(ti="CVD")
) %>% 
  group_by(ti) %>% 
  summarize(
    TotalAB = n(),
    Pancreatic = sum(grepl("pancreat*", AB, ignore.case=T)),
    Diet = sum(grepl("diet*", AB, ignore.case=T))
    )

Output:

  ti     TotalAB Pancreatic  Diet
  <chr>    <int>      <int> <int>
1 CANCER       3          2     3
2 CVD          2          1     1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM