
How to find search words from a table, in another table, and then create new columns of the results?

I'm trying to find specific words, listed in a tibble arbeit, in the Text column of another tibble, rawEng$Text. If one or more words are found, I want to create (or mutate) a new data frame iDataArbeit with two new columns: wArbeit for the found word(s), and iArbeit for the sum of their tf-idf scores from arbeit$tfidf.

My data:

arbeit:

     X1 feature                   tfidf
  <dbl> <chr>                     <dbl>
1     0 sick                      0.338
2     2 contract                  0.188
3     3 pay                       0.175
4     4 job                       0.170
5     5 boss                      0.169
6     6 sozialversicherungsnummer 0.169

rawEng:

  Gender Gruppe        Datum               Text
  <chr>  <chr>         <dttm>              <chr>                                           
1 F      Berlin Expats 2017-07-07 00:00:00 Anyone out there who's had to apply for Führung~
2 F      FAB           2018-01-18 00:00:00 Dear FAB, I am in need of a Führungszeugnis no ~
3 M      Free Advice ~ 2017-01-30 00:00:00 Dear Friends, i would like to ask you how can I~
4 M      FAB           2018-04-12 00:00:00 "Does anyone know why the \"Standesamt Pankow (~
5 F      Berlin Expats 2018-11-12 00:00:00 having trouble finding consistent information a~
6 F      Toytown Berl~ 2017-06-08 00:00:00 "Hello\r\n\r\nI have a question regarding Airbn~

I've tried dplyr::mutate with this code:

idataEnArbeit <- mutate(rawEng, wArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
                                                       arbeit$feature, NA),
                        iArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
                                         arbeit$tfidf, NA))

but all I get is a single word, and its tf-idf score, in the new columns iDataArbeit$wArbeit and iDataArbeit$iArbeit:

  Gender Gruppe          Datum               Text                           wArbeit iArbeit
  <chr>  <chr>           <dttm>              <chr>                          <chr>     <dbl>
1 F      Berlin | Girl ~ 2018-09-11 13:22:05 "11 septembre, 13:21     GGI ~ sick      0.338
2 F      ExpatBabies Be~ 2017-10-19 16:24:23 "16:24   Babysitter needed! B~ sick      0.338
3 F      Berlin | Girl ~ 2018-06-22 18:24:19 "gepostet.       Leonor Valen~ sick      0.338
4 F      'Neu in Berlin' 2018-09-18 23:19:51 "Hello guys, I am working wit~ sick      0.338
5 M      Free Advice Be~ 2018-04-27 08:49:24 "In need of legal advice: Wha~ sick      0.338
6 F      Free Advice Be~ 2018-07-04 18:33:03 "Is there somebody I can pay ~ sick      0.338

In summary: I want all words from arbeit$feature that are found in rawEng$Text to be added to iDataArbeit$wArbeit, and the sum of their tf-idf scores to be added to iDataArbeit$iArbeit.

Since I don't have your data, I'll load the gutenbergr library and play with Treasure Island.

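library(dplyr)       # %>%, pull(), left_join(), filter(), group_by(), summarize(), mutate()
library(tidytext)    # unnest_tokens()
library(gutenbergr)  # gutenberg_works(), gutenberg_download()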

## Now get the dataset
Treasure_Island <- gutenberg_works(title == "Treasure Island") %>% pull(gutenberg_id) %>% 
  gutenberg_download(.)

## and construct a toy arbeit:
arbeit <- data.frame(feature = c("island", "treasure", "to"),
                     tfidf = c(0.3,0.5,0.6))

## Break each line of text into its words (head() just keeps the example short; omit it on real data)
tidy_treasure <- unnest_tokens(Treasure_Island, feature, text, drop = FALSE) %>% 
  head(500)

## now bring the tfidf into tidy_treasure
df <- left_join(tidy_treasure, arbeit, by = "feature")

## and now you can sum the scores per line of text.
## To list the matched words, first drop the words that have no tfidf (no match in arbeit).

## Two options:
## 1. One row per line of text, with the summed score and the matched words.
df %>%
  filter(!is.na(tfidf)) %>%
  group_by(text) %>%
  summarize(SumTFIDF = sum(tfidf, na.rm = TRUE),
            Words = paste(feature, collapse = ";"))

## 2. Keep one row per found word; summarize() would collapse them, so use mutate() to add the per-line sum.
df %>%
  filter(!is.na(tfidf)) %>%
  group_by(text) %>%
  mutate(SumTFIDF = sum(tfidf, na.rm = TRUE))
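
Applied to data shaped like the question's, the same tokenize-and-join idea might look like the sketch below. It rests on a few assumptions: the Text values are invented, the id column and the scores table are helper names introduced here, and matching is on whole lowercased tokens (so "paying" would not match "pay"). A word that appears several times in one post is counted once.

```r
library(dplyr)
library(tidytext)

# Toy stand-ins for the question's tibbles (Text values are invented)
arbeit <- tibble(feature = c("sick", "contract", "pay", "job"),
                 tfidf   = c(0.338, 0.188, 0.175, 0.170))
rawEng <- tibble(Text = c("My boss will not pay me while I am sick.",
                          "Does anyone know a good dentist?",
                          "New job, new contract, same pay."))

# Hypothetical helper column so each post can be identified after tokenizing
rawEng_id <- mutate(rawEng, id = row_number())

# One row per (post, word), restricted to words listed in arbeit,
# then collapsed to one row per post
scores <- rawEng_id %>%
  unnest_tokens(feature, Text) %>%        # lowercases and strips punctuation
  distinct(id, feature) %>%               # count each word once per post
  inner_join(arbeit, by = "feature") %>%  # keep only the search words
  group_by(id) %>%
  summarize(wArbeit = paste(feature, collapse = ";"),
            iArbeit = sum(tfidf))

# Posts without any match keep NA in both new columns
iDataArbeit <- left_join(rawEng_id, scores, by = "id")
iDataArbeit
```

Using left_join() at the end keeps every row of rawEng, which seems closer to what the question asks for than the filtered output above; switch to inner_join() if only the matching posts are wanted.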


