Text Mining in R - Remove Rows from Text File Starting With Keywords
I am reading a text file into R as follows:
test<-readLines("D:/AAPL MSFT Earnings Calls/Test/Test.txt")
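The file was converted from a PDF and retains some header data that I want to get rid of. Those lines begin with words such as "Page" or "Market Cap".
How can I remove all lines in the TXT file that begin with these keywords? This is as opposed to removing lines that merely contain the word.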
Using one of the answers below, I modified my code to read:
setwd("C:/Users/George/Google Drive/PhD/Strategic agility/Source Data/Peripherals Earnings Calls 2016")
text1<-readLines("test.txt")
text
library(purrr)
library(stringr)
text1 <- "foo
Page, bar
baz
Market Cap, qux"
text1 <- readLines(con = textConnection(file))
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")
text1 %>% discard(~ any(str_detect(.x, ignore_patterns)))
text1
This is the output I get:
> text1
[1] "foo" "Page, bar" "baz" "Market Cap, qux"
What are the foo / baz / qux characters? Thanks.
# Once the lines have been read and stored in a data.frame,
# subset it as follows:
x <- grepl("^(Page|Market Cap)", df$id)  # df is your data.frame and 'id' is the
                                         # column that holds the unwanted keywords
df <- df[!x, ]  # keep only the rows that do not start with those keywords
The ^ anchors the match to the start of the string. So grepl returns TRUE if a line starts with Page or (the | means "or") Market Cap.
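The same test can be applied directly to the character vector that readLines() returns, without building a data.frame first. The following is a minimal sketch under that assumption; the sample lines and the names `lines`, `starts_with_keyword` and `cleaned` are made up for illustration.

```r
# Sample lines standing in for the result of readLines(); the contents are invented.
lines <- c("foo", "Page, bar", "baz", "Market Cap, qux", "See Page 3")

# TRUE for every line that starts with "Page" or "Market Cap".
starts_with_keyword <- grepl("^(Page|Market Cap)", lines)

# Keep the remaining lines. The result must be assigned; `lines` itself is not modified.
cleaned <- lines[!starts_with_keyword]
print(cleaned)
```

Because of the ^ anchor, "See Page 3" is kept: it contains the keyword but does not start with it.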
library(purrr)
library(stringr)
file <- "foo
Page, bar
baz
Market Cap, qux"
test <- readLines(con = textConnection(file))
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")
test %>% discard(~ any(str_detect(.x, ignore_patterns)))
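Two points that may explain the output shown in the question. First, foo, baz and qux are only placeholder lines standing in for the real file contents. Second, the filtering step returns a new vector and leaves its input unchanged, so the result has to be assigned before it is printed or saved. Below is a base R sketch of the same idea that does not need purrr or stringr; the names `sample_text`, `combined`, `cleaned` and `out_path` are illustrative, and writing to a temporary file is an assumption about what you want to do with the result.

```r
# Placeholder text standing in for the real file.
sample_text <- "foo
Page, bar
baz
Market Cap, qux"

con <- textConnection(sample_text)
test <- readLines(con)
close(con)

# Join the patterns into one alternation: "^Page,|^Market\\s+Cap,"
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")
combined <- paste(ignore_patterns, collapse = "|")

# Assign the filtered lines; `test` itself stays unchanged.
cleaned <- test[!grepl(combined, test)]

# Write the cleaned lines out and read them back to check the round trip.
out_path <- tempfile(fileext = ".txt")
writeLines(cleaned, out_path)
round_trip <- readLines(out_path)
print(round_trip)
```

With the real data, replace the textConnection() part with readLines() on your own file path and point writeLines() at wherever the cleaned file should go.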