
How to show corpus text in R tm package?

I'm new to R and the tm package, so please excuse my silly question ;-) How can I show the text of a plain-text corpus in the R tm package?

I loaded a corpus of 323 plain-text files:

src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)

But when I call the corpus with:

corpus[[1]]

I always get output like this instead of the corpus text itself:

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 144
Content:  chars: 141
Content:  chars: 224
Content:  chars: 75
Content:  chars: 105

How can I show the text of the corpus?

Thanks!

UPDATE Reproducible sample: I have tried it with the built-in sample text:

> data("crude")
> crude
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
> crude[1]
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
> crude[[1]]
<<PlainTextDocument>>
Metadata:  15
Content:  chars: 527

How can I print the text of the document?

UPDATE 2: Session info:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-1  NLP_0.1-7

loaded via a namespace (and not attached):
[1] parallel_3.1.3 slam_0.1-32    tools_3.1.3   

This works for me to print the content text, using the latest version of tm:

corpus[[1]]$content

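Note: this is more or less what Ricky suggested in the last comment. Sorry, I wanted to post this as a comment, but I only have 25 reputation (a minimum of 50 is required to comment).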
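
A minimal sketch of the same idea on the built-in crude corpus. It assumes tm 0.6 or later, where content() and as.character() are available for a PlainTextDocument and should return the same text as $content; the variable names are only illustrative.

```r
library(tm)

data("crude")                      # sample corpus shipped with tm
doc <- crude[[1]]                  # a single PlainTextDocument

txt_field    <- doc$content        # direct field access, as in the answer above
txt_accessor <- content(doc)       # accessor function (assumed equivalent)
txt_char     <- as.character(doc)  # character coercion (assumed equivalent)

# writeLines() prints the text with its real line breaks instead of "\n" escapes
writeLines(txt_char)
```

Using the accessor rather than the field keeps the code independent of how the document object is stored internally.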

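You can try converting the corpus text into a data frame and accessing the text you need from the data frame itself. I use the built-in sample data "crude" (from the tm package) as an example.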

data("crude")
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)

dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
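
A hedged variant of the data frame approach for corpora whose documents hold several lines each, as the output in the question suggests for files read from a directory. This is a sketch under the assumption that as.character() returns one string per line for such documents; the column names doc and text are made up for illustration.

```r
library(tm)

data("crude")

# Collapse each document to a single string so every document becomes
# exactly one row, even if its content has several lines.
texts <- sapply(crude, function(d) paste(as.character(d), collapse = "\n"))

df <- data.frame(doc  = seq_along(texts),
                 text = unname(texts),
                 stringsAsFactors = FALSE)

# Print the first document with real line breaks
cat(df$text[1], "\n")
```

Collapsing first avoids unlist() silently producing more rows than there are documents.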

Here is a simple, direct way to show the corpus text:

strwrap(corpus[[1]])

For the crude data, this outputs

[1] "Diamond Shamrock Corp said that effective today it had cut its contract"      
[2] "prices for crude oil by 1.50 dlrs a barrel.  The reduction brings its posted" 
[3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said."   
[4] "\"The price reduction today was made in the light of falling oil product"     
[5] "prices and a weak crude oil market,\" a company spokeswoman said.  Diamond is"
[6] "the latest in a line of U.S. oil companies that have cut its contract, or"    
[7] "posted, prices over the last two days citing weak oil markets.  Reuter"

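I can confirm that as of tm 0.6-1, inspect no longer prints the text nicely. You can pair tm with the qdap package, which I maintain, for easy conversion to a data.frame, like this: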

library(qdap)
as.data.frame(crude)

To make it behave more like the old inspect, you can use:

as.data.frame(crude) %>%
    with(., invisible(sapply(text, function(x) {strWrap(x); cat("\n\n")})))

This looks like:

Diamond Shamrock Corp said that effective today it had cut its
contract prices for crude oil by 1.50 dlrs a barrel. The reduction
brings its posted price for West Texas Intermediate to 16.00 dlrs a
barrel, the copany said. "The price reduction today was made in the
light of falling oil product prices and a weak crude oil market," a
company spokeswoman said. Diamond is the latest in a line of U.S. oil
companies that have cut its contract, or posted, prices over the last
two days citing weak oil markets. Reuter


OPEC may be forced to meet before a scheduled June session to
readdress its production cutting agreement if the organization wants
to halt the current slide in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy as OPEC
thought. They may need an emergency meeting to sort out the
problems," said Daniel Yergin, director of Cambridge Energy Research
Associates, CERA. Analysts and oil industry sources said the problem
OPEC faces is excess oil supply in world oil markets. "OPEC's problem
is not a price problem but a production issue and must be addressed
in that way," said Paul Mlotok, oil analyst with Salomon Brothers
Inc. He said the market's earlier optimism about OPE
.
.
.

From the tm vignette, this works:

writeLines(as.character(doc.corpus[[8]]))

where "8" is whichever element number you want.
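
The same call can be wrapped in a loop to print several documents at once. A small sketch on the built-in crude corpus (assuming tm 0.6 or later); printing the first three documents is an arbitrary choice, and meta(doc, "id") is used only to label each one.

```r
library(tm)

data("crude")

n_show <- 3  # arbitrary number of documents to print
for (i in seq_len(n_show)) {
  doc <- crude[[i]]
  cat("== document", meta(doc, "id"), "==\n")  # label with the document id
  writeLines(as.character(doc))                # text with real line breaks
  cat("\n")
}
```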

We can get the content of every item in the corpus:

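data("crude")
# extract the text of every document
out <- sapply(crude, function(x) x$content)
out

# optionally export: writeCorpus() takes the corpus itself rather than the
# extracted text and writes one file per document (the directory is assumed to exist)
writeCorpus(crude, path = "outputdir")
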
> inspect(crude[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

$`reut-00001.xml`
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter

I ran into the same problem, and corpus[[1]]$content worked for me.
