R：德語文本上的tm包

Question

我想對德語數據集進行情感分類，我正在使用以下代碼，該代碼在英語文本上可以正常工作，但是在德語文本的情況下會出現錯誤。

這是我的以下代碼：

#loading required libraries
library(tm)
library(readxl)
library(data.table)
library(plyr)
library(dplyr)
library(zoo)
library(ggplot2)
library(ranger)
library(e1071)

df<- data.table(read_excel("data/German2datasets.xlsx", skip = 1))

# An abstract function to preprocess a text column
preprocess <- function(text_column)
{
  # Use tm to get a doc matrix
  corpus <- Corpus(VectorSource(text_column))
  # all lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # remove punctuation
  corpus <- tm_map(corpus, content_transformer(removePunctuation))
  # remove numbers
  corpus <- tm_map(corpus, content_transformer(removeNumbers))
  # remove stopwords
  corpus <- tm_map(corpus, removeWords, stopwords("german"))
  # stem document
  corpus <- tm_map(corpus, stemDocument)
  # strip white spaces (always at the end)
  corpus <- tm_map(corpus, stripWhitespace)
  # return
  corpus    
}

# Get preprocess training and test data
corpus <- preprocess(df$TEXT)


# Create a Document Term Matrix for train and test
# Just including bi and tri-grams

Sys.setenv(JAVA_HOME='D://Program Files/Java/jre1.8.0_112') # for 32-bit version
library(rJava)
library(RWeka)

# Bi-Trigram tokenizer function (you can always get longer n-grams)
bitrigramtokeniser <- function(x, n) {
  RWeka:::NGramTokenizer(x, RWeka:::Weka_control(min = 2, max = 3))
}


"
Remove remove words <=2
TdIdf weighting
Infrequent (< than 1% of documents) and very frequent (> 80% of documents) terms not included
"

dtm <- DocumentTermMatrix(corpus, control=list(wordLengths=c(2, Inf), 
                                               tokenize = bitrigramtokeniser, 
                                               weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                               bounds=list(global=c(floor(length(corpus)*0.01), floor(length(corpus)*.8)))))


sent <- df$Sentiment

# Variable selection
# ~~~~~~~~~~~~~~~~~~~~
"
For dimension reduction.
The function calculates chi-square value for each phrase and keeps phrases with highest chi_square values
Ideally you want to put variable selection as part of cross-validation.

chisqTwo function takes:
document term matrix (dtm), 
vector of labels (labels), and 
number of n-grams you want to keep (n_out)

"
chisqTwo <- function(dtm, labels, n_out=2000){
  mat       <- as.matrix(dtm)
  cat1      <-  colSums(mat[labels==T,])        # total number of times phrase used in cat1 
  cat2      <-  colSums(mat[labels==F,])        # total number of times phrase used in cat2 
  n_cat1        <-  sum(mat[labels==T,]) - cat1     # total number of phrases in soft minus cat1
  n_cat2        <-  sum(mat[labels==F,]) - cat2     # total number of phrases in hard minus cat2

  num       <- (cat1*n_cat2 - cat2*n_cat1)^2
  den       <- (cat1 + cat2)*(cat1 + n_cat1)*(cat2 + n_cat2)*(n_cat1 + n_cat2)
  chisq         <- num/den

  chi_order <- chisq[order(chisq)][1:n_out]   
  mat       <- mat[, colnames(mat) %in% names(chi_order)]

}


n <- nrow(dtm)
shuffled <- dtm[sample(n),]
train_dtm <- shuffled[1:round(0.7 * n),]
test_dtm <- shuffled[(round(0.7 * n) + 1):n,]


"
With high dimensional data, test matrix may not have all the phrases training matrix has.
This function fixes that - so that test matrix has the same columns as training.
testmat takes column names of training matrix (train_mat_cols), and 
test matrix (test_mat)
and outputs test_matrix with the same columns as training matrix
"
# Test matrix maker
testmat <- function(train_mat_cols, test_mat){  
  # train_mat_cols <- colnames(train_mat); test_mat <- as.matrix(test_dtm)
  test_mat  <- test_mat[, colnames(test_mat) %in% train_mat_cols]

  miss_names    <- train_mat_cols[!(train_mat_cols %in% colnames(test_mat))]
  if(length(miss_names)!=0){
    colClasses  <- rep("numeric", length(miss_names))
    df          <- read.table(text = '', colClasses = colClasses, col.names = miss_names)
    df[1:nrow(test_mat),] <- 0
    test_mat    <- cbind(test_mat, df)
  }
  as.matrix(test_mat)
}

# Train and test matrices
train_mat <- chisqTwo(train_dtm, train$Sentiment)
test_mat  <- testmat(colnames(train_mat), as.matrix(test_dtm))

dim(train_mat)
dim(test_mat)


n <- nrow(df)
shuffled <- df[sample(n),]
train_data <- shuffled[1:round(0.7 * n),]
test_data <- shuffled[(round(0.7 * n) + 1):n,]

train_mat <- as.data.frame(as.matrix(train_mat))
colnames(train_mat) <- make.names(colnames(train_mat))
train_mat$Sentiment <- train_data$Sentiment

test_mat <- as.data.frame(as.matrix(test_mat))
colnames(test_mat) <- make.names(colnames(test_mat))
test_mat$Sentiment <- test_data$Sentiment

train_mat$Sentiment <- as.factor(train_mat$Sentiment)
test_mat$Sentiment <- as.factor(test_mat$Sentiment)

然后，我將在同一行上應用插入符號ML算法，以預測列車上的情緒並創建測試數據。

我在“預處理”功能中收到以下錯誤。

> corpus <- preprocess(df$TEXT)
 Show Traceback

 Rerun with Debug
 Error in FUN(content(x), ...) : 
  invalid input 'Ich bin seit Jahren zufrieden mit der Basler VersicherubgðŸŒº' in 'utf8towcs'

數據-https://drive.google.com/open?id=1T_LpL2G8upztihAC2SQeVs4YCPH - yfOs

Answer 1

嘗試使用其他程序包進入Weka之前的預備階段如何？ 這是等效的（和更簡單的恕我直言）：

library("quanteda")
library("readtext")

# reads in the spreadsheet and creates the corpus
germancorp <- 
    readtext("data/German2datasets.xlsx", text_field = "TEXT")) %>%
    corpus()

# does all of the steps of your preprocess() function
dtm <- dfm(germancorp, ngrams = c(2, 3),
           tolower = TRUE,
           remove_punct = TRUE,
           remove_numbers = TRUE,
           remove = stopwords("german"),
           stem = TRUE)

# remove words with only a single count
dtm <- dfm_trim(dtm, min_count = 2)

# form tf-idf weights - change the base argument from default 10 if you wish
dtm <- dfm_tfidf(dtm)

# if you really want a tm formatted DocumentTermMatrix
convert(dtm, to = "tm")

盡管不清楚您正在做什么，但Quanteda軟件包可以執行您列出的其他步驟。 （您的問題集中在preprocess()失敗上，所以我回答了。）

Answer 2

如果尚未找到原因，則為： 'utf8towcs'中的無效輸入

它是文件的編碼（取決於您的[虛擬]環境和當前的sys-options，當然還取決於創建時將文件保存到磁盤的方式）

解決方法是：

usableText=str_replace_all(tweets$text,"[^[:graph:]]", " ")

要么

your_corpus<- tm_map(your_corpus,toSpace,"[^[:graph:]]")

R：德語文本上的tm包

問題描述

2 個解決方案

解決方案1
1 2018-02-22 21:54:37

解決方案2
0 2018-04-02 17:45:37

R：德語文本上的tm包

問題描述

2 個解決方案

解決方案1 1 2018-02-22 21:54:37

解決方案2 0 2018-04-02 17:45:37

解決方案1
1 2018-02-22 21:54:37

解決方案2
0 2018-04-02 17:45:37