
Text Mining in R with Persian

I'm looking to do some very simple text mining (frequency counts, bigrams, trigrams) on some Facebook posts in Persian that I've collected and archived in a CSV. Below is the script I would use with an English-language CSV of Facebook comments to unnest all individual words into their own column.

library(dplyr)
library(stringr)
library(tidytext)   # provides unnest_tokens() and stop_words

stp_tidy <- stc2 %>%
  filter(!str_detect(Message, "^RT")) %>%   # drop retweets
  mutate(text = str_replace_all(Message, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT", "")) %>%   # strip URLs and HTML entities
  unnest_tokens(word, text, token = "regex", pattern = reg_words) %>%   # reg_words is a tokenising regex defined elsewhere
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))          # keep only tokens containing Latin letters

Does anyone know of a method for applying unnest_tokens to Persian (or, to be specific, Dari) script?

Two options: the first example uses quanteda, the second uses udpipe.

Note that printing the tibbles containing Farsi is a bit odd: features and values tend to be printed in the wrong columns, but the data is stored correctly inside the objects for further processing. There are slight differences in output between the two options, but these tend to be negligible. Note that for reading in the data I used the readtext package, which tends to play nicely with quanteda.

1 quanteda

library(quanteda)
library(readtext)
# library(stopwords)

stp_test <- readtext("stp_test.csv", encoding = "UTF-8")

# inspect the non-empty values in the Message and text columns
stp_test$Message[stp_test$Message != ""]
stp_test$text[stp_test$text != ""]

# remove records with empty messages

stp_test <- stp_test[stp_test$Message != "", ]

stp_corp <- corpus(stp_test, 
                   docid_field = "doc_id",
                   text_field = "Message")


stp_toks <- tokens(stp_corp, remove_punct = TRUE)
stp_toks <- tokens_remove(stp_toks, stopwords::stopwords(language = "fa", source = "stopwords-iso"))


# step for creating ngrams 1-3 can be done here, after removing stopwords. 
# stp_ngrams <- tokens_ngrams(stp_toks, n = 1L:3L, concatenator = "_")

stp_dfm <- dfm(stp_toks)
textstat_frequency(stp_dfm)  # in quanteda >= 3.0, load library(quanteda.textstats) first

# transform into tidy data.frame
library(dplyr)
library(tidyr)
quanteda_tidy_out <- convert(stp_dfm, to = "data.frame") %>% 
  pivot_longer(-document, names_to = "features")
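
To get the unigram-to-trigram counts asked about in the question, the commented-out tokens_ngrams() step above can be fed into a dfm in the same way as the plain tokens. A minimal sketch, assuming stp_toks from the block above:

# unigrams, bigrams and trigrams built from the stopword-filtered tokens
stp_ngrams <- tokens_ngrams(stp_toks, n = 1L:3L, concatenator = "_")
stp_ngram_dfm <- dfm(stp_ngrams)

# ranked frequencies of the n-grams
textstat_frequency(stp_ngram_dfm)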

2 udpipe

library(udpipe)
# download and load the pre-trained Persian (Seraji) UD model
model <- udpipe_download_model(language = "persian-seraji")
ud_farsi <- udpipe_load_model(model$file_model)

# use stp_test from the quanteda example
x <- udpipe_annotate(ud_farsi, x = stp_test$Message, doc_id = stp_test$doc_id)
stp_df <- as.data.frame(x)


# selecting only nouns and verbs and removing stopwords 
ud_tidy_out <- stp_df %>% 
  filter(upos %in% c("NOUN", "VERB"),
         !token %in% stopwords::stopwords(language = "fa", source = "stopwords-iso")) 
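
For the frequency and n-gram counts on the udpipe side, a minimal sketch, assuming stp_df and ud_tidy_out from above; txt_freq() and txt_nextgram() are helper functions shipped with udpipe:

library(dplyr)

# term frequencies of the filtered tokens
txt_freq(ud_tidy_out$token)

# bigrams and trigrams built per document from the token sequence
stp_df %>%
  group_by(doc_id) %>%
  mutate(bigram  = txt_nextgram(token, n = 2),
         trigram = txt_nextgram(token, n = 3)) %>%
  ungroup()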

Both packages have good vignettes and support pages.
