
Most frequent words in a French text

I am using the Python nltk package to find the most frequent words in a French text, but it doesn't seem to be working properly. Here is my code:

#-*- coding: utf-8 -*-

#nltk: package for text analysis
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import nltk
import tokenize
import codecs
import unicodedata


#output French accents correctly
def convert_accents(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')



### MAIN ###

#openfile
text_temp=codecs.open('text.txt','r','utf-8').readlines()

#put content in a list
text=[]
for word in text_temp:
    word=word.strip().lower()
    if word!="":
        text.append(convert_accents(word))

#tokenize the list
text=nltk.tokenize.word_tokenize(str(text))

#use FreqDist to get the most frequents words
fdist = FreqDist()
for word in text:
    fdist.inc(word)
print "BEFORE removing meaningless words"
print fdist.items()[:10]

#use stopwords to remove articles and other meaningless words
for sw in stopwords.words("french"):
     if fdist.has_key(sw):
          fdist.pop(sw)
print "AFTER removing meaningless words"
print fdist.items()[:10]

Here is the output:

BEFORE removing meaningless words
[(',', 85), ('"', 64), ('de', 59), ('la', 47), ('a', 45), ('et', 40), ('qui', 39), ('que', 33), ('les', 30), ('je', 24)]
AFTER removing meaningless words
[(',', 85), ('"', 64), ('a', 45), ('les', 30), ('parce', 15), ('veut', 14), ('exigence', 12), ('aussi', 11), ('pense', 11), ('france', 10)]

My problem is that the stopword list does not discard all the meaningless words. For example, ',' is not a word and should be removed, and 'les' is an article that should also be removed.

How can I fix this?

The text I used can be found at this page: http://www.elysee.fr/la-presidence/discours-d-investiture-de-nicolas-sarkozy/

Usually it's a better idea to use a stopword list of your own. For this purpose, you can get a list of French stopwords from here. The article 'les' is on that list as well. Create a text file of them and use the file to remove stopwords from your corpus. Then, for punctuation, you have to write a punctuation removal function. How you should write it depends heavily on your application, but just to show you a few examples that would get you started (a combined sketch follows after them), you can write:

import string
t = "hello, eric! how are you?"
print t.translate(string.maketrans("",""), string.punctuation)

and the output is:

hello eric how are you

or, another way is to simply write:

t = t.split()
for w in t:
    w = w.strip('\'"?,.!_+=-')
    print w

So, it really depends on how you need them to be removed. In certain scenarios these methods might not give exactly what you want, but you can build on them.
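To tie the stopword file and the punctuation removal together, here is a minimal sketch of the whole pipeline, written in the same Python 2 style as the rest of this thread. It assumes a stopword file named french_stopwords.txt with one word per line in UTF-8 (the file name is only an illustration), and it keeps the accents instead of stripping them so that accented stopwords still match the list:

# -*- coding: utf-8 -*-
import codecs
import string
import nltk
from nltk.probability import FreqDist

# load a custom stopword list, one word per line (hypothetical file name)
custom_stopwords = set()
for line in codecs.open('french_stopwords.txt', 'r', 'utf-8').readlines():
    w = line.strip().lower()
    if w:
        custom_stopwords.add(w)

# read and tokenize the raw text itself, not str() of a list of lines
raw = codecs.open('text.txt', 'r', 'utf-8').read().lower()
tokens = nltk.tokenize.word_tokenize(raw)

# drop stopwords and tokens made up entirely of punctuation
punctuation = set(string.punctuation)
words = [t for t in tokens
         if t not in custom_stopwords
         and not all(c in punctuation for c in t)]

fdist = FreqDist(words)
print fdist.items()[:10]   # on newer NLTK versions, use fdist.most_common(10)

Note that the tokenizer is applied to the text directly; calling word_tokenize(str(text)) on the list of lines, as in the question, is probably where many of the stray ',' and '"' tokens come from, because str() of a list includes the commas and quotes of its printed representation.

Let me know if you have any further questions.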
