
How can I tag and chunk French text using NLTK and Python?

I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:

  • Identify important entities (people, places, concepts)
  • Find significant changes in the importance (~= frequency) of those entities over time, using the article sequence number as a proxy for time (see the sketch after this list)
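
For the second goal, the counting step might look like this sketch, assuming the articles are already loaded into articlelist (step 1 below). Here extract_entities is a naive stand-in, counting capitalized tokens, for whatever tagger/chunker I end up with, and 'Paris' is just a made-up entity name:

    from collections import Counter

    def extract_entities(text):
        # Naive stand-in: treat capitalized tokens as entity candidates,
        # to be replaced by a real PoS tagger + chunker
        return [w.strip('.,;:!?') for w in text.split() if w[:1].isupper()]

    # Bucket articles into windows of 1,000, using list position as a time proxy
    window_size = 1000
    windows = []
    for start in range(0, len(articlelist), window_size):
        counts = Counter()
        for article in articlelist[start:start + window_size]:
            counts.update(extract_entities(' '.join(article['body'])))
        windows.append(counts)

    # Relative frequency of one entity across windows
    for i, counts in enumerate(windows):
        total = sum(counts.values()) or 1
        print(i, counts['Paris'] / float(total))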

The steps I've taken so far:

  1. Imported the data into a Python list:

         import json
         json_articles = open('articlefile.json')
         articlelist = json.load(json_articles)
  2. Selected a single article to test, and concatenated the body text into a single string:

         txt = ' '.join(articlelist[10000]['body'])
  3. Loaded a French sentence tokenizer and split the string into a list of sentences:

         import nltk
         french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
         sentences = french_tokenizer.tokenize(txt)
  4. Attempted to split the sentences into words using the WhitespaceTokenizer:

         from nltk.tokenize import WhitespaceTokenizer
         wst = WhitespaceTokenizer()
         tokens = [wst.tokenize(s) for s in sentences]

This is where I'm stuck, for the following reasons:

  • NLTK doesn't have a built-in tokenizer which can split French into words. Whitespace doesn't work well, particularly because it won't correctly split on apostrophes (see the sketch after this list).
  • Even if I were to use regular expressions to split the text into individual words, there's no French PoS (part-of-speech) tagger that I can use to tag those words, and no way to chunk them into logical units of meaning.
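
For example, a regular-expression tokenizer can at least split off elided forms (l', d', qu', ...). This is only a sketch; the character class and words like aujourd'hui would need more careful handling:

    from nltk.tokenize import RegexpTokenizer

    # The first alternative grabs an elided prefix ending in an apostrophe
    # (l', qu', ...); the second grabs a plain word; punctuation is dropped
    tokenizer = RegexpTokenizer(r"[a-zA-ZÀ-ÿ]+'|[a-zA-ZÀ-ÿ]+")
    print(tokenizer.tokenize("L'homme qu'elle aime est arrivé."))
    # ["L'", 'homme', "qu'", 'elle', 'aime', 'est', 'arrivé']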

For English, I could tag and chunk the text like so:

    # PoS-tag each tokenized sentence, then run named-entity chunking over the batch
    tagged = [nltk.pos_tag(token) for token in tokens]
    chunks = nltk.batch_ne_chunk(tagged)

My main options (in order of current preference) seem to be:

  1. Use nltk-trainer to train my own tagger and chunker.
  2. Use the Python wrapper for TreeTagger for just this part, since TreeTagger can already tag French and someone has written a wrapper which calls the TreeTagger binary and parses the results.
  3. Use a different tool altogether.

If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?

If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?

What approaches have French-speaking users of NLTK taken to PoS-tag and chunk text?

There is also TreeTagger (supporting a French corpus) with a Python wrapper. This is the solution I am currently using and it works quite well.
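
A minimal sketch of how I call it, assuming TreeTagger and its French parameter file are installed locally (the TAGDIR path below is specific to my setup):

    import treetaggerwrapper

    # TAGDIR points at the local TreeTagger installation directory
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr', TAGDIR='/opt/treetagger')
    tags = tagger.tag_text(u"L'homme qu'elle aime est arrivé.")
    for line in tags:
        print(line)  # each line is word<TAB>pos<TAB>lemma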

As of version 3.1.0 (January 2012), the Stanford PoS tagger supports French.

It should be possible to use this French tagger in NLTK, via Nitin Madnani's interface to the Stanford POS tagger.

I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.
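
Untested, but I would expect the call to look roughly like this (the class name has varied across NLTK versions, and the model and jar paths below are assumptions about where the Stanford download lives):

    from nltk.tag.stanford import StanfordPOSTagger

    # Both paths are assumptions; point them at your Stanford tagger download
    st = StanfordPOSTagger('models/french.tagger',
                           path_to_jar='stanford-postagger.jar',
                           encoding='utf8')
    print(st.tag("L'homme qu'elle aime est arrivé .".split()))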

Here are some suggestions:

  1. WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.

  2. Make sure you've resolved text encoding issues (Unicode or latin1), otherwise the tokenization will still go wrong.

  3. NLTK only comes with an English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.

  4. Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get the best results if the genre of the training text matches your domain (articles). Also, you can use nltk-trainer, but you could also use the NLTK features directly.

  5. You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you must start with XMLCorpusReader and subclass it to provide a tagged_sents() method (see the sketch after this list).

  6. If you're not already on the nltk-users mailing list, I think you'll want to get on it.
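
For point 5, a sketch of the subclassing approach. The element and attribute names (SENT, w, cat) are guesses at the French Treebank XML format, so check them against the actual files:

    from nltk.corpus.reader import XMLCorpusReader

    class FTBReader(XMLCorpusReader):
        # Sketch only: SENT/w/cat are assumed names for the FTB XML elements
        def tagged_sents(self, fileids=None):
            sents = []
            for fileid in fileids or self.fileids():
                tree = self.xml(fileid)
                for sent in tree.findall('.//SENT'):
                    sents.append([(w.text, w.get('cat'))
                                  for w in sent.findall('.//w')])
            return sents

    # Usage sketch: reader = FTBReader('/path/to/ftb', r'.*\.xml')
    # nltk-trainer (and NLTK's tagger trainers) can then consume reader.tagged_sents()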
