简体   繁体   中英

Building your own text corpus

It may sounds stupid, but do you know how to build text corpus? I have searched everywhere and there is already existing corpus, but I wonder how did they build it? For example, if I want to build corpus with positive and negative tweets, then I have to just make two files? But what about inner of those files? Dont get it(((( in this example he stores pos and neg tweets in RedisDB.

But what about inner of those files?

This depends mostly on what library you're using. XML (with a variety of tags) is common, as is one sentence per line. The tricky part is getting the data in the first place.

For example, if I want to build corpus with positive and negative tweets

Does this mean that you want to know how to mark the tweets as positive and negative? If so, what you're looking for is called text classification or semantic analysis.

If you want to find a bunch of tweets, I'd check one of these pages (just from a quick search of my own).

Clickonf5: http://clickonf5.org/5438/download-tweets-pdf-xml-format-local-machine-server/

Quora: http://quora.com/What-is-the-best-tool-to-download-and-archive-Twitter-data-of-certain-hashtags-and-mentions-for-academic-research

Google Groups: http://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/kfislDfxunI

For general learning about how to create a corpus, I would check out the Handbook of Natural Language Processing Wiki by Richard Xiao.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM