
Are there any opportunities to tokenize hashtags into multi-word tokens?

I am currently analyzing Instagram postings, which often have hashtags containing more than one word (e.g. #pictureoftheday).

However, tokenizing them with the R package tidytext results in only one token. Instead, I would like to get more than one token, like "picture" "of" "the" "day". Unfortunately, I have not found a package capable of doing so. Do you know any R package allowing this approach?

Thanks in advance!

As far as I know, you can't split joined words without knowing that they are, in fact, separate words. If the hashtags were split by a delimiter it would be easy; without one it becomes very complex, and you need a language-dependent dictionary.

You probably have to process your data separately yourself.
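To illustrate what such dictionary-based processing looks like, here is a minimal sketch in Python: a dynamic-programming segmenter over a tiny hand-made word list (the `WORDS` set is a hypothetical stand-in for a real frequency dictionary, which is what libraries like ekphrasis use internally).

```python
# Dictionary-based hashtag segmentation, sketched with dynamic programming.
# WORDS is a toy stand-in; a real segmenter uses a large frequency dictionary.
WORDS = {"picture", "of", "the", "day", "small", "and", "insignificant"}

def segment(text, words=WORDS):
    """Return one split of `text` into dictionary words, or None if impossible."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] is a segmentation of text[:i]
    best[0] = []             # empty prefix: empty segmentation
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in words:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

print(segment("pictureoftheday"))  # ['picture', 'of', 'the', 'day']
```

A production segmenter additionally weights candidate splits by word frequency so that, out of several possible segmentations, the most probable one wins.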

Try this Python repo: ekphrasis


    from ekphrasis.classes.segmenter import Segmenter

    # ekphrasis ships pre-computed word statistics for the "english" and
    # "twitter" corpora; pick the one that matches your data.
    seg = Segmenter(corpus="english")
    print(seg.segment("smallandinsignificant"))

Output:


    > small and insignificant

