
Are there any opportunities to tokenize hashtags into multi-word tokens?

I am currently analyzing Instagram posts, which often contain hashtags made up of more than one word (e.g. #pictureoftheday).

However, tokenizing them with the R package tidytext yields only a single token. Instead, I would like to get several tokens, such as "picture" "of" "the" "day". Unfortunately, I have not found a package capable of doing this. Do you know of an R package that allows this approach?

Thanks in advance!

As far as I know, you can't split joined words without knowing that they really are separate words. If the hashtags were separated by a delimiter, splitting would be easy; without one, the problem becomes much harder, and you need a language-dependent dictionary.

You will probably have to process your data yourself in a separate step.
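To illustrate the dictionary-based approach, here is a minimal sketch of word segmentation via dynamic programming. The tiny `WORDS` set is a stand-in for a real language-dependent dictionary (an assumption for illustration, not a real package):

```python
# Minimal dictionary-based hashtag segmenter (dynamic programming).
# WORDS is a toy stand-in for a real language-dependent dictionary.
WORDS = {"picture", "of", "the", "day", "small", "and", "insignificant"}

def segment(text):
    """Return a list of dictionary words covering `text`, or None if impossible."""
    n = len(text)
    # best[i] holds one segmentation of text[:i], or None if none exists
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in WORDS:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

print(segment("pictureoftheday"))  # ['picture', 'of', 'the', 'day']
```

Note that with a real dictionary many strings are ambiguous (several valid splits exist); production segmenters rank candidate splits by word-frequency statistics rather than returning the first match.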

If you are willing to step outside R, try the Python package ekphrasis:


    from ekphrasis.classes.segmenter import Segmenter

    # Segmenter uses word statistics from a bundled corpus ("english" or "twitter")
    seg = Segmenter(corpus="english")
    print(seg.segment("smallandinsignificant"))

output:


    > small and insignificant

