
Are there any opportunities to tokenize hashtags into multi-word tokens?

I am currently analyzing Instagram posts, which often contain hashtags made up of more than one word (e.g. #pictureoftheday).

However, tokenizing them with the R package tidytext yields only a single token. Instead, I would like to get several tokens, such as "picture" "of" "the" "day". Unfortunately, I have not found a package capable of doing this. Do you know of an R package that allows this approach?

Thanks in advance!

As far as I know, you can't split joined words without knowing that they really are separate words. If the hashtags were separated by a delimiter, splitting would be easy; without one, the problem becomes much harder, and you need a language-dependent dictionary.

You will probably have to process your data yourself in a separate step.
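To illustrate the dictionary-based approach, here is a minimal sketch of word segmentation via dynamic programming. The tiny `WORDS` set is a stand-in for a real language-dependent dictionary (an assumption for illustration, not a real package):

```python
# Minimal dictionary-based hashtag segmenter (dynamic programming).
# WORDS is a toy stand-in for a real language-dependent dictionary.
WORDS = {"picture", "of", "the", "day", "small", "and", "insignificant"}

def segment(text):
    """Return a list of dictionary words covering `text`, or None if impossible."""
    n = len(text)
    # best[i] holds one segmentation of text[:i], or None if none exists
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in WORDS:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

print(segment("pictureoftheday"))  # ['picture', 'of', 'the', 'day']
```

Note that with a real dictionary many strings are ambiguous (several valid splits exist); production segmenters rank candidate splits by word-frequency statistics rather than returning the first match.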

If you are willing to step outside R, try the Python package ekphrasis:


    from ekphrasis.classes.segmenter import Segmenter

    # Segmenter uses word statistics from a bundled corpus ("english" or "twitter")
    seg = Segmenter(corpus="english")
    print(seg.segment("smallandinsignificant"))

output:


    > small and insignificant

