I am currently analyzing Instagram posts, which often have hashtags made up of more than one word (e.g. #pictureoftheday).
However, tokenizing them with the R package tidytext results in only one token per hashtag. Instead, I would like to get several tokens, such as "picture" "of" "the" "day". Unfortunately, I have not found a package capable of doing so. Do you know of any R package that supports this?
Thanks in advance!
As far as I know, you can't split joined words without knowing that they are, in fact, words. If the hashtags were split by a delimiter, it would be easy. Without one, it becomes very complex: you need a language-dependent dictionary.
You probably have to process your data separately yourself.
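To illustrate what processing the data yourself with a dictionary could look like, here is a minimal Python sketch of dictionary-based segmentation using dynamic programming. The function name `segment` and the toy word list `VOCABULARY` are hypothetical and chosen only for this example. A real application would need a much larger word list, and probably word frequencies to pick between competing splits.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no complete split exists.
    """
    text = text.lstrip("#").lower()  # drop a leading hashtag marker
    n = len(text)
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest word list covering text[:i], or None if unreachable
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy word list (an assumption for illustration, not a real dictionary)
VOCABULARY = {"picture", "of", "the", "day", "pic", "a"}

print(segment("#pictureoftheday", VOCABULARY))
# ['picture', 'of', 'the', 'day']
```

Preferring the fewest words is only a simple heuristic. It returns None when a hashtag contains anything outside the word list, which is exactly why the quality of the dictionary matters so much.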
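Try the Python library ekphrasis, which includes a word segmenter:
from ekphrasis.classes.segmenter import Segmenter
# "mycorpus" is a placeholder for the corpus whose word statistics are used;
# as far as I know, ekphrasis ships statistics for "english" and "twitter"
seg = Segmenter(corpus="mycorpus")
print(seg.segment("smallandinsignificant"))
Output:
> small and insignificant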