Guys, I have a string that I am trying to build n-grams from, but I have a problem. When I do ngram = ngrams(raw_text.split(" "), n=1)
the output is
[('come',), ('here,',), ('girl\noh,',), ('you',)....]
The problem is that in my string the words are arranged like:
come here, girl\noh, you want...
which means that my set of n-grams is much bigger than it needs to be. What should I do to get a string like
come here , girl \n oh , you ...
so that my set of n-grams is an order of magnitude smaller? Thanks guys, I hope you're all having a good day.
EDIT: I am now aware that I was passing a delimiter to split() and have changed that, so the \n problem is gone. But can I split the words within a string that have punctuation in them?
Can I split the words within a string that have punctuation in them?
Your final result is still not clear: do you want to include punctuation or just discard it entirely? Assuming that you don't need the punctuation, this is trivial using re.split():
>>> import re
>>> re.split(r'\W+', "Hello, this'll split by\n \nwhitespace and also punctuation!")
['Hello', 'this', 'll', 'split', 'by', 'whitespace', 'and', 'also', 'punctuation', '']
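If you would rather keep the punctuation as separate tokens, closer to the "come here , girl" form in the question, one option is re.findall instead of re.split. This is only a minimal sketch: the pattern and the helper name tokenize_keep_punct are my own assumptions, not something from the question.

```python
import re

def tokenize_keep_punct(text):
    # Hypothetical helper (name chosen for illustration only).
    # \w+     : a run of word characters (letters, digits, underscore)
    # [^\w\s] : a single character that is neither a word character nor whitespace
    # Whitespace, including "\n", matches neither branch and is skipped.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize_keep_punct("come here, girl\noh, you want")
print(tokens)
# ['come', 'here', ',', 'girl', 'oh', ',', 'you', 'want']
```

Unlike the re.split call above, this never produces empty strings, so there is no trailing '' to filter out. It does split contractions apart at the apostrophe, which may or may not be what you want.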
If you want to split in a smarter way, this can quickly become complicated. I recommend using the nltk toolkit, which provides, among other options, nltk.word_tokenize:
>>> import nltk
>>> nltk.word_tokenize("Hello, this'll split by\n \nwhitespace and also punctuation!")
['Hello', ',', 'this', "'ll", 'split', 'by', 'whitespace', 'and', 'also', 'punctuation', '!']
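Once the text is tokenized, building the n-grams themselves is the easy part. Here is a small pure-Python sketch of the basic sliding-window idea (no padding), so you can see what goes in and what comes out. make_ngrams is a hypothetical helper written for illustration, and the token list is typed in by hand rather than produced by a tokenizer.

```python
def make_ngrams(tokens, n):
    # Hypothetical helper: slide a window of width n over the tokens.
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

# Hand-written token list, assumed to come from a tokenizer that
# separates punctuation from words.
tokens = ['come', 'here', ',', 'girl', 'oh', ',', 'you']

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)

print(unigrams[:3])  # [('come',), ('here',), (',',)]
print(bigrams[:3])   # [('come', 'here'), ('here', ','), (',', 'girl')]
```

Because 'here' and ',' are now separate tokens, 'here' and 'here,' no longer count as two different unigrams, which is what shrinks the vocabulary.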