
Splitting words in a string based on '\n'

Guys, I have a string that I am trying to build n-grams from, but I have a problem: when I do ngram = ngrams(raw_text.split(" "), n=1) the output is

[('come',), ('here,',), ('girl\noh,',), ('you',)....]

The problem is that in my string the words are arranged like:

come here, girl\noh, you want...

which means that my ngram is much bigger than it needs to be. What should I do to get a string like

come here , girl \n oh , you ... 

so that my ngram is an order of magnitude smaller? Thanks guys, I hope you're all having a good day.

EDIT: I am now aware that I'm using a delimiter and have changed that, so the \\n problem is gone. But can I split the words within a string that have punctuation in them?

Can I split the words within a string that have punctuation in them?

Your final result is still not clear: do you want to include punctuation or just discard it entirely? Assuming that you don't need the punctuation, this is trivial using re.split():

>>> import re
>>> re.split(r'\W+', "Hello, this'll split by\n \nwhitespace and also punctuation!")
['Hello', 'this', 'll', 'split', 'by', 'whitespace', 'and', 'also', 'punctuation', '']
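Note the trailing '' in that result: the string ends with a delimiter, so re.split() returns an empty last piece. If the punctuation should instead be kept as separate tokens, as the desired output in the question suggests, a regex-only variant is possible. A minimal sketch using only the standard-library re module; the helper names split_words and tokenize_keep_punct are made up for illustration, and newlines are dropped here rather than kept as tokens:

```python
import re

def split_words(text):
    # Split on runs of non-word characters and drop empty pieces,
    # such as the trailing '' produced when the text ends with punctuation.
    return [piece for piece in re.split(r"\W+", text) if piece]

def tokenize_keep_punct(text):
    # \w+ matches a run of word characters;
    # [^\w\s] matches a single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "come here, girl\noh, you want!"
words = split_words(sample)
tokens = tokenize_keep_punct(sample)
print(words)
print(tokens)
```

Contractions such as this'll are still split at the apostrophe with either helper, which is one reason a dedicated tokenizer can be the better choice.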

If you want to split in a smarter way, this can quickly become complicated. I recommend NLTK, which provides, among other options, nltk.word_tokenize:

>>> import nltk
>>> nltk.word_tokenize("Hello, this'll split by\n \nwhitespace and also punctuation!")
['Hello', ',', 'this', "'ll", 'split', 'by', 'whitespace', 'and', 'also', 'punctuation', '!']
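To tie this back to the question: once the text is a clean token list, the n-grams can be built from that list instead of from raw_text.split(" "). A rough sketch in plain Python, where make_ngrams is a hypothetical stand-in for whatever ngrams function the question uses (possibly nltk.util.ngrams, though the question does not say), and the token list is written out by hand for illustration:

```python
def make_ngrams(tokens, n):
    # Zip n staggered copies of the token list so each tuple
    # holds n consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

# Hand-written token list, in the shape a tokenizer might return.
tokens = ['come', 'here', ',', 'girl', 'oh', ',', 'you']

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

With punctuation separated out, 'here,' and 'here' no longer count as different words, so the set of distinct n-grams shrinks.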


 