
Sentiment analysis Python tokenization

My problem is the following: I want to do a sentiment analysis on Italian tweets, and I would like to tokenise and lemmatise the Italian text in order to find new analysis dimensions for my thesis. The problem is that I would like to tokenise the hashtags, splitting the compound ones as well. For example, given #nogreenpass, I would like to have its words without the # symbol, because the sentiment of the phrase is better understood with all the words of the text. How can I do this? I tried with spaCy, but got no results. I created a function to clean my text, but I can't get the hashtags the way I want. I'm using this code:

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('it_core_news_lg')

# Clean_text function
def clean_text(text):
    text = str(text).lower()
    # Replace each hashtag with its spaCy tokens, joined by spaces
    text = re.sub(r'#[a-z0-9]+',
                  lambda m: ' '.join(t.text for t in nlp(m.group())),
                  text)
    text = re.sub(r'\n', ' ', text) # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', text) # Remove and replace @mentions
    text = re.sub(r'RT[\s]+', '', text) # Remove RT
    text = re.sub(r'https?:\/\/\S+', '<url>', text) # Remove and replace links
    return text

For example, here I don't know how to add the first < and the last > in place of the # symbol, and the tokenisation process doesn't work as I would like. Thank you for the time spent on me and for the patience. I hope to become stronger in Jupyter analysis and Python coding so that I can also help with your problems. Thank you guys!

You can tweak your current clean_text with

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'#(\w+)', r'<\1>', text)
    text = re.sub(r'\n', ' ', text) # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', text) # Remove and replace @mention
    text = re.sub(r'RT\s+', '', text) # Remove RT
    text = re.sub(r'https?://\S+\b/?', '<url>', text) # Remove and replace links
    return text

See the Python demo online.

The following line of code:

print(clean_text("@Marcorossi hanno ragione I #novax http://www.asfag.com/"))

will yield

<user> hanno ragione i <novax> <url>
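
Since the end goal is lemmatisation, here is a minimal sketch of feeding the cleaned text into spaCy afterwards. It reuses the clean_text function above and assumes the it_core_news_lg model from the question is installed:

import spacy

nlp = spacy.load('it_core_news_lg')

# Lemmatise the cleaned tweet; note that placeholders like <user> and <url>
# are split into <, user, > by the default tokenizer
doc = nlp(clean_text("@Marcorossi hanno ragione I #novax http://www.asfag.com/"))
print([(token.text, token.lemma_) for token in doc])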

Note there is no easy way to split a glued string into its constituent words. See How to split text without spaces into list of words for ideas on how to do that. A sketch of one such approach follows.
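
For completeness, here is a minimal sketch of such a split, using dynamic programming over a word list. The mini-vocabulary below is hypothetical; in practice you would load a real Italian word list:

def split_glued(text, vocab):
    # best[i] holds a segmentation of text[:i] into vocabulary words, or None
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                best[i] = best[j] + [text[j:i]]
                break
    return best[len(text)]

vocab = {'no', 'green', 'pass', 'vax'}  # hypothetical mini word list
print(split_glued('nogreenpass', vocab)) # ['no', 'green', 'pass']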
