How to preprocess Twitter text data using Python

Question

I have text data, retrieved from MongoDB, in this format:

[u'In', u'love', u'#Paralympics?\U0001f60d', u"We've", u'got', u'nine', u'different', u'sports', u'live', u'streams', u'https://not_a_real_link', u't_https://anotherLink']

[u't_https://somelink']

[u'RT', u'@sportvibz:', u'African', u'medal', u'table', u'#Paralympics', u't_https://somelink', u't_https://someLink']

However, I would like to replace every URL in each list with the word 'URL' while preserving the other items in the list, i.e. to get something like this:

[u'In', u'love', u'#Paralympics?\U0001f60d', u"We've", u'got', u'nine', u'different', u'sports', u'live', u'streams', u'URL', u'URL']

But when I run my code for stopword removal and the regular-expression substitution, I get only one item per list, as in this sample:

In

URL

RT

Could anyone please help with this? I'm finding it difficult.

Here is the code I have at the moment:

def stopwordsRemover(self, rawText):
    stop = stopwords.words('english')
    ##remove stop words from the rawText argument and store the result list in processedText variable
    processedText = [i for i in rawText.split() if i not in stop]
    return processedText


def clean_text(self, rawText):
    temp_raw = rawText
    for i, text in enumerate(temp_raw):
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp

Answer 1

This is wrong:

def clean_text(self, rawText):
    temp_raw = rawText
    for i, text in enumerate(temp_raw):
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp

You return only the last substituted string instead of a list, whereas the result should be a list that replaces your rawText input. (I must admit I'm puzzled by the fact that you seem to get the first item, but I'm still confident in the explanation.)
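To see the problem in isolation, here is a minimal sketch of the same loop written as a plain function (no `self`) in Python 3. The name `clean_text_original` and the sample tokens, including the `t.co` link, are made up for illustration and are not from the question's data.

```python
import re

def clean_text_original(raw_text):
    # Same logic as the question's method: temp is overwritten on every pass.
    for i, text in enumerate(raw_text):
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    # Only the value from the final iteration is still in temp here.
    return temp

# Hypothetical sample tokens (not taken from the question's data).
tokens = ['RT', '@sportvibz:', 'https://t.co/abc123']
result = clean_text_original(tokens)
print(type(result).__name__, repr(result))  # str 'URL'
```

The function receives a list of three tokens but hands back a single string, so everything except the last token is lost.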

Do this instead:

def clean_text(self, rawText):
    temp = list()
    for text in rawText:
        temp.append(re.sub(r'https?:\/\/.*\/\w*', 'URL', text))  # simpler regex with \w
    return temp

Or, with a list comprehension:

def clean_text(self, rawText):
    return [re.sub(r'https?:\/\/.*\/\w*', 'URL', text) for text in rawText]

You could also work in place, modifying rawText directly (this version returns nothing; the caller's list is changed instead):

def clean_text(self, rawText):
    rawText[:] = [re.sub(r'https?:\/\/.*\/\w*', 'URL', text) for text in rawText]
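One caveat about the pattern itself: it requires a slash after the host name, so a token such as 'https://not_a_real_link' from the question is left unchanged, and because re.sub only replaces the matched part, a prefix like 't_' in front of a matching URL would survive. If the goal is to turn any token that contains a URL into exactly 'URL', as in the desired output, a whole-token replacement may be closer to what is wanted. The following is only a sketch in Python 3; the names `URL_PATTERN` and `replace_url_tokens`, and the rule "a token is a URL if it contains http:// or https://", are assumptions, not part of the original code.

```python
import re

# Assumption: any token containing "http://" or "https://" counts as a URL.
URL_PATTERN = re.compile(r'https?://')

def replace_url_tokens(tokens, placeholder='URL'):
    """Return a new list where each URL-bearing token becomes the placeholder."""
    return [placeholder if URL_PATTERN.search(tok) else tok for tok in tokens]

# Tokens taken from the end of the first list in the question.
sample = ['live', 'streams', 'https://not_a_real_link', 't_https://anotherLink']
cleaned = replace_url_tokens(sample)
print(cleaned)  # ['live', 'streams', 'URL', 'URL']
```

Like the list-comprehension version, this returns a new list and leaves the input untouched; assign the result back (or use slice assignment) if the original list should be updated.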

solution1
0 ACCEPTED 2016-10-07 14:58:05