How to preprocess Twitter text data using Python

Question

After retrieving it from MongoDB, I have text data in the following format:

**

[u'In', u'love', u'#Paralympics?\U0001f60d', u"We've", u'got', u'nine', u'different', u'sports', u'live', u'streams', u'https://not_a_real_link', u't_https://anotherLink']

[u't_https://somelink']

[u'RT', u'@sportvibz:', u'African', u'medal', u'table', u'#Paralympics', u't_https://somelink', u't_https://someLink']

**

However, I want to replace every URL in the list with the word "URL" while keeping the other text in the list, for example:

[u'In', u'love', u'#Paralympics?\U0001f60d', u"We've", u'got', u'nine', u'different', u'sports', u'live', u'streams', u'URL', u'URL']

However, when I run my code to remove stop words and apply the regular expression, I get the following sample of results:

**

In

URL

RT

**

Any help would be appreciated, as I am finding this difficult.

Here is my code so far:

def stopwordsRemover(self, rawText):
    stop = stopwords.words('english')
    # Remove stop words from the rawText argument and store the resulting list in processedText
    processedText = [i for i in rawText.split() if i not in stop]
    return processedText


def clean_text(self, rawText):
    temp_raw = rawText
    for i, text in enumerate(temp_raw):
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp

Answer 1

This is wrong:

def clean_text(self, rawText):
    temp_raw = rawText
    for i, text in enumerate(temp_raw):
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp

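You return the last substituted string rather than a list; the function should return a list that replaces your rawText input list. (I must admit I'm puzzled that you seem to get the first item, but I'm still confident in the explanation.)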
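To see the effect concretely, here is a minimal standalone sketch of the same loop. The function name `buggy_clean_text` and the `example.com` token are made up for illustration, and the method is written as a plain function so it runs on its own. Because `temp` is overwritten on every iteration, only the result for the last token survives:

```python
import re

def buggy_clean_text(raw_text):
    # Same logic as the method above, without the class (illustrative only).
    for i, text in enumerate(raw_text):
        # temp is reassigned on each pass, so earlier results are discarded.
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp

# Hypothetical tokens; the URL is a placeholder, not taken from the question.
tokens = ['In', 'love', 'https://example.com/abc']
result = buggy_clean_text(tokens)
print(type(result).__name__, repr(result))
```

The return value is a single string built from the last token, not a list of all tokens.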

Change it to:

def clean_text(self, rawText):
    temp = list()
    for text in rawText:
        temp.append(re.sub(r'https?:\/\/.*\/\w*', 'URL', text))  # simpler regex with \w
    return temp

Or, using a list comprehension:

def clean_text(self, rawText):
    return [re.sub(r'https?:\/\/.*\/\w*', 'URL', text) for text in rawText]
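One caveat about the pattern itself: it appears to require a further slash after `://`, so a token such as u'https://not_a_real_link' from the question's sample would be left unchanged, and a prefixed token such as u't_https://...' would at best become u't_URL'. The sketch below is one possible alternative, not part of the original answer. It assumes that any token containing `http://` or `https://` should be replaced as a whole; `URL_PATTERN` and `replace_urls` are hypothetical names.

```python
import re

# Assumption: a token counts as a URL if it contains an http(s) scheme anywhere.
URL_PATTERN = re.compile(r'https?://')

def replace_urls(tokens):
    """Return a new list with every URL-like token replaced by 'URL'."""
    return ['URL' if URL_PATTERN.search(tok) else tok for tok in tokens]

# Sample tokens copied from the question.
tweet = [u'RT', u'@sportvibz:', u'African', u'medal', u'table',
         u'#Paralympics', u't_https://somelink', u't_https://someLink']
cleaned = replace_urls(tweet)
print(cleaned)
```

Testing each token and swapping it out entirely avoids having to describe the full shape of a URL in the regex, at the cost of also replacing any token that merely embeds a scheme.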

You can also work in place and modify rawText directly:

def clean_text(self, rawText):
    rawText[:] = [re.sub(r'https?:\/\/.*\/\w*', 'URL', text) for text in rawText]
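A short sketch of what "in place" means here, written as a plain function with a hypothetical name (`clean_text_in_place`) and a placeholder URL. Slice assignment replaces the contents of the existing list object, so every reference to that list sees the change, and the function itself returns `None`:

```python
import re

def clean_text_in_place(raw_text):
    # raw_text[:] = ... mutates the caller's list instead of rebinding a local name.
    raw_text[:] = [re.sub(r'https?://.*/\w*', 'URL', text) for text in raw_text]

tokens = ['RT', 'https://example.com/abc']  # hypothetical tokens
alias = tokens                              # second reference to the same list
result = clean_text_in_place(tokens)

print(result)  # None: there is no return statement
print(alias)   # the alias reflects the replacement as well
```

With this variant the caller should not write `tokens = clean_text_in_place(tokens)`, since that would rebind `tokens` to `None`.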

Answer 1
0 Accepted 2016-10-07 14:58:05
