
How to preprocess Twitter text data using Python

After retrieving text data from MongoDB, I have it in this format:


[u'In', u'love', u'#Paralympics?\U0001f60d', u"We've", u'got', u'nine', u'different', u'sports', u'live', u'streams', u'https://not_a_real_link', u't_https://anotherLink']

[u't_https://somelink']

[u'RT', u'@sportvibz:', u'African', u'medal', u'table', u'#Paralympics', u't_https://somelink', u't_https://someLink']


However, I would like to replace every URL in each list with the word 'URL' while preserving the other tokens, i.e. to get something like this:

[u'In', u'love', u'#Paralympics?\U0001f60d', u"We've", u'got', u'nine', u'different', u'sports', u'live', u'streams', u'URL', u'URL']

But when I run the code for stopword removal and then apply the regular expression, I get output like this:


In

URL

RT


Could anyone please help with this? I'm finding it difficult.

Here is the code I have at the moment:

def stopwordsRemover(self, rawText):
    stop = stopwords.words('english')
    ##remove stop words from the rawText argument and store the result list in processedText variable
    processedText = [i for i in rawText.split() if i not in stop]
    return processedText


def clean_text(self, rawText):
    temp_raw = rawText
    for i, text in enumerate(temp_raw):
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp

This is wrong:

def clean_text(self, rawText):
    temp_raw = rawText
    for i, text in enumerate(temp_raw):
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp

You return only the last substituted string instead of a list; the result should be a list that replaces your rawText input. (I must admit I'm puzzled by the fact that you seem to get the first item, but I'm still confident in the explanation.)
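To see the effect in isolation, here is a minimal sketch written as a standalone function rather than a method. The name clean_text_buggy and the sample link are made up for illustration; the loop and the pattern are the ones from the question.

```python
import re

def clean_text_buggy(raw_text):
    # Same loop as in the question: temp is overwritten on every iteration.
    for text in raw_text:
        temp = re.sub(r'https?:\/\/.*\/[a-zA-Z0-9]*', 'URL', text)
    return temp  # only the result for the last token survives

# Hypothetical tokens; the link is a placeholder, not a real address.
tokens = ['African', 'medal', 'table', 'https://t.co/abc123']
result = clean_text_buggy(tokens)
print(type(result).__name__, result)  # prints: str URL
```

The function hands back a single string, so every token before the last one is lost.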

Do this instead:

def clean_text(self, rawText):
    temp = []
    for text in rawText:
        # \w replaces [a-zA-Z0-9] (note that it also matches underscores)
        temp.append(re.sub(r'https?:\/\/.*\/\w*', 'URL', text))
    return temp

Or with a list comprehension:

def clean_text(self, rawText):
    return [re.sub(r'https?:\/\/.*\/\w*', 'URL', text) for text in rawText]

You could also work in place, modifying rawText directly:

def clean_text(self, rawText):
    rawText[:] = [re.sub(r'https?:\/\/.*\/\w*', 'URL', text) for text in rawText]
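One caveat: the pattern used above needs a slash after the host, so a token such as 'https://not_a_real_link' from the question's sample is left unchanged, and a prefix such as 't_' would stay in front of the substituted 'URL'. If each list element is a single whitespace-free token, a possible alternative is to replace any token that contains http:// or https:// as a whole. The sketch below rests on that assumption; replace_urls and URL_TOKEN are hypothetical names, not part of the original code.

```python
import re

# Assumption: a token counts as a URL if it contains "http://" or "https://".
URL_TOKEN = re.compile(r'https?://')

def replace_urls(tokens, placeholder='URL'):
    """Return a new list in which every URL-bearing token is replaced."""
    return [placeholder if URL_TOKEN.search(token) else token
            for token in tokens]

# Sample tokens modeled on the lists in the question.
tweets = [
    ['In', 'love', '#Paralympics?', 'https://not_a_real_link', 't_https://anotherLink'],
    ['t_https://somelink'],
    ['RT', '@sportvibz:', 'African', 'medal', 'table', '#Paralympics', 't_https://somelink'],
]

cleaned = [replace_urls(tweet) for tweet in tweets]
for tweet in cleaned:
    print(tweet)
```

Because the whole token is swapped out, this would also discard any text glued to a URL without a space, which may or may not be what you want for real tweets.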
