Removing unwanted words (characters) from tweets in a CSV file in Python

I have a CSV file with 60,000+ tweets. I have cleaned the file to a certain extent, but it still has words (mixed characters, probably left over after URL cleaning) that do not make any sense. I am not allowed to post any images, so I am posting a portion of the file:

Fintech Bitcoin crowdfunding and cybersecurity fintech bitcoin crowdfunding and cybersecurity
monster has left earned total satoshi monstercoingame Bitcoin
Bitcoin TCH bitcoin btch
bitcoin iticoin SPPL BXsAJ
coindesk The latest Bitcoin Price Index USD pic twitter com aKk
Trends For Bitcoin Regulation ZKdFZS via CoinDeskpic twitter com KNKgFcdxYD
Now there Mike Tyson Bitcoin app theres mike tyson bitcoin app
BitcoinBet Positive and negative proofs blockchain audits Bitcoin Bitcoin via
The latest Bitcoin Price Index USD pic twitter com CivXlPj
Bitcoin price index pic twitter com xhQQ mbRIb

As you can see, some characters (for example, aKk, KNKgFcdxYD, xhQQ) don't make any sense, so I want to remove them. The cleaned tweets are stored in a column named [clean_tweet].

I have sort of stitched together the following code for the whole cleaning process (from the raw tweets to the current version that I posted above), but I don't know how to remove those "characters". My code is as follows. Any suggestions would be appreciated. Thank you.

import re
import pandas as pd 
import numpy as np 
import string
import nltk
from nltk.stem.porter import *
import warnings 
from datetime import datetime as dt

warnings.filterwarnings("ignore", category=DeprecationWarning)

tweets = pd.read_csv(r'myfilepath.csv')
df = pd.DataFrame(tweets, columns = ['date','text'])

df['date'] = pd.to_datetime(df['date']).dt.date  # convert to datetime and keep only the date part

#removing pattern from tweets

def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for i in r:
        # escape the matched text so it is removed literally, even if it contains regex metacharacters
        input_txt = re.sub(re.escape(i), '', input_txt)
    return input_txt

# remove twitter handles (@user)
tweets['clean_tweet'] = np.vectorize(remove_pattern)(tweets['text'], r"@[\w]*")
# remove urls (apply to 'clean_tweet', not 'text', so the handle removal above is not overwritten)
tweets['clean_tweet'] = np.vectorize(remove_pattern)(tweets['clean_tweet'], r"https?://[A-Za-z./]*")

## remove special characters, numbers, punctuations
tweets['clean_tweet'] = tweets['clean_tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)

# drop very short tokens (length <= 2)
tweets['clean_tweet'] = tweets['clean_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

It might be easier to qualify the characters you want instead of the universe of unwanted ones. Negative matching with regex?

    if re.match(r"""[A-Za-z0-9@#$%^&*()!+='";:? -]""", char) is None:
        text = text.replace(char, '')

Clean something like that regex up a little bit for what you're looking for and just loop through the characters of each string. And then, thank God for computers to do all the tedious work for you!
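For illustration, a minimal sketch of that idea applied to the question's data could look like this (the tweets DataFrame and clean_tweet column are assumed from the question; keep_allowed_chars and the exact whitelist are only placeholders to adapt):

import re

# hypothetical whitelist: letters, digits, a few symbols, and spaces
ALLOWED = re.compile(r"""[A-Za-z0-9@#$%^&*()!+='";:? -]""")

def keep_allowed_chars(text):
    # loop through the characters of the string and keep only whitelisted ones
    return ''.join(ch for ch in text if ALLOWED.match(ch))

tweets['clean_tweet'] = tweets['clean_tweet'].fillna('').apply(keep_allowed_chars)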

Following up on my comments, I guess your task will become easier if you use a spell-checker library to see whether the words are valid English or not.

Something like this (using enchant, for example):

import enchant
from pprint import pprint

en_us = enchant.Dict("en_US")
text = '''
Fintech Bitcoin crowdfunding and cybersecurity fintech bitcoin crowdfunding and cybersecurity
monster has left earned total satoshi monstercoingame Bitcoin
Bitcoin TCH bitcoin btch
bitcoin iticoin SPPL BXsAJ
coindesk The latest Bitcoin Price Index USD pic twitter com aKk
Trends For Bitcoin Regulation ZKdFZS via CoinDeskpic twitter com KNKgFcdxYD
Now there Mike Tyson Bitcoin app theres mike tyson bitcoin app
BitcoinBet Positive and negative proofs blockchain audits Bitcoin Bitcoin via
The latest Bitcoin Price Index USD pic twitter com CivXlPj
Bitcoin price index pic twitter com xhQQ mbRIb
'''
phrases = text.split('\n')
print('BEFORE')
pprint(phrases)

for i, phrase in enumerate(phrases):
    phrases[i] = ' '.join(w for w in phrase.split() if en_us.check(w))

print('AFTER')
pprint(phrases)

The code above will result in something like:

BEFORE
['',
 'Fintech Bitcoin crowdfunding and cybersecurity fintech bitcoin crowdfunding '
 'and cybersecurity',
 'monster has left earned total satoshi monstercoingame Bitcoin',
 'Bitcoin TCH bitcoin btch',
 'bitcoin iticoin SPPL BXsAJ',
 'coindesk The latest Bitcoin Price Index USD pic twitter com aKk',
 'Trends For Bitcoin Regulation ZKdFZS via CoinDeskpic twitter com KNKgFcdxYD',
 'Now there Mike Tyson Bitcoin app theres mike tyson bitcoin app',
 'BitcoinBet Positive and negative proofs blockchain audits Bitcoin Bitcoin '
 'via',
 'The latest Bitcoin Price Index USD pic twitter com CivXlPj',
 'Bitcoin price index pic twitter com xhQQ mbRIb',
 '']
AFTER
['',
 'Bitcoin and bitcoin and',
 'monster has left earned total Bitcoin',
 'Bitcoin bitcoin',
 'bitcoin',
 'The latest Bitcoin Price Index pic twitter com',
 'Trends For Bitcoin Regulation via twitter com',
 'Now there Mike Tyson Bitcoin app mike bitcoin app',
 'Positive and negative proofs audits Bitcoin Bitcoin via',
 'The latest Bitcoin Price Index pic twitter com',
 'Bitcoin price index pic twitter com',
 '']

BUT, as you can see, words like Fintech, crowdfunding, and cybersecurity (to list a few) were marked as NOT valid in English, so you will need to fine-tune the code for your needs.

I hope it helps.

Update: to add word exceptions to your spell checker, do something like this:

exceptions = [
    'Fintech',
    'crowdfunding',
    'cybersecurity',
    'fintech',
    'crowdfunding',
    'cybersecurity',
    'satoshi',
    'monstercoingame',
    'TCH',
    'coindesk',
    'USD',
    'CoinDeskpic',
    'theres',
    'tyson',
    'BitcoinBet',
    'blockchain',
    'USD'
]

for word in exceptions:
    # add word to personal dictionary
    #en_us.add(word)
    # or add word just for this session only
    en_us.add_to_session(word)
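
To apply this to the question's DataFrame, something along these lines should work (a sketch only; the tweets DataFrame and clean_tweet column are assumed from the question, and en_us is the dictionary built above, including the session exceptions):

def keep_known_words(text):
    # keep only the tokens that the spell checker (plus the exceptions) recognises
    return ' '.join(w for w in text.split() if en_us.check(w))

tweets['clean_tweet'] = tweets['clean_tweet'].fillna('').apply(keep_known_words)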

There is a way to do it by using nltk; it will also remove URLs.

The URLs need to be removed first, otherwise the word filter will strip only some of the words inside each URL and make it worse.

import re
import nltk

nltk.download('words')  # only needed once
words = set(nltk.corpus.words.words())

def clean_tweets(text):
    # remove URLs first
    text = re.sub(r'https?://[^\s]+[\s]?', '', text)
    # keep tokens that are valid English words, or that are not purely alphabetic
    return " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words or not w.isalpha())

This will remove the nonsense words. For example:

test = 'this is a  test KNKgFcdxYD to check https://stackoverflow.com/questions/295 xhQQ'
ret = clean_tweets(test)
print(ret)
# output
#this is a test to check
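
Applied to the question's data, this could then be used along these lines (assuming the tweets DataFrame and clean_tweet column from the question, and that nltk.download('words') has been run once):

tweets['clean_tweet'] = tweets['clean_tweet'].fillna('').apply(clean_tweets)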
