
Python Pandas handle special characters in strings

I wrote a function that I want to apply to a DataFrame later.

def get_word_count(text,df):
    #text is a lowercase list of words
    #df is a dataframe with 2 columns: word and count
    #this function updates the word  counts

    stopwords='in the and an - '

    for word in text:
        if word not in stopwords:

            if df['word'].str.contains(word).any():
                df.loc[df['word']==word, 'count']=df['count']+1

    return df

Then I check it:

sentence1='[first] - missing "" in the text [first] word'.split()
y=get_word_count(sentence1, word_df)
sentence2="error: wrong word in the [second]  text".split()
y=get_word_count(sentence2, word_df)

I get the following results:

Word     Count

[first]    2    
missing    1 
""         1
text       2
word       2
error:     1
wrong      1

So where is [second] from sentence2?
If I omit one of the square brackets, I get an error message. How do I handle words with special characters? Note that I don't want to strip them out, because if I do, I will miss "" in sentence1.

The problem comes from the line:

if df['word'].str.contains(word).any():

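This checks whether any entry in the word column contains the given word, and str.contains treats its argument as a regular expression by default. As a regex, [second] is a character class that matches any single one of the characters s, e, c, o, n, d, so the boolean Series returned by df['word'].str.contains(word) is True for [first] (it contains an s) and the check passes whether or not [second] itself is in the column. Omitting one of the brackets leaves the character class unterminated, which is why you get an error.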
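To see why the brackets matter, here is a minimal sketch using only the standard re module. The word list is made up for illustration; it is not the word_df from the question.

```python
import re

# Hypothetical sample words, chosen only for illustration.
words = ['[first]', 'missing', 'text', 'word']

# Unescaped, "[second]" is a character class: any one of s, e, c, o, n, d.
regex_matches = [w for w in words if re.search('[second]', w)]

# re.escape turns the brackets into literal characters.
literal_pattern = re.escape('[second]')
literal_matches = [w for w in words if re.search(literal_pattern, w)]

# Dropping one bracket leaves the character class unterminated.
try:
    re.compile('[second')
    unbalanced_error = None
except re.error as exc:
    unbalanced_error = exc

print(regex_matches)     # every word contains at least one of those letters
print(literal_matches)   # none of the words contains the literal "[second]"
print(unbalanced_error)
```

Every sample word matches the unescaped pattern, none matches the escaped one, and the unbalanced pattern fails to compile.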

For a quick fix, I changed the line to:

if word in df['word'].tolist():
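The same comparison can be sketched in pandas. The small word_df below is an assumption made for illustration, since the question does not show its contents. It compares the default regex match with three ways of asking for a literal or exact match.

```python
import pandas as pd

# Hypothetical stand-in for the question's word_df.
word_df = pd.DataFrame({'word': ['[first]', 'missing', 'text'],
                        'count': [0, 0, 0]})
word = '[second]'

# Default: the pattern is a regex, so the character class matches other words.
regex_hit = bool(word_df['word'].str.contains(word).any())

# regex=False asks for a plain substring test instead.
literal_hit = bool(word_df['word'].str.contains(word, regex=False).any())

# Exact equality, which is what the update line in the question uses.
exact_hit = bool(word_df['word'].eq(word).any())

# The quick fix: membership in the list of values.
list_hit = word in word_df['word'].tolist()

print(regex_hit, literal_hit, exact_hit, list_hit)
```

Note that the .tolist() call matters: `word in df['word']` on a bare Series tests the index labels, not the values.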

Building up a DataFrame in a loop like that is not recommended. Instead, count all the words in one pass, something like this:

import pandas as pd

stopwords='in the and an - '
# sentence1 and sentence2 are the already-split word lists from the question
words = sentence1+sentence2
df = pd.DataFrame({'Words': words})
# count occurrences of each word, then drop the stop words
df = df.groupby(by=['Words'])['Words'].size().reset_index(name='counts')
df = df[~df['Words'].isin(stopwords.split())]

       Words  counts
0         ""       1
2    [first]       2
3   [second]       1
4     error:       1
6    missing       1
7       text       2
9       word       2
10     wrong       1
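As an alternative sketch (not the method used above), the same counts can be obtained by filtering the stop words first and then calling Series.value_counts. The sentences and stop words are the ones from the question; the variable names are chosen here for illustration.

```python
import pandas as pd

sentence1 = '[first] - missing "" in the text [first] word'.split()
sentence2 = 'error: wrong word in the [second]  text'.split()
stopwords = set('in the and an -'.split())

# Filter stop words by exact membership, so no regex is involved.
words = [w for w in sentence1 + sentence2 if w not in stopwords]

# value_counts tallies each distinct word, special characters included.
counts = pd.Series(words).value_counts()
result = counts.to_dict()

print(counts)
```

Because nothing here is interpreted as a pattern, [first], [second] and "" are counted like any other token.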

I have rebuilt it so that you can add sentences and watch the frequencies grow:

from collections import Counter
from collections import defaultdict

import pandas as pd

def terms_frequency(corpus, stop_words=None):
    """Takes in texts and returns a pandas DataFrame of words frequency"""

    corpus_ = corpus.split()

    # remove stop words (given as one whitespace-separated string)
    stop_words = set(stop_words.split()) if stop_words else set()
    terms = [word for word in corpus_ if word not in stop_words]

    terms_freq = pd.DataFrame.from_dict(Counter(terms), orient='index').reset_index()
    terms_freq = terms_freq.rename(columns={'index':'word', 0:'count'}).sort_values('count',ascending=False)

    return terms_freq

def get_sentence(sentence, storage, stop_words=None):
    # assumed intent: remember the new sentence, then recount everything stored so far
    storage['sentences'].append(sentence)
    corpus = ' '.join(s for s in storage['sentences'])
    return terms_frequency(corpus,stop_words)

# tests
STOP_WORDS = 'in the and an - '
storage = defaultdict(list)

S1 = '[first] - missing "" in the text [first] word'
# assumed: the original snippet does not show the calls; presumably each result was printed
print(get_sentence(S1, storage, STOP_WORDS))

print('\nNext S2')
S2 = 'error: wrong word in the [second]  text'
print(get_sentence(S2, storage, STOP_WORDS))


