Python Pandas: handling special characters in strings

Question

I wrote a function that I later want to apply to a dataframe.


def get_word_count(text,df):
    #text is a lowercase list of words
    #df is a dataframe with 2 columns: word and count
    #this function updates the word  counts


    #f=open('stopwords.txt','r')
    #stopwords=f.read()
    stopwords='in the and an - '

    for word in text:
        if word not in stopwords:

            if df['word'].str.contains(word).any():
                df.loc[df['word']==word, 'count']=df['count']+1
            else:
                df.loc[0]=[word,1]
                df.index=df.index+1

    return df

Then I test it:


word_df=pd.DataFrame(columns=['word','count'])
sentence1='[first] - missing "" in the text [first] word'.split()
y=get_word_count(sentence1, word_df)
sentence2="error: wrong word in the [second]  text".split()
y=get_word_count(sentence2, word_df)
y

I get the following result:

 
Word     Count

[first]    2    
missing    1 
""         1
text       2
word       2
error:     1
wrong      1

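So where is the [second] from sentence2?
If I omit one of the square brackets, I get an error message. How do I handle words that contain special characters? Note that I don't want to strip them out, because if I did, I would lose the "" in sentence1.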

Answer 1

The problem comes from this line:

if df['word'].str.contains(word).any():

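This checks whether any entry in the word column contains the given word, not whether one equals it. In addition, str.contains treats its argument as a regular expression by default, so [second] is read as a character class (any one of the letters s, e, c, o, n, d). That is why df['word'].str.contains(word) reports True when given [second] and compared against [first], and why a word with an unclosed [ raises an error. The code then takes the update branch, where df['word']==word matches no row, so [second] is never added.

As a quick fix, I changed the line to: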

if word in df['word'].tolist():
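
To make the difference concrete, here is a minimal sketch comparing the regex-based check with literal and exact alternatives. The Series `words` and the result names are made up for illustration; `regex=False` and `re.escape` are standard pandas and Python features.

```python
import re
import pandas as pd

# Illustrative data: words already stored in the 'word' column
words = pd.Series(["[first]", "missing", "text"])

# Default behaviour: the pattern is a regular expression, so "[second]" is a
# character class that matches any single letter among s, e, c, o, n, d.
regex_hit = words.str.contains("[second]").any()

# Literal substring search: brackets are treated as ordinary characters.
literal_hit = words.str.contains("[second]", regex=False).any()

# Same idea, keeping regex mode but escaping the special characters.
escaped_hit = words.str.contains(re.escape("[second]")).any()

# Exact equality, which is what a word counter actually needs.
exact_hit = words.eq("[second]").any()
exact_first = words.eq("[first]").any()

print(regex_hit, literal_hit, escaped_hit, exact_hit, exact_first)
```

Note that a literal substring search still differs from an exact match (for example, "text" is a substring of "context"), so an equality test or the `in ... tolist()` check above is the safer choice for counting whole words.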

Answer 2

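Building a DataFrame row by row in a loop like this is not recommended. You should do something like this instead:

import pandas as pd

stopwords = 'in the and an - '.split()
# sentence1 and sentence2 are the already-split word lists from the question
words = sentence1 + sentence2
df = pd.DataFrame({'Words': words})
# count the occurrences of each word, then drop the stopwords
df = df.groupby('Words').size().reset_index(name='counts')
df = df[~df['Words'].isin(stopwords)]
print(df)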

       Words  counts
0         ""       1
2    [first]       2
3   [second]       1
4     error:       1
6    missing       1
7       text       2
9       word       2
10     wrong       1
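
The same vectorised idea can also be written with `Series.value_counts`. This is only a sketch of an alternative, not the answerer's code; the names `words` and `counts` are illustrative, and the two sentences are copied from the question.

```python
import pandas as pd

# Stop words as a set, so membership is tested per whole word
stopwords = set('in the and an -'.split())

sentence1 = '[first] - missing "" in the text [first] word'.split()
sentence2 = 'error: wrong word in the [second]  text'.split()

# One Series holding every word, filtered and counted in a single pass
words = pd.Series(sentence1 + sentence2)
counts = words[~words.isin(stopwords)].value_counts()

print(counts)
```

Because the words are compared by equality and never used as patterns, brackets, quotes and colons need no special handling.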

Answer 3

I rebuilt it so that you can add sentences and watch the frequencies grow.

from collections import Counter
from collections import defaultdict

import pandas as pd

def terms_frequency(corpus, stop_words=None):

    '''
    Takes in a text and returns a pandas DataFrame of word frequencies.
    stop_words is a whitespace-separated string of words to ignore.

    '''
    corpus_ = corpus.split()

    # remove stop words; split the string first so membership is tested
    # per whole word rather than as a substring of the stop-word string
    stop_set = set(stop_words.split()) if stop_words else set()
    terms = [word for word in corpus_ if word not in stop_set]

    # one row per distinct word, most frequent first
    terms_freq = pd.DataFrame(list(Counter(terms).items()), columns=['word', 'count'])
    terms_freq = terms_freq.sort_values('count', ascending=False).reset_index(drop=True)

    return terms_freq


def get_sentence(sentence, storage, stop_words=None):
    storage['sentences'].append(sentence)
    corpus = ' '.join(s for s in storage['sentences'])
    return terms_frequency(corpus,stop_words)



# tests
STOP_WORDS = 'in the and an - '
storage = defaultdict(list)

S1 = '[first] - missing "" in the text [first] word'
print(get_sentence(S1,storage,STOP_WORDS))

print('\nNext S2')
S2 = 'error: wrong word in the [second]  text'

print(get_sentence(S2,storage,STOP_WORDS))
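
If re-joining and re-counting every stored sentence becomes costly, one possible variation is to keep a running `Counter` and update it in place. This is a sketch under that assumption; `add_sentence`, `running` and `STOP` are hypothetical names, not part of the answer above.

```python
from collections import Counter

import pandas as pd

# Stop words as a set, so each word is compared as a whole
STOP = set('in the and an -'.split())

# Running tally that persists between calls
running = Counter()


def add_sentence(sentence, counter, stop_words=STOP):
    """Add the sentence's non-stop words to counter and return a snapshot."""
    counter.update(w for w in sentence.split() if w not in stop_words)
    # most_common() yields (word, count) pairs, most frequent first
    return pd.DataFrame(counter.most_common(), columns=['word', 'count'])


first = add_sentence('[first] - missing "" in the text [first] word', running)
second = add_sentence('error: wrong word in the [second]  text', running)
print(second)
```

Each call only touches the words of the new sentence, and the DataFrame is rebuilt from the tally when a snapshot is needed.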

3 answers

Answer 1  0  accepted  2020-05-15 17:54:56

Answer 2  0  2020-05-15 18:06:44

Answer 3  0  2020-05-15 18:32:10
