Python Pandas: handle special characters in strings
I wrote a function that I want to apply to a DataFrame later.
def get_word_count(text,df):
    #text is a lowercase list of words
    #df is a dataframe with 2 columns: word and count
    #this function updates the word counts
    #f=open('stopwords.txt','r')
    #stopwords=f.read()
    stopwords='in the and an - '
    for word in text:
        if word not in stopwords:
            if df['word'].str.contains(word).any():
                df.loc[df['word']==word, 'count']=df['count']+1
            else:
                df.loc[0]=[word,1]
                df.index=df.index+1
    return df
Then I check it:
word_df=pd.DataFrame(columns=['word','count'])
sentence1='[first] - missing "" in the text [first] word'.split()
y=get_word_count(sentence1, word_df)
sentence2="error: wrong word in the [second] text".split()
y=get_word_count(sentence2, word_df)
y
I get the following results:
Word     Count
[first]  2
missing  1
""       1
text     2
word     2
error:   1
wrong    1
So where is [second] from sentence2? If I omit one of the square brackets, I get an error message. How do I handle words with special characters? Note that I don't want to get rid of them, because if I do, I will miss "" in sentence1.
The problem comes from the line:
if df['word'].str.contains(word).any():
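This reports whether any word in the word column contains the given word. By default, str.contains treats its argument as a regular expression, so [second] is read as a character class that matches any single one of the characters s, e, c, o, n, d. The boolean Series returned by df['word'].str.contains(word) is therefore True for [first] (it contains an s) when [second] is given, so [second] is treated as already present and is never added. Omitting one of the brackets leaves an invalid regular expression, which explains the error message.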
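To make the regex behaviour concrete, here is a minimal sketch. The Series named words is made-up sample data, not the asker's DataFrame. It compares the default pattern matching with two ways of asking for a literal match: regex=False, or escaping the pattern with re.escape.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for df['word']
words = pd.Series(['[first]', 'missing'], dtype=object)

# Default regex=True: '[second]' is a character class, so any word
# containing one of s, e, c, o, n, d matches
as_regex = words.str.contains('[second]')

# Literal substring match: the brackets are ordinary characters
as_literal = words.str.contains('[second]', regex=False)

# Same idea, keeping regex mode but escaping the special characters
as_escaped = words.str.contains(re.escape('[second]'))

print(as_regex.tolist())    # [True, True]
print(as_literal.tolist())  # [False, False]
print(as_escaped.tolist())  # [False, False]

# An unbalanced bracket is not a valid pattern at all, which is
# presumably the error seen when one bracket is omitted
try:
    re.compile('[second')
except re.error as exc:
    print('invalid pattern:', exc)
```

Note that a literal contains is still a substring test, so a word such as first would match [first]; an exact comparison avoids that.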
For a quick fix, I changed the line to:
if word in df['word'].tolist():
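This works because in on a plain list tests whole-word equality instead of pattern matching. A vectorised variant of the same check is sketched below, assuming a DataFrame with the question's word column. The helper name has_word and the sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the question's layout
df = pd.DataFrame({'word': ['[first]', 'missing'], 'count': [2, 1]})

def has_word(df, word):
    """Return True if `word` equals some entry of df['word'] exactly."""
    return bool((df['word'] == word).any())

print(has_word(df, '[first]'))   # True
print(has_word(df, '[second]'))  # False
print(has_word(df, 'first'))     # False, no substring matching

# Caution: `in` applied to the Series itself tests the index labels,
# not the values, which is why the quick fix calls .tolist()
print('[first]' in df['word'])   # False
```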
Building a DataFrame row by row in a loop like that is not recommended. You should do something like this instead:
import pandas as pd

stopwords='in the and an - '
# sentence1 and sentence2 are the word lists from the question (already split)
words = sentence1+sentence2
df = pd.DataFrame({'Words': words})
df = df.groupby(by=['Words'])['Words'].size().reset_index(name='counts')
df = df[~df['Words'].isin(stopwords.split())]
print(df)
Words counts
0 "" 1
2 [first] 2
3 [second] 1
4 error: 1
6 missing 1
7 text 2
9 word 2
10 wrong 1
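The same counts can also be obtained without an explicit groupby. A minimal sketch using Series.value_counts follows, reusing the two sentences from the question. The variable names are illustrative, and stop words are filtered before counting rather than after.

```python
import pandas as pd

stopwords = set('in the and an -'.split())
sentence1 = '[first] - missing "" in the text [first] word'.split()
sentence2 = 'error: wrong word in the [second] text'.split()

# One Series holding every word from both sentences
words = pd.Series(sentence1 + sentence2)

# Drop stop words by exact match, then count what remains;
# no regex is involved, so brackets and quotes are treated as plain text
counts = words[~words.isin(stopwords)].value_counts()
print(counts)
```

value_counts returns a Series indexed by word and sorted by descending frequency, so counts['[second]'] gives the count for a single word.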
I have rebuilt it so that you can add sentences and watch the frequencies grow:
from collections import Counter
from collections import defaultdict
import pandas as pd
def terms_frequency(corpus, stop_words=None):
    '''
    Takes in text and returns a pandas DataFrame of word frequencies
    '''
    corpus_ = corpus.split()
    # remove stop words; split the string so whole words are compared
    stop_set = set(stop_words.split()) if stop_words else set()
    terms = [word for word in corpus_ if word not in stop_set]
    terms_freq = pd.DataFrame.from_dict(Counter(terms), orient='index').reset_index()
    terms_freq = terms_freq.rename(columns={'index':'word', 0:'count'}).sort_values('count',ascending=False)
    # renumber the rows after sorting
    terms_freq = terms_freq.reset_index(drop=True)
    return terms_freq

def get_sentence(sentence, storage, stop_words=None):
    # keep every sentence seen so far and recount the whole corpus
    storage['sentences'].append(sentence)
    corpus = ' '.join(s for s in storage['sentences'])
    return terms_frequency(corpus,stop_words)
# tests
STOP_WORDS = 'in the and an - '
storage = defaultdict(list)
S1 = '[first] - missing "" in the text [first] word'
print(get_sentence(S1,storage,STOP_WORDS))
print('\nNext S2')
S2 = 'error: wrong word in the [second] text'
print(get_sentence(S2,storage,STOP_WORDS))
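Since get_sentence rejoins and recounts every stored sentence on each call, another option is to keep one running Counter and update it per sentence. This is only a rough sketch: add_sentence and running are names invented here, not part of the answer above.

```python
from collections import Counter
import pandas as pd

STOP_WORDS = set('in the and an -'.split())

def add_sentence(sentence, running):
    """Add one sentence's words to `running` and return the current table."""
    # Counter.update adds to existing counts instead of replacing them
    running.update(w for w in sentence.split() if w not in STOP_WORDS)
    # most_common() yields (word, count) pairs, highest count first
    return pd.DataFrame(running.most_common(), columns=['word', 'count'])

running = Counter()
add_sentence('[first] - missing "" in the text [first] word', running)
freq = add_sentence('error: wrong word in the [second] text', running)
print(freq)
```

Only the new sentence is tokenised on each call, and the words are compared by equality, so special characters need no escaping.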