Pandas astype() error in Power BI but not in Jupyter Notebook

I have the following topic modelling script, which assigns topic categories to a variety of documents. The documents are imported through Power BI via df = dataset['Comment'] (in a Power BI Python script step, the input table is exposed as a pandas DataFrame named dataset).

import csv
import pandas as pd
import numpy as np

data_text = pd.DataFrame(df, columns=['text'])

# set the number of topics 
total_topics = 3

# process the data
from nltk.tokenize import word_tokenize
from collections import defaultdict
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from gensim.parsing.preprocessing import remove_stopwords
from nltk.corpus import stopwords
# the NLTK corpora used below (e.g. 'wordnet', 'stopwords', 'averaged_perceptron_tagger')
# must be downloaded once via nltk.download(...)

# remove stopwords and tokenize the text
custom_stops = ["stopword1", "stopword2", "stopword3"]
data_text['filtered_text'] = data_text['text'].apply(lambda x: remove_stopwords(x.lower()))
data_text['filtered_text'] = data_text['filtered_text'].apply(lambda x: str.split(x))
data_text['filtered_text'] = data_text['filtered_text'].apply(lambda x: [item for item in x if item.lower() not in custom_stops])
CORPUS = pd.DataFrame(data_text['filtered_text'])

# Drop missing rows (note: dropna removes NaN values, not empty strings)
CORPUS.dropna(inplace=True)
# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
# lemmatize the text
for index, entry in enumerate(CORPUS['filtered_text']):
    # list to collect the lemmatized words that pass the filters below
    Final_words = []
    # initialize the WordNetLemmatizer
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag provides the tag, i.e. whether the word is a noun (N), verb (V), etc.
    for word, tag in pos_tag(entry):
        # keep only alphabetic tokens that are not stopwords
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # store the processed words for each row in 'text_final' (stringified for the vectorizer)
    CORPUS.loc[index, 'text_final'] = str(Final_words)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_feature_matrix(documents, feature_type='frequency'):
    feature_type = feature_type.lower().strip()  
    if feature_type == 'binary':
        vectorizer = CountVectorizer(binary=True, min_df=1,ngram_range=(1, 1))
    elif feature_type == 'frequency':
        vectorizer = CountVectorizer(binary=False, min_df=1,ngram_range=(1, 1))
    elif feature_type == 'tfidf':
        vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
    else:
        raise Exception("Wrong feature type entered. Possible values: 'binary', 'frequency', 'tfidf'")
    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    return vectorizer, feature_matrix

# create a feature matrix
# *******HERE IS WHERE THE DATATYPE ERROR OCCURS IN POWER BI*********
vectorizer, tfidf_matrix = build_feature_matrix(CORPUS['text_final'], feature_type='tfidf')
td_matrix = tfidf_matrix.transpose()
td_matrix = td_matrix.multiply(td_matrix > 0)

from sklearn.decomposition import NMF
# note: scikit-learn >= 1.2 replaces NMF's alpha/l1_ratio arguments with alpha_W/alpha_H
nmf = NMF(n_components=total_topics, random_state=42, alpha=.1, l1_ratio=.5)
nmf.fit(tfidf_matrix)
    
def get_topics_terms_weights(weights, feature_names):
    feature_names = np.array(feature_names)
    sorted_indices = np.array([list(row[::-1])
                               for row
                               in np.argsort(np.abs(weights))])
    sorted_weights = np.array([list(wt[index])
                               for wt, index
                               in zip(weights, sorted_indices)])
    sorted_terms = np.array([list(feature_names[row])
                             for row
                             in sorted_indices])

    topics = [np.vstack((terms.T,
                         term_weights.T)).T
              for terms, term_weights
              in zip(sorted_terms, sorted_weights)]
    return topics
    
def print_topics_udf(topics, total_topics=1,
                     weight_threshold=0.0001,
                     display_weights=False,
                     num_terms=None):

    for index in range(total_topics):
        topic = topics[index]
        topic = [(term, float(wt))
                 for term, wt in topic]
        topic = [(word, round(wt, 2))
                 for word, wt in topic
                 if abs(wt) >= weight_threshold]

        if display_weights:
            print('Topic #' + str(index + 1) + ' with weights')
            print(topic[:num_terms] if num_terms else topic)
        else:
            print('Topic #' + str(index + 1) + ' without weights')
            tw = [term for term, wt in topic]
            print(tw[:num_terms] if num_terms else tw)
        print()
    
# note: get_feature_names() was removed in scikit-learn >= 1.2; use get_feature_names_out() there
feature_names = vectorizer.get_feature_names()
weights = nmf.components_

topics = get_topics_terms_weights(weights, feature_names)
# print topics without weights
# print_topics_udf(topics=topics, total_topics=total_topics, num_terms=None, display_weights=False)
# print topics with weights
# print_topics_udf(topics=topics, total_topics=total_topics, num_terms=None, display_weights=True)

# display the topics
# this takes the top term from each group and assigns it as the topic theme
for index in range(0, total_topics):
    print("Topic", index + 1, "=", topics[index][0][0])
    
# NMF definition: slide 25 of http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf
# [comments x terms] -> NMF = [comments x topics] x [topics x terms]
# = [A] -> NMF = [W] x [H]
# i.e. NMF applied to matrix [A] yields [W] x [H];
# since we want to associate the comments with the topics, we want [W],
# the matrix whose rows are comments and whose columns are topics.
# Thus, the index of the maximum value in a row of W is the assigned topic.
W = nmf.fit_transform(tfidf_matrix)
# e.g., W[5] is the row of topic weights for the sixth comment; np.argmax(W[5]) is its
# strongest topic, so topics[np.argmax(W[5])][0][0] shows the assignment
# Unprocessed text from data_text['text'].iloc[5] is "Items are broken"
    
# assign a topic label to each comment based on the decomposition matrices;
# building the column in one pass avoids pandas' chained-assignment warning
data_text['topic'] = [topics[np.argmax(W[row])][0][0] for row in range(len(data_text))]
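
To spot-check the assignments, you can inspect the labelled frame before it is handed back to Power BI; a minimal sketch:

    # each comment should now carry the top term of its strongest topic as its theme
    print(data_text[['text', 'topic']].head())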

   

This works when I run it in a Jupyter Notebook and import the .csv file directly with:

    with open('C:\\...\\comments.csv', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        df = [tuple(row) for row in reader]  # df is a list of one tuple per row

However, when I use df = dataset['Comment'] in Power BI, I get the following error:

DtypeWarning: Columns (10,13) have mixed types.Specify dtype option on import or set low_memory=False.
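
For reference, the warning's own suggestion only applies when the file is read directly with pandas; a minimal sketch of that route (assuming the same comments.csv, and treating every column as text so pandas never has to guess a dtype):

    # read all columns as strings, or alternatively pass low_memory=False as the warning suggests
    df = pd.read_csv('C:\\...\\comments.csv', dtype=str)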

I've tried casting to strings using the astype() function, but I get the same error.
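
(The question doesn't show the exact cast attempted; a typical version would look something like the hypothetical sketch below.)

    # hypothetical reconstruction of the attempted cast -- not shown in the question
    dataset['Comment'] = dataset['Comment'].astype(str)
    df = dataset['Comment']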

The issue is related to the way datasets are imported in the Power BI Query Editor using Python. To fix it, import the data via:

# import the 'Comment' column from the Power BI dataset
df = pd.DataFrame(dataset.loc[:, 'Comment'])

# create the dataframe used in the pre-processing steps
data_text = pd.DataFrame(df,columns=['Comment'])
# rename the 'Comment' column 'text'
data_text.rename(columns={'Comment':'text'}, inplace=True)

Unlike with R scripts, you can't import with df=dataset['Comment'] as the question does. Without the step above, the df dataframe is effectively null, which throws the dtype errors on methods like lower().
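
A quick way to confirm the import worked is to check the frame's shape and element types before the pre-processing steps; a minimal sanity-check sketch:

    # after the import above: the frame should be non-empty and every entry a str
    print(df.shape)
    print(data_text['text'].map(type).value_counts())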
