
More efficient column filling along rows in a pandas dataframe

I am building a matrix of 1 and -1 based on the words in an existing column. It is for a basic neural network. The data is set up as follows:

data_table                                data_table_final
index            word_list                index            word_list   cat   hat  mouse  house  run into  hills your lives
    0   the cat in the hat                    0   the cat in the hat     1     1     -1     -1   -1   -1     -1    -1   -1
    1   mouse in the house                    1   mouse in the house    -1    -1      1      1   -1   -1     -1    -1   -1
    2   run into the hills         -->        2   run into the hills    -1    -1     -1     -1    1    1      1    -1   -1
    3   run for your lives                    3   run for your lives    -1    -1     -1     -1    1   -1     -1     1    1

To generate this dataframe I do the following:

    import re
    import numpy as np
    import pandas as pd

    word_list = ' '.join(sorted(data_table['word_list']))
    stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', '']

    pat = r'\b(?:{})\b'.format('|'.join(stop_words))
    word_list = re.sub(pat, '', word_list)

    word_list = word_list.split(' ')
    word_list_2 = dict.fromkeys(word_list)

    word_df = pd.DataFrame(-np.ones([len(data_table), len(word_list_2)]), columns=word_list_2)
    if '' in word_df.keys():
        word_df = word_df.drop(columns='')

    data_table_final = pd.concat([data_table, word_df], axis=1)

    for idx, it in data_table_final.iterrows():
        for word in it.word_list.split(" "):
            if word in data_table_final.columns:  # skip stop words
                data_table_final.loc[idx, word] = 1

This becomes quite slow as the number of entries in the table grows, and I think there should be a way to do this without iterating over the rows of the dataframe. I thought about using zip, but every instance where I have used it produced multiple outputs rather than a table with multiple columns.

Is there a more efficient way to perform this task without iterating over the table?

Scikit-learn has a CountVectorizer class that you can use for this purpose. By default it returns the count of each non-stop word as an integer, but you can easily transform those counts into your 1/-1 encoding (I added the 'cat cat cat' string for testing):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
data_table = pd.Series(['the cat in the hat',
                        'mouse in the house',
                        'run into the hills',
                        'run for your lives',
                        'cat cat cat'],
                       name='word_list')

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(data_table)

word_df = pd.concat([data_table, 
                     pd.DataFrame((X.toarray() > 0) * 2 - 1, 
                         columns=vectorizer.get_feature_names_out())], 
                    axis=1)
print(word_df)
            word_list  cat  hat  hills  house  into  lives  mouse  run  your
0  the cat in the hat    1    1     -1     -1    -1     -1     -1   -1    -1
1  mouse in the house   -1   -1     -1      1    -1     -1      1   -1    -1
2  run into the hills   -1   -1      1     -1     1     -1     -1    1    -1
3  run for your lives   -1   -1     -1     -1    -1      1     -1    1     1
4         cat cat cat    1   -1     -1     -1    -1     -1     -1   -1    -1

As you can see, the columns are ordered lexicographically by default.

Not sure if it's faster, but you can try using collections.Counter:

from collections import Counter
import pandas as pd

df = pd.DataFrame([
    "the cat in the hat", 
    "mouse in the house",
    "run into the hills",
    "run for your lives"
])            
df2 = pd.DataFrame(
    df[0].map(
        lambda x: dict(Counter(x.split()))
    ).to_list()).fillna(-1)
df2[df2>0] = 1
print(df2)
   the  cat   in  hat  mouse  house  run  into  hills  for  your  lives
0  1.0  1.0  1.0  1.0   -1.0   -1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
1  1.0 -1.0  1.0 -1.0    1.0    1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
2  1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0   1.0    1.0 -1.0  -1.0   -1.0
3 -1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0  -1.0   -1.0  1.0   1.0    1.0

You can filter out the unnecessary columns (the stop words) and then merge the result with the original dataframe.
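That filtering and merging step could look like the following sketch, which drops the stop-word columns from the Counter-based encoding and attaches it to the original text column (the `stop_words` list and the `word_list` column name are taken from the question):

```python
from collections import Counter
import pandas as pd

stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
df = pd.DataFrame(["the cat in the hat",
                   "mouse in the house",
                   "run into the hills",
                   "run for your lives"])

# Counter-based 1/-1 encoding, as in the answer above
df2 = pd.DataFrame(
    df[0].map(lambda x: dict(Counter(x.split()))).to_list()).fillna(-1)
df2[df2 > 0] = 1

# drop the stop-word columns, then attach the encoding to the original text
df2 = df2.drop(columns=[c for c in df2.columns if c in stop_words])
result = pd.concat([df.rename(columns={0: 'word_list'}), df2], axis=1)
```

Since both frames share the same default integer index, `pd.concat(..., axis=1)` lines the rows up without an explicit merge key.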

In a similar way, you can use pd.unique + zip. I think it could be slower since I've used map twice, but there might be a faster way:

df2 = pd.DataFrame(df[0].map(
    lambda x: pd.unique(x.split())
).map(
    lambda x: dict(zip(x, [1]*len(x)))
).to_list()).fillna(-1)
