[英]more efficient column filling along rows in pandas dataframe
I am building a matrix of 1 and -1 based on the words in an existing column.我正在根据现有列中的单词构建 1 和 -1 的矩阵。 It is for a basic neural.network.
它用于基本的神经网络。 the data is setup as follows:
数据设置如下:
data_table data_table_final
index word_list index word_list cat hat mouse house run into hills your lives
0 the cat in the hat 0 the cat in the hat 1 1 -1 -1 -1 -1 -1 -1 -1
1 mouse in the house 1 mouse in the house -1 -1 1 1 -1 -1 -1 -1 -1
2 run into the hills --> 2 run into the hills -1 -1 -1 -1 1 1 1 -1 -1
3 run for your lives 3 run for your lives -1 -1 -1 -1 1 -1 -1 1 1
to generate this dataframe I perform the following:生成这个 dataframe 我执行以下操作:
word_list = ' '.join(sorted(data_table['world_list']))
stop_words = ['the','and', 'a', 'in','is', 'to', 'at','by', '']
pat = r'\b(?:{})\b'.format('|'.join(stop_words))
word_list = re.sub(pat,'',word_list)
word_list = word_list.split(' ')
word_list_2 = dict.fromkeys(word_list)
word_df = pd.DataFrame(-np.ones([len(data_table),len(word_list_2)]), columns=word_list_2)
if '' in word_df.keys():
word_df = word_df.drop(columns=(''))
word_df = pd.concat([data_table,word_df],1)
for idx, it in data_table_final.iterrows():
for word in it.word_list.split(" "):
data_table_final.loc[idx,word] = 1
This is rather slow when the number of entries in the table grows.当表中的条目数增加时,这会相当慢。 And I think there should be a way to perform this without having to iterate over the rows of the dataframe. I thought about using
zip
but all instance I have used it point to multiple outputs as opposed to a table of multiple columns.而且我认为应该有一种方法可以执行此操作而不必遍历 dataframe 的行。我考虑过使用
zip
但我使用它的所有实例都指向多个输出而不是多列表。
Is there a more efficient way to perform the task without iterating over the table?有没有更有效的方法来执行任务而无需遍历表?
Scikit-learn has a CountVectorizer
class that you can use for this purpose. Scikit-learn 有一个
CountVectorizer
class,您可以将其用于此目的。 By default, it returns the counts for each non stop-word as integers, but you can easily transform those into your 1/-1 encoding (I added the 'cat cat cat' string for testing):默认情况下,它以整数形式返回每个非停用词的计数,但您可以轻松地将它们转换为 1/-1 编码(我添加了 'cat cat cat' 字符串进行测试):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
data_table = pd.Series(['the cat in the hat',
'mouse in the house',
'run into the hills',
'run for your lives',
'cat cat cat'],
name='word_list')
vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(data_table)
word_df = pd.concat([data_table,
pd.DataFrame((X.toarray() > 0) * 2 - 1,
columns=vectorizer.get_feature_names_out())],
axis=1)
print(word_df)
word_list cat hat hills house into lives mouse run your
0 the cat in the hat 1 1 -1 -1 -1 -1 -1 -1 -1
1 mouse in the house -1 -1 -1 1 -1 -1 1 -1 -1
2 run into the hills -1 -1 1 -1 1 -1 -1 1 -1
3 run for your lives -1 -1 -1 -1 -1 1 -1 1 1
4 cat cat cat 1 -1 -1 -1 -1 -1 -1 -1 -1
As you can see, the columns are ordered lexicographically by default.如您所见,列默认按字典顺序排列。
Not sure if it's faster but you can try using collection.Counter不确定它是否更快,但您可以尝试使用collection.Counter
from collections import Counter
df = pd.DataFrame([
"the cat in the hat",
"mouse in the house",
"run into the hills",
"run for your lives"
])
df2 = pd.DataFrame(
df[0].map(
lambda x: dict(Counter(x.split()))
).to_list()).fillna(-1)
df2[df2>0] = 1
print(df2)
the cat in hat mouse house run into hills for your lives
0 1.0 1.0 1.0 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
1 1.0 -1.0 1.0 -1.0 1.0 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
2 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 1.0 1.0 1.0 -1.0 -1.0 -1.0
3 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 1.0 1.0 1.0
You may filter out unnecessary columns then merge with the original dataframe.您可以过滤掉不需要的列,然后与原始 dataframe 合并。
By similar way, you can use pd.unique + zip
.通过类似的方式,您可以使用
pd.unique + zip
。 I think it could be slower as I've used map twice but there might be a faster way我认为它可能会更慢,因为我已经使用了两次 map 但可能有更快的方法
df2 = pd.DataFrame(df[0].map(
lambda x: pd.unique(x.split())
).map(
lambda x: dict(zip(x, [1]*len(x)))
).to_list()).fillna(-1)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.