Python Pandas DataFrame cell changes disappear

Question

I'm new to Python and pandas and I'm trying to manipulate a CSV data file. I load two DataFrames: one contains a column with keywords, and the other is a "bagOfWords" with "id" and "word" columns. What I want to do is add a column to the first DataFrame holding the ids of its keywords as a "list string", like so: "[1,2,8,99 ...]".

This is what I have come up with so far:

import pandas as pd

websitesAlchData = pd.io.parsers.read_csv('websitesAlchData.csv', sep=';', index_col='referer', encoding="utf-8")

bagOfWords = pd.io.parsers.read_csv('bagOfWords.csv', sep=';', header=0, names=["id","words","count"], encoding="utf-8")
a = set(bagOfWords['words'])
websitesAlchData['keywordIds'] = "[]"
for i in websitesAlchData.index:
    keywords = websitesAlchData.loc[i,'keywords']
    try:
        keywordsSet = set([ s.lower() for s in keywords.split(",") ])
    except:
        keywordsSet = set()
    existingWords = a & keywordsSet
    lista = []
    for i in bagOfWords.index:
        if bagOfWords.loc[i,'words'] in existingWords:
            lista.append(bagOfWords.loc[i,'id'])

    websitesAlchData.loc[i,'keywordIds'] = str(lista)
    print(str(lista))
    print(websitesAlchData.loc[i,'keywordIds'])
websitesAlchData.reset_index(inplace=True)
websitesAlchData.to_csv(path_or_buf = 'websitesAlchDataKeywordCode.csv', index=False, sep=";", encoding="utf-8")

The two prints at the end of the for loop give the expected results, but when I print the whole DataFrame "websitesAlchData" the column "keywordIds" is still "[]", and the same is true in the resulting .csv.

My guess is that I create a copy somewhere, but I can't see where.

Any ideas what is wrong here, or how to do the same thing differently? Thanks!

UPDATE:

websitesAlchData.csv looks like this:

referer;category;keywords
url;int;word0,word2,word3
url;int;word1,word3
...

And the bag-of-words CSV:

id;index;count
0;word0;11
1;word1;14
2;word2;14
3;word3;14
...

Expected output:

referer;category;keywords;keywordIds
url;int;word0,word2,word3;[0,2,3]
url;int;word1,word3;[1,3]

Answer 1

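There's definitely something wrong with using i for both for loops. The inner loop overwrites i, so by the time the assignment to 'keywordIds' runs, i holds the last bagOfWords index label rather than the current referer, and the value is written to the wrong row. Change that and see if that helps.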
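To make the suggestion concrete, here is a minimal sketch of the question's loop with the two loop variables given different names. It is not the asker's exact script: the CSV contents are inlined sample data, the referer values `url_a` and `url_b` and the integer categories are made up for illustration (the sketch assumes referers are unique), and the bare `except` is replaced by an explicit string check.

```python
import io
import pandas as pd

# Small stand-ins for the two CSV files (hypothetical sample data).
websites_csv = (
    "referer;category;keywords\n"
    "url_a;1;word0,word2,word3\n"
    "url_b;2;word1,word3\n"
)
bag_csv = "id;index;count\n0;word0;11\n1;word1;14\n2;word2;14\n3;word3;14\n"

websitesAlchData = pd.read_csv(io.StringIO(websites_csv), sep=";", index_col="referer")
bagOfWords = pd.read_csv(
    io.StringIO(bag_csv), sep=";", header=0, names=["id", "words", "count"]
)

known_words = set(bagOfWords["words"])
websitesAlchData["keywordIds"] = "[]"

for referer in websitesAlchData.index:  # outer loop: one website row at a time
    keywords = websitesAlchData.loc[referer, "keywords"]
    # Missing keywords are read as NaN (a float), so only split real strings.
    if isinstance(keywords, str):
        keyword_set = {s.lower() for s in keywords.split(",")}
    else:
        keyword_set = set()
    existing_words = known_words & keyword_set

    ids = []
    for row in bagOfWords.index:  # inner loop: a different name, so referer survives
        if bagOfWords.loc[row, "words"] in existing_words:
            ids.append(int(bagOfWords.loc[row, "id"]))

    # referer still names the current website row here.
    websitesAlchData.loc[referer, "keywordIds"] = str(ids)

print(websitesAlchData)
```

With the original code, the final assignment used the inner loop's last value of i as the row label, which is why the prints looked right while the real rows stayed at "[]".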

Answer 2

I'd try something like this. You'll want to profile the performance on the larger dataset.

In [146]: df1
Out[146]: 
  referer category           keywords
0     url      int  word0,word2,word3
1     url      int        word1,word3

[2 rows x 3 columns]

In [147]: df2
Out[147]: 
       id  count
index           
word0   0     11
word1   1     14
word2   2     14
word3   3     14

[4 rows x 2 columns]

Split the keywords column into a list of words. Storing lists in DataFrames is generally a bad idea performance-wise, but this is the most straightforward way for now.

In [148]: vals = df1.keywords.str.split(',')

In [149]: vals
Out[149]: 
0    [word0, word2, word3]
1           [word1, word3]
Name: keywords, dtype: object

Then apply a lookup from df2 to each element of the lists in vals:

In [151]: ids = vals.apply(lambda x: [df2.loc[y, 'id'] for y in x])

In [152]: ids
Out[152]: 
0    [0, 2, 3]
1       [1, 3]
Name: keywords, dtype: object
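One caveat: `df2.loc[y, 'id']` raises a KeyError for any keyword that is not in the bag of words, and it is case-sensitive, whereas the question lowercases keywords and ignores unknown ones. A possible variant, sketched here with a plain dict lookup, is shown below. The helper name `lookup_ids` and the mixed-case and unknown sample keywords are assumptions added for illustration. Note that ids come out in keyword order, not in bag-of-words order as in the question's loop.

```python
import pandas as pd

# Sample frames shaped like df1 and df2 above; the second keywords cell is
# altered (hypothetically) to include a capitalised and an unknown word.
df1 = pd.DataFrame(
    {
        "referer": ["url", "url"],
        "category": ["int", "int"],
        "keywords": ["word0,word2,word3", "Word1,unknown,word3"],
    }
)
df2 = pd.DataFrame(
    {"id": [0, 1, 2, 3], "count": [11, 14, 14, 14]},
    index=pd.Index(["word0", "word1", "word2", "word3"], name="index"),
)

# Build the word -> id mapping once instead of indexing df2 per keyword.
word_to_id = df2["id"].to_dict()


def lookup_ids(cell):
    """Return the ids of the known keywords in one comma-separated cell."""
    if not isinstance(cell, str):  # NaN or other non-string cell
        return []
    words = (w.strip().lower() for w in cell.split(","))
    return [int(word_to_id[w]) for w in words if w in word_to_id]


ids = df1["keywords"].apply(lookup_ids)
print(ids)
```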

Finally, concat:

In [154]: df = pd.concat([df1, ids], axis=1)

In [155]: df
Out[155]: 
  referer category           keywords   keywords
0     url      int  word0,word2,word3  [0, 2, 3]
1     url      int        word1,word3     [1, 3]

[2 rows x 4 columns]
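The concatenated frame ends up with two columns named keywords. To get closer to the expected output in the question, one option is to rename the ids Series to keywordIds and turn each list into a compact string before concatenating. This is only a sketch: `ids` is rebuilt by hand from the output shown above, and the CSV is written to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

df1 = pd.DataFrame(
    {
        "referer": ["url", "url"],
        "category": ["int", "int"],
        "keywords": ["word0,word2,word3", "word1,word3"],
    }
)
# Hand-built copy of the ids Series from the previous step.
ids = pd.Series([[0, 2, 3], [1, 3]], name="keywords")

# Format each list as "[0,2,3]" and give the column its own name.
keyword_ids = ids.apply(
    lambda xs: "[" + ",".join(str(x) for x in xs) + "]"
).rename("keywordIds")

df = pd.concat([df1, keyword_ids], axis=1)

# Write with ';' as separator, as in the question (buffer used for the demo).
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)
csv_lines = buffer.getvalue().splitlines()
print("\n".join(csv_lines))
```

Storing the ids as a string keeps the CSV readable, at the cost of having to parse the column again when the file is loaded.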


Answer 1
0, accepted, 2014-03-03 14:42:35

Answer 2
0, 2014-03-03 14:43:42
