Python Pandas DataFrame cell changes disappear

I'm new to Python and pandas and I'm trying to manipulate a CSV data file. I load two DataFrames: one contains a column with keywords, and the other is a "bagOfWords" with "id" and "word" columns. What I want to do is add a column to the first DataFrame with the ids of the keywords as a "list string", like so: "[1,2,8,99 ...]".

This is what I have come up with so far:

import pandas as pd

websitesAlchData = pd.read_csv('websitesAlchData.csv', sep=';', index_col='referer', encoding="utf-8")

bagOfWords = pd.read_csv('bagOfWords.csv', sep=';', header=0, names=["id","words","count"], encoding="utf-8")
a = set(bagOfWords['words'])
websitesAlchData['keywordIds'] = "[]"
for i in websitesAlchData.index:
    keywords = websitesAlchData.loc[i,'keywords']
    try:
        keywordsSet = set([ s.lower() for s in keywords.split(",") ])
    except:
        keywordsSet = set()
    existingWords = a & keywordsSet
    lista = []
    for i in bagOfWords.index:
        if bagOfWords.loc[i,'words'] in existingWords:
            lista.append(bagOfWords.loc[i,'id'])

    websitesAlchData.loc[i,'keywordIds'] = str(lista)
    print(str(lista))
    print(websitesAlchData.loc[i,'keywordIds'])
websitesAlchData.reset_index(inplace=True)
websitesAlchData.to_csv(path_or_buf = 'websitesAlchDataKeywordCode.csv', index=False, sep=";", encoding="utf-8")

The two prints at the end of the for loop give the expected results, but when I print the whole DataFrame "websitesAlchData", the column "keywordIds" is still "[]", and the same is true in the resulting .csv.

My guess would be that I create a copy somewhere, but I can't see where.

Any ideas what is wrong here, or how to do the same thing differently? Thanks!

UPDATE:

The websitesAlchData.csv looks like this:

referer;category;keywords
url;int;word0,word2,word3
url;int;word1,word3
...

And the bag of words csv:

id;index;count
0;word0;11
1;word1;14
2;word2;14
3;word3;14
...

Expected output:

referer;category;keywords;keywordIds
url;int;word0,word2,word3;[0,2,3]
url;int;word1,word3;[1,3]

There's definitely something wrong with using i for both for loops: by the time websitesAlchData.loc[i,'keywordIds'] is assigned, i holds the last label of bagOfWords.index rather than the current row of websitesAlchData. Change that and see if that helps.

I'd try something like this. You'll want to profile the performance on the larger dataset.
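
For illustration, here is a minimal sketch of the question's loop with a separate name for each loop variable. The two small frames are made-up stand-ins for the CSV files (the index labels and category values are hypothetical), and the names row_label, bag_label and known_words are introduced here, not taken from the original code. With a shared i, the final assignment most likely wrote to a new row labelled with the last bagOfWords index (pandas enlarges a frame when .loc is assigned a label that does not exist yet), which would also explain why the two prints looked correct.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files in the question.
websitesAlchData = pd.DataFrame(
    {"category": [1, 2], "keywords": ["word0,word2,word3", "word1,word3"]},
    index=pd.Index(["url_a", "url_b"], name="referer"),
)
bagOfWords = pd.DataFrame(
    {"id": [0, 1, 2, 3],
     "words": ["word0", "word1", "word2", "word3"],
     "count": [11, 14, 14, 14]}
)

known_words = set(bagOfWords["words"])
websitesAlchData["keywordIds"] = "[]"

for row_label in websitesAlchData.index:            # outer loop variable
    keywords = websitesAlchData.loc[row_label, "keywords"]
    try:
        keyword_set = {s.lower() for s in keywords.split(",")}
    except AttributeError:                           # e.g. NaN (a float) has no .split
        keyword_set = set()
    existing = known_words & keyword_set

    ids = []
    for bag_label in bagOfWords.index:               # inner loop uses its own name
        if bagOfWords.loc[bag_label, "words"] in existing:
            ids.append(int(bagOfWords.loc[bag_label, "id"]))

    # row_label still refers to the current row of websitesAlchData here.
    websitesAlchData.loc[row_label, "keywordIds"] = str(ids)

print(websitesAlchData["keywordIds"].tolist())
# ['[0, 2, 3]', '[1, 3]']
```

The frame keeps its two original rows, and each one now carries its own id list.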

In [146]: df1
Out[146]: 
  referer category           keywords
0     url      int  word0,word2,word3
1     url      int        word1,word3

[2 rows x 3 columns]

In [147]: df2
Out[147]: 
       id  count
index           
word0   0     11
word1   1     14
word2   2     14
word3   3     14

[4 rows x 2 columns]

Split the keywords column into a list of words. Generally, storing lists in DataFrames is a bad idea performance-wise, but this is the most straightforward way for now.

In [148]: vals = df1.keywords.str.split(',')

In [149]: vals
Out[149]: 
0    [word0, word2, word3]
1           [word1, word3]
Name: keywords, dtype: object

Then apply a lookup from df2 to each element of the lists in vals:

In [151]: ids = vals.apply(lambda x: [df2.loc[y, 'id'] for y in x])

In [152]: ids
Out[152]: 
0    [0, 2, 3]
1       [1, 3]
Name: keywords, dtype: object
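
The lookup above assumes every keyword exists in df2 and that no keywords cell is missing; df2.loc raises a KeyError for an unknown word. A minimal sketch of a more tolerant variant follows, assuming it is acceptable to skip unknown words and to lowercase keywords as the question's loop does. The sample rows and the names word_to_id and lookup_ids are hypothetical. Note that the ids come out in keyword order here, not in bag order.

```python
import pandas as pd

# Hypothetical sample data, including an unknown word and a missing cell.
df1 = pd.DataFrame({
    "referer": ["url", "url", "url"],
    "category": [1, 2, 3],
    "keywords": ["word0,Word2,word3", "word1,unknown", None],
})
df2 = pd.DataFrame(
    {"id": [0, 1, 2, 3], "count": [11, 14, 14, 14]},
    index=pd.Index(["word0", "word1", "word2", "word3"], name="index"),
)

# Plain dict from word to id, so each lookup is a cheap hash access.
word_to_id = {word: int(word_id) for word, word_id in df2["id"].items()}

def lookup_ids(cell):
    """Return the ids of the known words in a comma-separated keyword string."""
    if not isinstance(cell, str):        # missing keywords (None or NaN)
        return []
    words = (w.strip().lower() for w in cell.split(","))
    return [word_to_id[w] for w in words if w in word_to_id]

ids = df1["keywords"].apply(lookup_ids)
print(ids.tolist())
# [[0, 2, 3], [1], []]
```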

Finally, concat:

In [154]: df = pd.concat([df1, ids], axis=1)

In [155]: df
Out[155]: 
  referer category           keywords   keywords
0     url      int  word0,word2,word3  [0, 2, 3]
1     url      int        word1,word3     [1, 3]

[2 rows x 4 columns]
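
The concatenated frame above ends up with two columns named keywords. One possible way to get the keywordIds column and the "[0,2,3]" text form from the expected output is sketched below; the category values are hypothetical, and an in-memory buffer stands in for the output file.

```python
import io
import pandas as pd

# Hypothetical sample data mirroring df1 and ids from the steps above.
df1 = pd.DataFrame({
    "referer": ["url", "url"],
    "category": [1, 2],
    "keywords": ["word0,word2,word3", "word1,word3"],
})
ids = pd.Series([[0, 2, 3], [1, 3]], name="keywords")

# rename() gives the id column its own label before concatenating.
df = pd.concat([df1, ids.rename("keywordIds")], axis=1)

# Store each list as text in the "[0,2,3]" form from the expected output.
df["keywordIds"] = df["keywordIds"].apply(
    lambda lst: "[" + ",".join(str(i) for i in lst) + "]"
)

buffer = io.StringIO()                   # stand-in for a real file path
df.to_csv(buffer, sep=";", index=False)
print(buffer.getvalue())
```

Because the separator is ";", the commas inside the keywords and keywordIds fields do not force quoting in the written file.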
