I'm new to Python and pandas and I'm trying to manipulate a CSV data file. I load two dataframes: one contains a column with keywords, and the other is a "bagOfWords" with "id" and "word" columns. What I want to do is add a column to the first dataframe containing the ids of the keywords as a list-like string, e.g. "[1,2,8,99 ...]".
This is what I have come up with so far:
websitesAlchData = pd.io.parsers.read_csv('websitesAlchData.csv', sep=';', index_col='referer', encoding="utf-8")
bagOfWords = pd.io.parsers.read_csv('bagOfWords.csv', sep=';', header=0, names=["id","words","count"], encoding="utf-8")

a = set(bagOfWords['words'])
websitesAlchData['keywordIds'] = "[]"

for i in websitesAlchData.index:
    keywords = websitesAlchData.loc[i,'keywords']
    try:
        keywordsSet = set([ s.lower() for s in keywords.split(",") ])
    except:
        keywordsSet = set()
    existingWords = a & keywordsSet
    lista = []
    for i in bagOfWords.index:
        if bagOfWords.loc[i,'words'] in existingWords:
            lista.append(bagOfWords.loc[i,'id'])
    websitesAlchData.loc[i,'keywordIds'] = str(lista)
    print(str(lista))
    print(websitesAlchData.loc[i,'keywordIds'])

websitesAlchData.reset_index(inplace=True)
websitesAlchData.to_csv(path_or_buf = 'websitesAlchDataKeywordCode.csv', index=False, sep=";", encoding="utf-8")
The two prints at the end of the for loop give the expected results, but when I print the whole dataframe "websitesAlchData" the column "keywordIds" is still "[]", and so it is in the resulting .csv as well.
My guess would be that I create a copy somewhere, but I can't see where.
Any ideas what is wrong here, or how to do the same thing differently? Thanks!
UPDATE:
The websitesAlchData.csv looks like this:
referer;category;keywords
url;int;word0,word2,word3
url;int;word1,word3
...
And the bagOfWords csv:
id;index;count
0;word0;11
1;word1;14
2;word2;14
3;word3;14
...
Expected output:
referer;category;keywords;keywordIds
url;int;word0,word2,word3;[0,2,3]
url;int;word1,word3;[1,3]
There's definitely something wrong with using i for both for loops: after the inner loop finishes, i holds the last bagOfWords index instead of the current row label, so the result is written to the wrong row. Change that and see if it helps.
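A minimal runnable sketch of that fix, keeping the question's column names but using a tiny made-up dataframe instead of the CSV files — the inner loop simply gets its own variable:

```python
import pandas as pd

# Toy stand-ins for the two CSV files from the question (made-up data).
websitesAlchData = pd.DataFrame(
    {"keywords": ["word0,word2,word3", "word1,word3"]},
    index=["urlA", "urlB"])
bagOfWords = pd.DataFrame(
    {"id": [0, 1, 2, 3], "words": ["word0", "word1", "word2", "word3"]})

a = set(bagOfWords["words"])
websitesAlchData["keywordIds"] = "[]"

for i in websitesAlchData.index:
    keywords = websitesAlchData.loc[i, "keywords"]
    try:
        keywordsSet = set(s.lower() for s in keywords.split(","))
    except AttributeError:  # keywords was NaN rather than a string
        keywordsSet = set()
    existingWords = a & keywordsSet
    lista = []
    for j in bagOfWords.index:  # j, not i -- this was the bug
        if bagOfWords.loc[j, "words"] in existingWords:
            # cast to plain int so str(lista) renders as "[0, 2, 3]"
            lista.append(int(bagOfWords.loc[j, "id"]))
    websitesAlchData.loc[i, "keywordIds"] = str(lista)

print(websitesAlchData["keywordIds"].tolist())  # ['[0, 2, 3]', '[1, 3]']
```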
I'd try something like this. You'll want to profile the performance on the larger dataset.
In [146]: df1
Out[146]:
referer category keywords
0 url int word0,word2,word3
1 url int word1,word3
[2 rows x 3 columns]
In [147]: df2
Out[147]:
id count
index
word0 0 11
word1 1 14
word2 2 14
word3 3 14
[4 rows x 2 columns]
Split the keywords column into a list of words. Generally, storing lists in DataFrames is a bad idea performance-wise, but it's the most straightforward way for now.
In [148]: vals = df1.keywords.str.split(',')
In [149]: vals
Out[149]:
0 [word0, word2, word3]
1 [word1, word3]
Name: keywords, dtype: object
Then apply a lookup from df2 to each element of the lists in vals:
In [151]: ids = vals.apply(lambda x: [df2.loc[y, 'id'] for y in x])
In [152]: ids
Out[152]:
0 [0, 2, 3]
1 [1, 3]
Name: keywords, dtype: object
Finally concat:
In [154]: df = pd.concat([df1, ids], axis=1)
In [155]: df
Out[155]:
referer category keywords keywords
0 url int word0,word2,word3 [0, 2, 3]
1 url int word1,word3 [1, 3]
[2 rows x 4 columns]
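Since the answer suggests profiling on the larger dataset: one common speed-up is to build a plain word-to-id dict once and look words up in it, instead of calling df2.loc inside a Python loop for every word. A sketch under the same toy data (column and variable names as above; word_to_id is my own name):

```python
import pandas as pd

# Same toy frames as in the answer above.
df1 = pd.DataFrame({"referer": ["url", "url"],
                    "category": [1, 2],
                    "keywords": ["word0,word2,word3", "word1,word3"]})
df2 = pd.DataFrame({"id": [0, 1, 2, 3], "count": [11, 14, 14, 14]},
                   index=["word0", "word1", "word2", "word3"])

# Build the lookup once; cast to plain int to avoid numpy scalar types.
word_to_id = {word: int(i) for word, i in df2["id"].items()}

# One pass over df1: split, then map each word through the dict,
# silently skipping words missing from the bag.
df1["keywordIds"] = df1["keywords"].str.split(",").apply(
    lambda words: [word_to_id[w] for w in words if w in word_to_id])

print(df1["keywordIds"].tolist())  # [[0, 2, 3], [1, 3]]
```

Dict lookups are O(1) per word, so this avoids the per-element .loc indexing cost, which dominates when df2 is large.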