
Create a matrix of words occurring in a Pandas data frame with text strings

I have a Pandas data frame with a column of text data. I want to compare each row of this text data with a list of words that I'm interested in. The comparison should result in a matrix that shows the occurrence of the word (0 or 1) in the text of that row of data.

Input data frame:

text
That bear talks
The stone rocks
Tea is boiling
The bear drinks tea

Input list of words:

[bear, talks, tea]

Result:

text                 bear  talks  tea
That bear talks      1     1      0
The stone rocks      0     0      0
Tea is boiling       0     0      1
The bear drinks tea  1     0      1

I found some information on sklearn.feature_extraction.text.HashingVectorizer, but from what I understand it just takes the whole data frame, breaks it down into component words, and counts those. What I want is to do this for a very limited list of words.

With sklearn I did the following:

from sklearn.feature_extraction.text import HashingVectorizer

countvec = HashingVectorizer()

countvec.fit_transform(resultNLdf2.text)

But that gives me the following:

<73319x1048576 sparse matrix of type '<class 'numpy.float64'>'
    with 1105683 stored elements in Compressed Sparse Row format>

Which seems too big to work with unless I can select just the words I want from the sparse matrix, but I don't know how to do that.

I'm sorry if I used the wrong words to explain this problem; I'm not sure if you would call this a matrix, for example.

edit

The true data I'm working on is rather large: 1,264,555 rows of tweet strings. At least I've learned not to oversimplify a problem :-p. This makes some of the given solutions (thanks for trying to help!!) fail with memory issues or run extremely slowly. This was also a reason I was looking at sklearn.

with:

from sklearn.feature_extraction.text import CountVectorizer

words = ['bear', 'talks', 'tea']

countvec = CountVectorizer(vocabulary=words)

countvec.fit_transform(resultNLdf2.text)

you can actually limit the words to look at by passing a simple list as the vocabulary. But that still leaves me with a result in a format I don't know what to do with, as described above.
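One way to keep working with the sparse result without densifying millions of rows is to wrap it in a pandas sparse DataFrame via `pd.DataFrame.sparse.from_spmatrix`. A minimal sketch, using a small frame standing in for `resultNLdf2`:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Small example frame standing in for the real resultNLdf2
df = pd.DataFrame({'text': ["That bear talks", "The stone rocks",
                            "Tea is boiling", "The bear drinks tea"]})

words = ['bear', 'talks', 'tea']
countvec = CountVectorizer(vocabulary=words)
mat = countvec.fit_transform(df.text)  # scipy CSR sparse matrix

# Wrap the sparse matrix without converting it to a dense array,
# which keeps memory usage low on very large inputs
sparse_df = pd.DataFrame.sparse.from_spmatrix(mat, index=df.text, columns=words)
print(sparse_df)
```

The columns stay sparse-typed, so you can slice and filter as usual while only the nonzero entries are stored.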

You can use Series.str.get_dummies:

>>> print(df.join(df.text.str.get_dummies(' ').loc[:, ['bear', 'talks', 'tea']]))
                 text  bear  talks  tea
0      That bear talks     1      1    0
1      The stone rocks     0      0    0
2       Tea is boiling     0      0    0
3  The bear drinks tea     1      0    1
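Note that `get_dummies(' ')` is case-sensitive, which is why "Tea is boiling" gets 0 for `tea` here, unlike CountVectorizer, which lowercases by default. If the desired output should count "Tea" as "tea", one fix is to lowercase before splitting, sketched below:

```python
import pandas as pd

df = pd.DataFrame({'text': ["That bear talks", "The stone rocks",
                            "Tea is boiling", "The bear drinks tea"]})

# Lowercase first so 'Tea' matches the lowercase vocabulary word 'tea'
dummies = df.text.str.lower().str.get_dummies(' ')[['bear', 'talks', 'tea']]
result = df.join(dummies)
print(result)
```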

After testing with the solutions given to my initial question, I wanted to stick with sklearn because it seems extremely fast and seems to have no problems with the considerable amount of data I'm working with. To stick with the 'bear, talks, tea' example here is the solution I'm working with now:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(["That bear talks", "The stone rocks", "Tea is boiling", "The bear drinks tea"], columns=['text'])

words = ['bear', 'talks', 'tea']

countvec = CountVectorizer(vocabulary=words)

dfFinal = pd.DataFrame(countvec.fit_transform(df.text).toarray(), index=df.text, columns=countvec.get_feature_names_out())

(In newer scikit-learn versions, `get_feature_names_out` replaces `get_feature_names`, which was removed in scikit-learn 1.2.)

Of course I'm still interested to hear why this or other solutions are good or about things I should take into consideration.
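One thing to consider: CountVectorizer reports raw counts, so a tweet containing "tea" twice would get a 2, not the 0/1 occurrence matrix in the desired output. Passing `binary=True` (a real CountVectorizer parameter) caps every count at one. A sketch of the same solution with that flag:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(["That bear talks", "The stone rocks",
                   "Tea is boiling", "The bear drinks tea"], columns=['text'])
words = ['bear', 'talks', 'tea']

# binary=True reports presence (0/1) instead of raw occurrence counts
countvec = CountVectorizer(vocabulary=words, binary=True)
dfFinal = pd.DataFrame(countvec.fit_transform(df.text).toarray(),
                       index=df.text, columns=words)
print(dfFinal)
```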

Since you have a limited list, you can iterate over the words in the list and do this for each one:

df['bear'] = df['text'].str.contains('bear').astype(int)  # astype(int) turns True/False into 1/0
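The single-column line above generalizes to the whole word list with a short loop. A sketch, where the `\b` word boundaries (an assumption beyond the original one-liner) avoid matching 'tea' inside a longer word such as 'teapot', and `case=False` matches 'Tea' as well:

```python
import pandas as pd

df = pd.DataFrame({'text': ["That bear talks", "The stone rocks",
                            "Tea is boiling", "The bear drinks tea"]})
words = ['bear', 'talks', 'tea']

for word in words:
    # \b anchors the match to whole words; case=False ignores capitalization
    df[word] = df['text'].str.contains(r'\b' + word + r'\b',
                                       case=False, regex=True).astype(int)
print(df)
```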

You can use Python's string count for this.

import pandas as pd

x= ["That bear talks","The stone rocks","Tea is boiling","The bear drinks tea"]
words = ['bear', 'talks', 'tea']

out=pd.DataFrame(index=x,columns=words)

for i in range(out.shape[0]):
    for word in words:
        # .loc replaces the .ix indexer, which was removed in pandas 1.0
        out.loc[out.index[i], word] = out.index[i].count(word)

print(out)

                    bear talks tea
That bear talks        1     1   0
The stone rocks        0     0   0
Tea is boiling         0     0   0
The bear drinks tea    1     0   1
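The double loop above can also be written with the vectorized `Series.str.count`, which avoids Python-level iteration. A sketch producing the same case-sensitive counts as the output above:

```python
import pandas as pd

texts = ["That bear talks", "The stone rocks",
         "Tea is boiling", "The bear drinks tea"]
words = ['bear', 'talks', 'tea']

# One vectorized str.count pass per word instead of a cell-by-cell loop
out = pd.DataFrame({w: pd.Series(texts).str.count(w).values for w in words},
                   index=texts)
print(out)
```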
