Bag of words with a JSON array

Question

I'm trying to follow this tutorial to build a custom bag of words.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]
vectorizer = CountVectorizer()
print( vectorizer.fit_transform(corpus).todense() )
print( vectorizer.vocabulary_ )

This script prints:

[[1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 1]
 [0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0]]
{u'all': 0, u'sunshine': 20, u'some': 18, u'down': 3, u'reason': 13, u'looks': 9, u'in': 7, u'outer': 12, u'sits': 17, u'row': 14, u'toy': 24, u'from': 5, u'like': 8, u'for': 4, u'space': 19, u'this': 22, u'sit': 16, u'when': 25, u'cat': 1, u'to': 23, u'cats': 2, u'she': 15, u'loves': 10, u'furby': 6, u'the': 21, u'my': 11}

So here's my problem: I have a JSON file with this data structure:

[
    {
        "id": "1",
        "class": "positive",
        "tags": [
            "tag1",
            "tag2"
        ]
    },
    {
        "id": "2",
        "class": "negative",
        "tags": [
            "tag1",
            "tag3"
        ]
    }
]

I'm trying to vectorize the tags array, without any success.

I've tried something like this:

data = json.load(open('data.json'));
print( vectorizer.fit_transform(data).todense() )

and also:

for element in data:
    print( vectorizer.fit_transform(element).todense() )
    # or
    print( vectorizer.fit_transform(element['tags']).todense() )

Neither of these works. Any ideas?

Answer 1

1. Use pandas to read the JSON file into a `DataFrame`

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_json('data.json', orient='values')
print(df)

This is what your DataFrame should look like:

Out[]:       
      class  id          tags
0  positive   1  [tag1, tag2]
1  negative   2  [tag1, tag3]
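To try this step without a file on disk, a minimal sketch can feed the same JSON to pandas through an in-memory buffer. The `raw` string below is a stand-in for `data.json`, and `orient='records'` is my choice here because that orientation describes a list of objects; neither comes from the original answer.

```python
import io

import pandas as pd

# Stand-in for the contents of data.json (same structure as in the question).
raw = '''
[
    {"id": "1", "class": "positive", "tags": ["tag1", "tag2"]},
    {"id": "2", "class": "negative", "tags": ["tag1", "tag3"]}
]
'''

# Wrapping the text in StringIO lets read_json treat it like an open file.
df = pd.read_json(io.StringIO(raw), orient='records')

# Each cell of the tags column is still a Python list at this point.
print(df['tags'].tolist())
```

The column order in the printed frame can differ between pandas versions, so it is safer to select columns by name than by position.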

2. Convert the tags column from `list` to `str`

df['tags'] = df['tags'].apply(lambda x: ' '.join(x))
print(df)

This converts the tags into space-separated strings:

Out[]:       
      class  id       tags
0  positive   1  tag1 tag2
1  negative   2  tag1 tag3
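If you would rather not pull in pandas, the same read-and-join idea can be sketched with the standard `json` module. This assumes the JSON has the structure shown in the question; the inline `raw` string and the names `records` and `documents` are illustrative only.

```python
import json

# Stand-in for the contents of data.json (same structure as in the question).
raw = '''
[
    {"id": "1", "class": "positive", "tags": ["tag1", "tag2"]},
    {"id": "2", "class": "negative", "tags": ["tag1", "tag3"]}
]
'''

records = json.loads(raw)

# One space-separated string per record: the iterable-of-strings shape
# that CountVectorizer.fit_transform expects.
documents = [' '.join(record['tags']) for record in records]

print(documents)
```

The resulting `documents` list can be passed to `fit_transform` in place of the `tags` column.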

3. Pass the `tags` column (a pandas `Series`) to `CountVectorizer`

vectorizer = CountVectorizer()
print(vectorizer.fit_transform(df['tags']).todense())
print(vectorizer.vocabulary_)

This produces the output you want:

Out[]:       
[[1 1 0]
 [1 0 1]]
{'tag1': 0, 'tag2': 1, 'tag3': 2}
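One caveat with joining tags into a string: the default tokenizer splits on whitespace and punctuation and drops single-character tokens, so a tag such as `machine learning` or `c` would not survive intact. A possible alternative, not part of the original answer, is to hand `CountVectorizer` a callable `analyzer` so that each list of tags is used as-is. A minimal sketch, with `tag_lists` as a hypothetical stand-in for the tags pulled from the JSON:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the "tags" arrays read from the JSON file.
tag_lists = [['tag1', 'tag2'], ['tag1', 'tag3']]

# A callable analyzer receives each raw document (here, a list of tags)
# and returns its features, so every tag is kept as a single token.
vectorizer = CountVectorizer(analyzer=lambda tags: tags)
matrix = vectorizer.fit_transform(tag_lists)

print(matrix.toarray())
print(vectorizer.vocabulary_)
```

For this input the matrix and vocabulary should match the ones shown above. Note that a callable analyzer bypasses lowercasing and stop-word handling, so tags are compared exactly as written.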

Answer 1: 1 accepted, 2018-02-15 11:42:48