
How to print a specific key from a string of dictionaries.

I want to get one-hot encoded data with one row per list, rather than one row per element, when using sklearn's transform.

Code:

from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain


x = [['1234', '5678', '910', 'baba'], ['8', '1'], 
     [], ['9', '3'], [], ['7', '6'], [], []]
vector = CountVectorizer(token_pattern=r".+",  min_df=1, max_df=1.0, lowercase=False,
                 max_features=None)
vec = [xxx for xx in x for xxx in xx]
vector.fit(chain.from_iterable([vec]))
print(vector.get_feature_names())
new = []
for xx in x:
    new.append(vector.transform(xx))
for x in new:
    for xx in x.toarray():
        print(xx)

Current output:

['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 1 0 0 0]
[1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]

My expected output:

['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
[0 1 0 1 0 0 0 0 1 1]
[1 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 1 0 0]
[0 0 0 0 1 1 0 0 0 0]

Is there a way to do this with my code? I have tried changing it many times, but with no luck. My brain has stopped processing anything at this point.

You shouldn't need explicit for loops for this task. You can use MultiLabelBinarizer instead, also from the sklearn library. It encodes an empty list as an all-zero row, so filter the empty lists out first to match your expected output.

Here's an example with Pandas:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

L = [['1234', '5678', '910', 'baba'], ['8', '1'], 
     [], ['9', '3'], [], ['7', '6'], [], []]

s = pd.Series(list(filter(None, L)))

mlb = MultiLabelBinarizer()

res = pd.DataFrame(mlb.fit_transform(s),
                   columns=mlb.classes_,
                   index=s.index)

print(res)

   1  1234  3  5678  6  7  8  9  910  baba
0  0     1  0     1  0  0  0  0    1     1
1  1     0  0     0  0  0  1  0    0     0
2  0     0  1     0  0  0  0  1    0     0
3  0     0  0     0  1  1  0  0    0     0
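On the point about empty lists, here is a small sketch that uses only scikit-learn (no Pandas). The toy input and the names `rows` and `encoded` are made up for illustration. It suggests that MultiLabelBinarizer accepts an empty list and encodes it as an all-zero row, so the filtering step mainly serves to match the expected output.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy input: the middle row is deliberately empty.
rows = [['8', '1'], [], ['9', '3']]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(rows)  # dense 0/1 array, one row per input list

print(list(mlb.classes_))  # column labels, sorted
print(encoded)
```

If the all-zero rows are acceptable, the filter can be dropped and the row index then lines up with the original list positions.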

You can try using set intersection together with np.isin.

The set intersection gives the elements that a list has in common with the mask, and np.isin turns that into a boolean array over the mask.

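import numpy as np

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]
mask = ['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
for xx in x:
    if xx:  # skip empty lists
        print(np.isin(mask,np.array(list(set(xx).intersection(set(mask))))).astype(int))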

Out:

[0 1 0 1 0 0 0 0 1 1]
[1 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 1 0 0]
[0 0 0 0 1 1 0 0 0 0]
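The same mask can also be built without NumPy. The following is a minimal sketch, assuming the order of `mask` is the desired column order; the helper name `multi_hot` is hypothetical and not part of any library.

```python
def multi_hot(tokens, vocabulary):
    """Return a 0/1 list marking which vocabulary entries occur in tokens."""
    present = set(tokens)  # set lookup keeps each membership test cheap
    return [int(v in present) for v in vocabulary]


x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]
mask = ['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']

# Skip empty lists so that only non-empty rows are encoded.
rows = [multi_hot(xx, mask) for xx in x if xx]
for row in rows:
    print(row)
```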

Flattening the lists

# if you have nested lists of elements, you can flatten them with
sum(x,[])

Out:

['1234', '5678', '910', 'baba', '8', '1', '9', '3', '7', '6']
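For long inputs, `sum(x, [])` builds a new list at every step, so its cost grows quadratically with the number of elements. A sketch of a linear-time alternative using `itertools.chain.from_iterable` from the standard library (the name `flat` is illustrative):

```python
from itertools import chain

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# chain.from_iterable walks each inner list once; empty lists contribute nothing.
flat = list(chain.from_iterable(x))
print(flat)
```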

For future readers:

I eventually solved it in a super naive way.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain

x = [['1234', '5678', '910', 'baba'], ['8', '1'], 
     [], ['9', '3'], [], ['7', '6'], [], []]
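# r"\S+" treats every whitespace-separated token as a feature,
# including tokens without digits such as 'baba'.
vector = CountVectorizer(token_pattern=r"\S+",  min_df=1, max_df=1.0, lowercase=False,
                 max_features=None)
vec = [xxx for xx in x for xxx in xx]
vector.fit(chain.from_iterable([vec]))
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names().
print(vector.get_feature_names())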
new = []
for xx in x:
    new.append(" ".join(xx))

neww = vector.transform(new)

print(neww.toarray())
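A possible variant of this approach, offered as a sketch rather than a definitive fix: CountVectorizer accepts a callable `analyzer`, which lets each inner list be passed as an already tokenized document, so no joining or regular expression is needed. `binary=True` caps every count at 1. The names `docs` and `onehot` are illustrative, and dropping the empty lists is an assumption made to match the expected output in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# Assumption: empty lists are dropped, as in the expected output.
docs = [xx for xx in x if xx]

# The identity analyzer returns each list unchanged, so its items are the tokens.
vector = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
onehot = vector.fit_transform(docs).toarray()

print(sorted(vector.vocabulary_))  # feature names in column order
print(onehot)
```

Because the analyzer is a lambda, the fitted vectorizer cannot be pickled; a named module-level function would be needed if the model has to be saved.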
