
Reorganizing output data in python dictionary

I need to create sparse vectors and I would like to try it using Python. I already have all of the data needed to create the vectors, so my task is basically reformatting/rearranging the information that I have.

The input file I have is a 5 GB file with three tab-separated columns, for example:

abandonment-n   about+n-the+v-know-v    1
abandonment-n   above+ns-j+vn-pass-continue-v   1
abandonment-n   after+n-the+n-a-j-stop-n    1
abandonment-n   as+n-the+ns-j-aid-n 1
cake-n  against+n-the+vg-restv  1
cake-n  as+n-a+vd-require-v 1
cake-n  as+n-a-j+vg-up-use-v    1
cake-n  as+n-the+ns-j-aid-n 2
dog-n   as+n-a-j+vg-up-use-v    7
dog-n   as+n-the+ns-j-aid-n 5

My desired output is the following:

2   7
1   1   1   1
1   1   1   2
7   5

where the first line specifies the dimensions (essentially unique rows // cols) and the second line begins the actual matrix, in sparse format.

I think the most effective way to do this would be in Python. However, as I have already calculated the corresponding weights of the data, I do not think that the classes in numpy, or those for vectors such as found here and here, are necessary in this case. So, does anyone have any insight into how I can begin to tackle this rearranging problem in Python?

The first thing that I have thought to do is open the file and split the elements into a dictionary, like this:

mydict = {}
with open("sample_outputDM_ALL_COOC", 'r') as infile_A:
    for line in infile_A:
        lemma, feat, weight = line.split()
        # NB: this overwrites any earlier weight stored for the same
        # lemma, so only the last weight per lemma is kept
        mydict[lemma] = float(weight)
        print lemma + "\t" + weight

I have been working very hard on solving this problem and I still have not been able to. What I have done until now is read all of the variables into a dictionary, and I am able to print each individual lemma and each individual weight per row.

However, I need to have all of the weights corresponding to a given lemma in the same row. I have tried groupby, but I am not sure that it is the best option for this case. I believe the solution involves a for loop combined with an if/else, but I can't figure out how to link the two.
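One caveat with itertools.groupby (if that is the groupby meant here): it only merges *adjacent* rows with the same key, so it works on this file only because the input already appears sorted by lemma. A minimal Python 3 sketch on a few in-memory lines standing in for the file:

```python
from itertools import groupby

lines = [
    "abandonment-n\tabout+n-the+v-know-v\t1",
    "abandonment-n\tas+n-the+ns-j-aid-n\t1",
    "cake-n\tas+n-the+ns-j-aid-n\t2",
    "dog-n\tas+n-a-j+vg-up-use-v\t7",
    "dog-n\tas+n-the+ns-j-aid-n\t5",
]

# group consecutive rows that share the first (lemma) column
rows = (line.split() for line in lines)
grouped = {}
for lemma, group in groupby(rows, key=lambda cols: cols[0]):
    grouped[lemma] = [cols[2] for cols in group]

for weights in grouped.values():
    print("\t".join(weights))
```

If the real file were not sorted by lemma, the same groupby call would emit the same lemma several times, once per run of consecutive rows.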

Thus, the method should be along the lines of: for every target, print the freq of each slotfiller in one row, one row per unique target.
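A sketch of that method in Python 3, assuming the three-column format shown above and reading the dimensions line as (number of unique targets, number of unique slotfillers); the sample list is a stand-in for the real file handle:

```python
def group_weights(lines):
    """Collect column-3 weights per column-1 lemma, in first-seen order."""
    by_lemma = {}
    features = set()
    for line in lines:
        lemma, feat, weight = line.split()
        by_lemma.setdefault(lemma, []).append(weight)
        features.add(feat)
    return by_lemma, features

sample = [
    "cake-n\tas+n-a-j+vg-up-use-v\t1",
    "cake-n\tas+n-the+ns-j-aid-n\t2",
    "dog-n\tas+n-a-j+vg-up-use-v\t7",
    "dog-n\tas+n-the+ns-j-aid-n\t5",
]
by_lemma, features = group_weights(sample)
print(len(by_lemma), len(features), sep="\t")   # dimensions line
for weights in by_lemma.values():               # one row per unique target
    print("\t".join(weights))
```

Note that this holds one list per lemma in memory at once; for the full 5 GB file you would want to flush each lemma's row as soon as it is complete rather than keep the whole dictionary.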

Is this for homework? If not, check out the tools available in scipy.sparse, or a mixture of scikits.learn and Python NLTK (e.g. this example).
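The scipy.sparse constructors expect integer row/column indices, so the string lemmas and features would first need an index mapping. A sketch of just that mapping step (the resulting triples could then be handed to something like scipy.sparse.coo_matrix); the helper name and sample triples are my own illustration, not from the question:

```python
def to_coo(triples):
    """Map (lemma, feature, weight) string triples to (row, col, value)
    integer/float triples, assigning indices in first-seen order."""
    row_ids, col_ids = {}, {}
    coords = []
    for lemma, feat, weight in triples:
        r = row_ids.setdefault(lemma, len(row_ids))
        c = col_ids.setdefault(feat, len(col_ids))
        coords.append((r, c, float(weight)))
    return coords, len(row_ids), len(col_ids)

triples = [
    ("cake-n", "as+n-the+ns-j-aid-n", "2"),
    ("dog-n", "as+n-a-j+vg-up-use-v", "7"),
    ("dog-n", "as+n-the+ns-j-aid-n", "5"),
]
coords, n_rows, n_cols = to_coo(triples)
```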

Added: Based on the comment and on re-reading the question, I can also imagine using Pandas.DataFrame to accomplish this, but I am not sure whether it will be satisfactory given the size of the data. One option would be to load the data in multiple chunks, since it seems to be parallelizable on the unique items of the first column. (See my comment below for more on that.)

def sparse_vec(df):
    return (df['Col3'].values[None,:],)

# Obviously these would be chunk-specific, and you'd need to do
# another pass to get the global sum of unique ids from Col1 and the
# global max of the number of unique rows-per-id.
n_cols = len(df.Col2.unique())
n_rows = len(df.Col1.unique())


vecs = df.groupby("Col1").apply(sparse_vec)
print vecs

Using this on the sample data you gave, in IPython, I see this:

In [17]: data = """
   ....: abandonment-n   about+n-the+v-know-v    1
   ....: abandonment-n   above+ns-j+vn-pass-continue-v   1
   ....: abandonment-n   after+n-the+n-a-j-stop-n    1
   ....: abandonment-n   as+n-the+ns-j-aid-n 1
   ....: cake-n  against+n-the+vg-restv  1
   ....: cake-n  as+n-a+vd-require-v 1
   ....: cake-n  as+n-a-j+vg-up-use-v    1
   ....: cake-n  as+n-the+ns-j-aid-n 2
   ....: dog-n   as+n-a-j+vg-up-use-v    7
   ....: dog-n   as+n-the+ns-j-aid-n 5"""

In [18]: data
Out[18]: '\nabandonment-n   about+n-the+v-know-v    1\nabandonment-n   above+ns-j+vn-pass-continue-v   1\nabandonment-n   after+n-the+n-a-j-stop-n    1\nabandonment-n   as+n-the+ns-j-aid-n 1\ncake-n  against+n-the+vg-restv  1\ncake-n  as+n-a+vd-require-v 1\ncake-n  as+n-a-j+vg-up-use-v    1\ncake-n  as+n-the+ns-j-aid-n 2\ndog-n   as+n-a-j+vg-up-use-v    7\ndog-n   as+n-the+ns-j-aid-n 5'

In [19]: data.split("\n")
Out[19]:
['',
 'abandonment-n   about+n-the+v-know-v    1',
 'abandonment-n   above+ns-j+vn-pass-continue-v   1',
 'abandonment-n   after+n-the+n-a-j-stop-n    1',
 'abandonment-n   as+n-the+ns-j-aid-n 1',
 'cake-n  against+n-the+vg-restv  1',
 'cake-n  as+n-a+vd-require-v 1',
 'cake-n  as+n-a-j+vg-up-use-v    1',
 'cake-n  as+n-the+ns-j-aid-n 2',
 'dog-n   as+n-a-j+vg-up-use-v    7',
 'dog-n   as+n-the+ns-j-aid-n 5']

In [20]: data_lines = [x for x in data.split("\n") if x]

In [21]: data_lines
Out[21]:
['abandonment-n   about+n-the+v-know-v    1',
 'abandonment-n   above+ns-j+vn-pass-continue-v   1',
 'abandonment-n   after+n-the+n-a-j-stop-n    1',
 'abandonment-n   as+n-the+ns-j-aid-n 1',
 'cake-n  against+n-the+vg-restv  1',
 'cake-n  as+n-a+vd-require-v 1',
 'cake-n  as+n-a-j+vg-up-use-v    1',
 'cake-n  as+n-the+ns-j-aid-n 2',
 'dog-n   as+n-a-j+vg-up-use-v    7',
 'dog-n   as+n-the+ns-j-aid-n 5']

In [22]: split_lines = [x.split() for x in data_lines]

In [23]: split_lines
Out[23]:
[['abandonment-n', 'about+n-the+v-know-v', '1'],
 ['abandonment-n', 'above+ns-j+vn-pass-continue-v', '1'],
 ['abandonment-n', 'after+n-the+n-a-j-stop-n', '1'],
 ['abandonment-n', 'as+n-the+ns-j-aid-n', '1'],
 ['cake-n', 'against+n-the+vg-restv', '1'],
 ['cake-n', 'as+n-a+vd-require-v', '1'],
 ['cake-n', 'as+n-a-j+vg-up-use-v', '1'],
 ['cake-n', 'as+n-the+ns-j-aid-n', '2'],
 ['dog-n', 'as+n-a-j+vg-up-use-v', '7'],
 ['dog-n', 'as+n-the+ns-j-aid-n', '5']]

In [24]: df = pandas.DataFrame(split_lines, columns=["Col1", "Col2", "Col3"])

In [25]: df
Out[25]:
            Col1                           Col2 Col3
0  abandonment-n           about+n-the+v-know-v    1
1  abandonment-n  above+ns-j+vn-pass-continue-v    1
2  abandonment-n       after+n-the+n-a-j-stop-n    1
3  abandonment-n            as+n-the+ns-j-aid-n    1
4         cake-n         against+n-the+vg-restv    1
5         cake-n            as+n-a+vd-require-v    1
6         cake-n           as+n-a-j+vg-up-use-v    1
7         cake-n            as+n-the+ns-j-aid-n    2
8          dog-n           as+n-a-j+vg-up-use-v    7
9          dog-n            as+n-the+ns-j-aid-n    5

In [26]: df.groupby("Col1").apply(lambda x: (x.Col3.values[None,:],))
Out[26]:
Col1
abandonment-n    (array([[1, 1, 1, 1]], dtype=object),)
cake-n           (array([[1, 1, 1, 2]], dtype=object),)
dog-n                  (array([[7, 5]], dtype=object),)

In [27]: n_rows = len(df.Col1.unique())

In [28]: n_cols = len(df.Col2.unique())

In [29]: n_rows, n_cols
Out[29]: (3, 7)
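Since the 5 GB file appears sorted on the first column, the chunking mentioned above can also be pushed all the way to a single streaming pass with constant memory: emit each lemma's weights as soon as the key changes. A sketch under that sorted-input assumption (io.StringIO stands in for the real file object):

```python
import io

def stream_rows(fileobj):
    """Yield (lemma, [weights]) for each run of consecutive identical
    lemmas; assumes the file is sorted by the first column."""
    current, weights = None, []
    for line in fileobj:
        lemma, feat, weight = line.split()
        if lemma != current:
            if current is not None:
                yield current, weights
            current, weights = lemma, []
        weights.append(weight)
    if current is not None:
        yield current, weights

sample = io.StringIO(
    "dog-n\tas+n-a-j+vg-up-use-v\t7\n"
    "dog-n\tas+n-the+ns-j-aid-n\t5\n"
)
rows = list(stream_rows(sample))
```

Because each lemma's run is independent, the same shape also gives you the natural split points for parallel chunks.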
