
Converting a 3-column dataframe to a matrix with columns defined by a range

I have a 3-column dataframe. Let's say my columns are "doc", "word", and "count", and each row shows the number of occurrences of a word in a document.

+-----+------+-------+
| doc | word | count |
+-----+------+-------+
|   0 |    0 |    10 |
|   0 |    7 |     2 |
|   0 |    4 |     5 |
|   1 |    2 |     5 |
+-----+------+-------+

I want to convert this dataframe to a matrix having rows as documents and columns as words so I do the following:

matrix = pd.pivot_table(my_df, index="doc", columns="word", values="count", fill_value=0)

What I get is a matrix whose columns are [0,2,4,7]. What I want instead is a fixed range of columns, e.g. range(10): [0,1,2,3,4,5,6,7,8,9]. Some of those columns will then contain only 0s, and that is what I want.

How can I achieve this?

You are looking for reindex:

matrix = (
    pd.pivot_table(my_df, index="doc", columns="word",
                   values="count", fill_value=0)
      # add the missing word columns 0..9 and fill them with 0
      .reindex(range(10), axis=1, fill_value=0)
)

Output:

word   0  1  2  3  4  5  6  7  8  9
doc                                
0     10  0  0  0  5  0  0  2  0  0
1      0  0  5  0  0  0  0  0  0  0
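
For anyone who wants to run this end to end, here is a minimal self-contained sketch using the sample data from the question. The helper name pivot_with_columns and its n_words parameter are made up for illustration; reindex(columns=...) is just another spelling of the axis=1 form above.

```python
import pandas as pd

# Sample data copied from the question.
my_df = pd.DataFrame({
    "doc":   [0, 0, 0, 1],
    "word":  [0, 7, 4, 2],
    "count": [10, 2, 5, 5],
})

def pivot_with_columns(frame, n_words):
    """Hypothetical helper: pivot to doc x word, then force columns 0..n_words-1."""
    wide = pd.pivot_table(frame, index="doc", columns="word",
                          values="count", fill_value=0)
    # reindex keeps existing word columns and adds the missing ones as 0
    return wide.reindex(columns=range(n_words), fill_value=0)

matrix = pivot_with_columns(my_df, 10)
print(matrix)
```

Note that reindex would also drop any word index outside the requested range, so n_words should be at least the largest word index plus one.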

IIUC, you want to create a sparse document-by-word matrix. You could do:

import pandas as pd
from scipy.sparse import csr_matrix

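# unpack the three columns; this relies on the column order doc, word, count
rows, cols, data = zip(*my_df.to_numpy())
# the shape is inferred from the largest doc and word indices
mat = csr_matrix((data, (rows, cols)), shape=(max(rows) + 1, max(cols) + 1))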
res = pd.DataFrame(data=mat.toarray())
print(res)

Output

    0  1  2  3  4  5  6  7
0  10  0  0  0  5  0  0  2
1   0  0  5  0  0  0  0  0

With this approach the shape is derived from the data: the number of columns is the largest word index plus one (8 here).

UPDATE

If you want exactly 10 columns, pass that number in shape instead:

rows, cols, data = zip(*my_df.to_numpy())
# fix the number of columns at 10 instead of inferring it from the data
mat = csr_matrix((data, (rows, cols)), shape=(max(rows) + 1, 10))
res = pd.DataFrame(data=mat.toarray())
print(res)

Output

    0  1  2  3  4  5  6  7  8  9
0  10  0  0  0  5  0  0  2  0  0
1   0  0  5  0  0  0  0  0  0  0
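
A variant of the same idea, sketched with the question's sample data, that selects the columns by name rather than relying on their order. One thing to be aware of: csr_matrix sums duplicate (row, col) entries, whereas pivot_table averages them by default, so the two approaches only agree when each (doc, word) pair appears once. The n_words value is an assumption taken from the question's range(10).

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Sample data copied from the question.
my_df = pd.DataFrame({
    "doc":   [0, 0, 0, 1],
    "word":  [0, 7, 4, 2],
    "count": [10, 2, 5, 5],
})

n_docs = my_df["doc"].max() + 1
n_words = 10  # assumed vocabulary size, from range(10) in the question

# Build the matrix from (data, (row_indices, col_indices)).
mat = csr_matrix(
    (my_df["count"].to_numpy(),
     (my_df["doc"].to_numpy(), my_df["word"].to_numpy())),
    shape=(n_docs, n_words),
)

# Keep the result sparse inside pandas instead of densifying it.
sparse_df = pd.DataFrame.sparse.from_spmatrix(mat)

# Dense NumPy array, only sensible when the matrix is small.
dense = mat.toarray()
print(dense)
```

If the real data is large, staying with mat or sparse_df avoids allocating the full docs x words array.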

Simply add the columns that do not exist and fill with 0:

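df = pd.pivot_table(my_df, index="doc", columns="word", values="count", fill_value=0)
# add every word column in 0..9 that the pivot did not produce
for c in range(10):
    if c not in df.columns:
        df[c] = 0
# select the columns in order 0..9 and convert to a NumPy array
matrix = df[list(range(10))].to_numpy()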
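
As a quick sanity check, the sketch below runs the loop on the question's sample data and compares the result with the reindex approach. The variable names (wide, looped, loop_matrix, reindexed) are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
my_df = pd.DataFrame({
    "doc":   [0, 0, 0, 1],
    "word":  [0, 7, 4, 2],
    "count": [10, 2, 5, 5],
})

wide = pd.pivot_table(my_df, index="doc", columns="word",
                      values="count", fill_value=0)

# Loop approach: append the missing columns, then put them in order.
looped = wide.copy()
for c in range(10):
    if c not in looped.columns:
        looped[c] = 0
loop_matrix = looped[list(range(10))].to_numpy()

# reindex approach, for comparison.
reindexed = wide.reindex(columns=range(10), fill_value=0).to_numpy()

same = np.array_equal(loop_matrix, reindexed)
print(same)
```

The final column selection matters: the loop appends new columns at the end, so without it the columns would be in the order 0, 2, 4, 7, 1, 3, ...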
