
Convert a list of (feature, value) tuples to a NumPy array

Suppose I have word-count data for a set of sentences, where each sentence is one instance.

For example, this is the data for the sentences "I love apple love" and "Oh my god apple apple apple": data = [[("I", 1), ("love", 2), ("apple", 1)], [("Oh", 1), ("my", 1), ("god", 1), ("apple", 3)]]

I want to convert this to a 2-D NumPy array in which each feature is a word and each value is that word's frequency in the sentence. In this case:

sentence id   I   love   apple   Oh   my   god
0             1   2      1       0    0    0
1             0   0      3       1    1    1

>>> import pandas as pd

>>> data = [[("I", 1), ("love", 2), ("apple", 1)],[("Oh", 1), ("my", 1), ("god", 1), ("apple", 3)]]

>>> data
[[('I', 1), ('love', 2), ('apple', 1)], [('Oh', 1), ('my', 1), ('god', 1), ('apple', 3)]]

>>> dfs = []
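>>> for item in data:
...     # one single-row DataFrame per sentence, columns are the words
...     val = dict(item)
...     # label the row with the sentence's distinct words
...     index = [' '.join(val.keys())]
...     df = pd.DataFrame(val, index=index)
...     dfs.append(df)
...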
>>> sent_df = pd.concat(dfs)

>>> sent_df
                   I  love  apple   Oh   my  god
I love apple     1.0   2.0      1  NaN  NaN  NaN
Oh my god apple  NaN   NaN      3  1.0  1.0  1.0

>>> sent_df.index.name = 'sentence'

>>> sent_df = sent_df.reset_index().fillna(0)
>>> sent_df
          sentence    I  love  apple   Oh   my  god
0     I love apple  1.0   2.0      1  0.0  0.0  0.0
1  Oh my god apple  0.0   0.0      3  1.0  1.0  1.0

# if you don't want sentence inside the dataframe
# ===============================================

>>> sent_df = sent_df.drop('sentence', axis=1)

>>> sent_df
     I  love  apple   Oh   my  god
0  1.0   2.0      1  0.0  0.0  0.0
1  0.0   0.0      3  1.0  1.0  1.0

>>> sent_df.index.name = 'sentence_id'

>>> sent_df.reset_index()
   sentence_id    I  love  apple   Oh   my  god
0            0  1.0   2.0      1  0.0  0.0  0.0
1            1  0.0   0.0      3  1.0  1.0  1.0
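
The counts above come out as floats because `pd.concat` introduces NaN for words a sentence does not contain. As a possible shortcut (a sketch, not part of the original answer), the same frame can be built in one call from a list of dicts and cast back to integers. The name `counts` is only illustrative.

```python
import pandas as pd

data = [[("I", 1), ("love", 2), ("apple", 1)],
        [("Oh", 1), ("my", 1), ("god", 1), ("apple", 3)]]

# Each sentence becomes one dict, and each dict becomes one row.
# Words missing from a sentence are NaN at first, so fill them with 0
# and cast back to int, since the NaN step forces a float dtype.
counts = pd.DataFrame([dict(sentence) for sentence in data]).fillna(0).astype(int)
counts.index.name = "sentence_id"

print(counts)
```

This skips the per-sentence DataFrames and the `concat` step, which may matter when there are many sentences.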

# if you want 2-D numpy array (numpy array doesn't preserve column names)
# =======================================================================

>>> sent_df.reset_index().to_numpy()
array([[0., 1., 2., 1., 0., 0., 0.],
       [1., 0., 0., 3., 1., 1., 1.]])
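
If only the array is needed, pandas can be skipped altogether. The following is a minimal sketch under that assumption: it collects the vocabulary in first-seen order and fills a zero matrix directly. The names `vocab`, `matrix`, and `with_ids` are illustrative and do not appear in the original answer.

```python
import numpy as np

data = [[("I", 1), ("love", 2), ("apple", 1)],
        [("Oh", 1), ("my", 1), ("god", 1), ("apple", 3)]]

# Map each word to a column index in first-seen order
# (dicts preserve insertion order in Python 3.7+).
vocab = {}
for sentence in data:
    for word, _ in sentence:
        vocab.setdefault(word, len(vocab))

# One row per sentence, one column per word, integer counts.
matrix = np.zeros((len(data), len(vocab)), dtype=int)
for row, sentence in enumerate(data):
    for word, count in sentence:
        matrix[row, vocab[word]] = count

# Optionally prepend a sentence-id column, as in the question's table.
ids = np.arange(len(data)).reshape(-1, 1)
with_ids = np.hstack([ids, matrix])

print(list(vocab))
print(with_ids)
```

Keeping `vocab` around is useful because the array itself carries no column names; `list(vocab)` gives the word for each column.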


 