
How to reconstruct sentences from these tokens

I have a dataset like this. The first column is the word and the second column is the tag.

 Pretty O bad O storm O here O last O evening O. O From O Green O Newsfeed O: O AHFA B-group extends O deadline O for O Sage O Award O to O Nov O. O

I want to reconstruct the sentences, so that the output looks like this:

[[('Pretty', 'O'), ('bad', 'O'), ('storm', 'O'), ('here', 'O'), ('last', 'O'), ('evening', 'O'), ('.', 'B-geo')], [('From', 'O'), ('Green', 'O'), ('Newsfeed', 'O'), ('storm:', 'O'), ('AHFA', 'B-group'), ('extends', 'O'), ('deadline', 'O'), ('for', 'O'), ('Sage', 'O'), ('Award', 'B-geo')], [('to', 'O'), ('Nov', 'O'), ('.', 'O')]]

Can someone help me build the sentences from this?

If you have:

import pandas as pd
a = pd.DataFrame([('Pretty', 'O'), ('bad', 'O'), ('storm','O'), ('here', 'O'), ('last', 'O'), ('evening', 'O'), ('.', 'B-geo')])

then to get: [('Pretty', 'O'), ('bad', 'O'), ('storm','O'), ('here', 'O'), ('last', 'O'), ('evening', 'O'), ('.', 'B-geo')]

You can do:

[tuple(u) for u in a.values.tolist()]

You can then do this for each of your dataframes and collect the resulting lists of tuples into one list.
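A minimal sketch of that step, assuming each sentence already sits in its own two-column dataframe. The names `frames` and `rows_to_tuples` are made up for illustration and the sample rows are shortened:

```python
import pandas as pd

def rows_to_tuples(df):
    # One (word, tag) tuple per row, in row order
    return [tuple(row) for row in df.values.tolist()]

# Hypothetical input: one dataframe per sentence
frames = [
    pd.DataFrame([('Pretty', 'O'), ('bad', 'O'), ('storm', 'O')]),
    pd.DataFrame([('The', 'O'), ('World', 'O')]),
]

# One list of tuples per sentence, collected into an outer list
sentences = [rows_to_tuples(df) for df in frames]
print(sentences)
```

The outer list keeps one entry per dataframe, which matches the nested shape asked for in the question.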

If you have all your sentences in one dataframe like this:

a = pd.DataFrame([
('Pretty', 'O'), 
('bad', 'O'), 
('storm','O'), 
('here', 'O'), 
('last', 'O'), 
('evening', 'O'), 
('.', 'B-geo'), 
(' ',''),
('The', 'O'),
('World', 'O'),
('is', 'O'),
('...','N-geo')
])

You can find the index of the " " (single space) separator row and split your dataset like this:

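index_list = a.index[a[0] == " "].tolist()  # positions of the separator rows
df1 = a.iloc[:index_list[0], :]             # rows before the first separator
df2 = a.iloc[index_list[0] + 1:, :]         # rows after it, skipping the separator itself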
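With more than one separator row, slicing by hand gets repetitive. One alternative, which is not part of the answer above, is to number the sentences with a cumulative sum over the separator mask and group on that number. This is only a sketch on a shortened sample; `is_sep`, `sentence_id` and `parts` are illustrative names:

```python
import pandas as pd

a = pd.DataFrame([
    ('evening', 'O'), ('.', 'B-geo'),
    (' ', ''),
    ('The', 'O'), ('World', 'O'),
])

is_sep = a[0] == " "            # True on separator rows
sentence_id = is_sep.cumsum()   # 0 before the first separator, 1 after it, and so on

# Drop the separator rows, then build one list of tuples per sentence id
parts = [
    [tuple(row) for row in group.values.tolist()]
    for _, group in a[~is_sep].groupby(sentence_id[~is_sep])
]
print(parts)
```

Because the grouping key is computed from the mask rather than from row labels, this does not rely on the dataframe having the default 0..n-1 index.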

So in the end you'll have something like this:

def dataset_to_list_of_tuple(df):
    final_list = []
    # Positions of the separator rows (assumes the default 0..n-1 index)
    index_list = df.index[df[0] == " "].tolist()
    start = 0
    # Add the end of the dataframe as a last boundary so the final sentence is kept
    for end in index_list + [len(df)]:
        df_part = df.iloc[start:end, :]
        sentence = [tuple(u) for u in df_part.values.tolist()]
        if sentence:
            final_list.append(sentence)
        start = end + 1  # skip the separator row itself
    return final_list
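If the rows are already available as a plain list of (word, tag) tuples, the same split can be sketched without pandas using `itertools.groupby` from the standard library. The helper name `split_on_separator` and its `sep` parameter are assumptions made for this example:

```python
from itertools import groupby

def split_on_separator(pairs, sep=" "):
    # Group consecutive rows by whether the word is the separator,
    # then keep only the runs of real tokens
    return [
        list(run)
        for is_sep, run in groupby(pairs, key=lambda p: p[0] == sep)
        if not is_sep
    ]

pairs = [
    ('evening', 'O'), ('.', 'B-geo'),
    (' ', ''),
    ('The', 'O'), ('World', 'O'), ('...', 'N-geo'),
]
sentences = split_on_separator(pairs)
print(sentences)
```

Consecutive separator rows collapse into a single boundary here, so no empty sentences are produced.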


 