I have a function that tokenises words from a list of tuples:

def get_word_tokens(tokens):
    words = [token[0] for token in tokens]
    return words
I want to apply this to a column in a Dask dataframe and create a new column, e.g.
df1

#  phrase              tokens
0  call CHRIS MOBILE.  [(call, 0, 4), (CHRIS, 5, 10), (MOBILE, 11, 17)]
1  call Tod Sarks      [(call, 0, 4), (Tod, 5, 8), (Sarks, 9, 14)]
Create a column words:

df1

#  phrase              tokens                                            words
0  call CHRIS MOBILE.  [(call, 0, 4), (CHRIS, 5, 10), (MOBILE, 11, 17)]  call, CHRIS, MOBILE
1  call Tod Sarks      [(call, 0, 4), (Tod, 5, 8), (Sarks, 9, 14)]       call, Tod, Sarks
I have tried:

df['words'] = df.apply(lambda row: get_word_tokens(df['tokens']), axis=1)

This appears to work, but it takes a very long time to run. Is there a faster method?
You are passing df['tokens'] to the function, which is the full column rather than the current row's value, so every row processes the entire column. This should work:
def get_word_tokens(tokens):
    words = [token[0] for token in tokens]
    return words

data = [
    ['call CHRIS MOBILE.', [('call', 0, 4),
                            ('CHRIS', 5, 10),
                            ('MOBILE', 11, 17)]],
    ['call Tod Sarks', [('call', 0, 4),
                        ('Tod', 5, 8),
                        ('Sarks', 9, 14)]],
]

import pandas as pd

df = pd.DataFrame(data, columns=['phrase', 'tokens'])
df = pd.concat([df, df, df, df, df, df])

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)

def get_word_tokens_df(df):
    df['words'] = df['tokens'].apply(get_word_tokens)
    return df

ddf = ddf.map_partitions(get_word_tokens_df)
ddf.compute()
Try this:

df.join(df['tokens'].str.extractall(r'([A-Za-z]\w+)')
          .groupby(level=0).agg(','.join)
          .squeeze().rename('words'))
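Note that the .str accessor's extractall only operates on string values, so this one-liner assumes the tokens column holds the string representation of the tuples (as it would after, say, a CSV round-trip) rather than actual Python lists. A minimal sketch under that assumption:

```python
import pandas as pd

# Assumed: 'tokens' is stored as a string, not a list of tuples.
df = pd.DataFrame({
    'phrase': ['call CHRIS MOBILE.', 'call Tod Sarks'],
    'tokens': ["[('call', 0, 4), ('CHRIS', 5, 10), ('MOBILE', 11, 17)]",
               "[('call', 0, 4), ('Tod', 5, 8), ('Sarks', 9, 14)]"],
})

# extractall pulls every word-like run; groupby(level=0) regroups the
# matches by original row index, then agg joins them with commas.
out = df.join(df['tokens'].str.extractall(r'([A-Za-z]\w+)')
                .groupby(level=0).agg(','.join)
                .squeeze().rename('words'))
print(out['words'].tolist())
```

If the column holds real lists of tuples, the per-row apply from the first answer is the right tool instead.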