
Palantir Foundry: using an imported dataset to perform NLP operations with PySpark

I ran the simple code below in a Palantir Foundry code workbook and it worked. Now I want to pass it a dataset that I imported and that is sitting in my graph. The dataset is a PySpark DataFrame with a single column containing 1000 rows of text. So I want to substitute text="some random text" with a Spark dataset that contains many rows.

import nltk
import nltk.tokenize as nt

# the tokenizer and tagger models must be downloaded once per environment
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "Being more Pythonic is good for health."
ss = nt.sent_tokenize(text)
tokenized_sent = [nt.word_tokenize(sent) for sent in ss]
pos_sentences = [nltk.pos_tag(sent) for sent in tokenized_sent]
pos_sentences

In your Python transforms, you can wrap your code in a UDF. A UDF is not very performant, but it lets you write exactly that code, i.e.:

import nltk
import nltk.tokenize as nt
from pyspark.sql import functions as F
from pyspark.sql import types as T

def tokenize(text):
    ss = nt.sent_tokenize(text)
    tokenized_sent = [nt.word_tokenize(sent) for sent in ss]
    # stringify the nested list so it fits a simple StringType column
    return str([nltk.pos_tag(sent) for sent in tokenized_sent])

# the original snippet passed `translate` here; the name must match the function above
tokenize_udf = F.udf(tokenize, T.StringType())

df = df.withColumn("result", tokenize_udf(F.col("text")))

