I ran the simple code below in a Palantir Foundry Code Workbook and it worked. Now I want to pass it a dataset that I imported and that is sitting in my graph. The dataset is a PySpark DataFrame with one column and 1000 rows of text, so I want to replace text="some random text" with a Spark dataset that contains many rows.
import nltk.tokenize as nt
import nltk
text="Being more Pythonic is good for health."
ss=nt.sent_tokenize(text)
tokenized_sent=[nt.word_tokenize(sent) for sent in ss]
pos_sentences=[nltk.pos_tag(sent) for sent in tokenized_sent]
pos_sentences
In your Python transforms, you can wrap your code in a UDF. A UDF is not very performant, since every row is serialized and passed through the Python interpreter, but it lets you reuse exactly that code, i.e.:
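import nltk
import nltk.tokenize as nt
from pyspark.sql import functions as F, types as T

def tokenize(text):
    ss = nt.sent_tokenize(text)
    tokenized_sent = [nt.word_tokenize(sent) for sent in ss]
    # Turn each (word, tag) tuple into a list so it matches the array schema below.
    return [[list(pair) for pair in nltk.pos_tag(sent)] for sent in tokenized_sent]

# Register tokenize (not an undefined name) with a return type matching its nested output.
tokenize_udf = F.udf(tokenize, T.ArrayType(T.ArrayType(T.ArrayType(T.StringType()))))
df = df.withColumn("result", tokenize_udf(F.col("text")))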
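A real dataset often contains null or blank rows, which would make the tokenizer raise inside the UDF. The following is a minimal sketch of a null-safe variant, not a definitive implementation. The names `safe_tokenize` and `add_pos_tags` and the default column names `"text"` and `"result"` are illustrative assumptions, and the sketch assumes the NLTK tokenizer and tagger data are available wherever the UDF executes.

```python
def safe_tokenize(text):
    """Null-safe variant of tokenize: returns [] for missing or blank text."""
    if text is None or not text.strip():
        return []
    # Imported inside the function so NLTK is only needed where the UDF runs.
    import nltk
    import nltk.tokenize as nt
    sentences = nt.sent_tokenize(text)
    # One list per sentence, each holding [word, tag] pairs as plain lists.
    return [[list(pair) for pair in nltk.pos_tag(nt.word_tokenize(s))]
            for s in sentences]


def add_pos_tags(df, text_col="text", out_col="result"):
    """Return df with an extra column of POS tags computed from text_col."""
    from pyspark.sql import functions as F, types as T
    # Sentences -> tokens -> [word, tag], all stored as strings.
    schema = T.ArrayType(T.ArrayType(T.ArrayType(T.StringType())))
    pos_udf = F.udf(safe_tokenize, schema)
    return df.withColumn(out_col, pos_udf(F.col(text_col)))
```

In a code workbook, where a Python transform receives its input dataset as a function parameter, the transform body could then simply be `return add_pos_tags(my_dataset)`, with `my_dataset` standing in for whatever the imported dataset is called in the graph.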