Palantir Foundry: using an imported dataset to perform NLP operations with PySpark

Question

I ran the simple code below in a Palantir Foundry Code Workbook and it worked. Now I want to run it on a dataset that I imported and that sits in my graph. The dataset is a PySpark DataFrame with one column containing 1000 rows of text. How do I replace the hard-coded text="some random text" with that Spark dataset, so the code runs over every row?

import nltk.tokenize as nt
import nltk
text="Being more Pythonic is good for health."
ss=nt.sent_tokenize(text)
tokenized_sent=[nt.word_tokenize(sent) for sent in ss]
pos_sentences=[nltk.pos_tag(sent) for sent in tokenized_sent]
pos_sentences

Answer 1

In your Python transform, you can wrap your code in a UDF (user-defined function). A Python UDF is not very performant, because every row is serialized between the JVM and a Python worker, but it lets you reuse exactly that code. For example:

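import nltk
import nltk.tokenize as nt
from pyspark.sql import functions as F, types as T

def tokenize(text):
    # Split the text into sentences, then POS-tag the words of each sentence.
    ss = nt.sent_tokenize(text)
    tokenized_sent = [nt.word_tokenize(sent) for sent in ss]
    # Turn each (word, tag) tuple into a list so it matches the array schema below.
    return [[list(pair) for pair in nltk.pos_tag(sent)] for sent in tokenized_sent]

# Result type: one array per sentence, each holding [word, tag] pairs.
tokenize_udf = F.udf(tokenize, T.ArrayType(T.ArrayType(T.ArrayType(T.StringType()))))

# df is the imported dataset; "text" is assumed to be the name of its text column.
df = df.withColumn("result", tokenize_udf(F.col("text")))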
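
One thing to watch with a row-level UDF: Spark passes null cells to Python as None, and nt.sent_tokenize expects a string, so a single empty row can fail the whole job. Below is a minimal sketch of a guard, assuming you want an empty result for such rows. The names null_safe and toy_tokenize are hypothetical helpers, not part of PySpark or NLTK. toy_tokenize only stands in for the NLTK-based tokenize function above so the sketch runs without any dependencies.

```python
from functools import wraps


def null_safe(fn, default=None):
    """Wrap a row-level function so None or blank input returns `default`."""
    @wraps(fn)
    def wrapper(text):
        # Skip the wrapped function entirely for missing or whitespace-only text.
        if text is None or not text.strip():
            return default
        return fn(text)
    return wrapper


def toy_tokenize(text):
    # Stand-in for the NLTK tokenize function: splits on ". " and whitespace,
    # and labels every word "UNK" instead of computing a real POS tag.
    return [[[word, "UNK"] for word in sentence.split()]
            for sentence in text.split(". ")]


safe_tokenize = null_safe(toy_tokenize, default=[])
```

In the transform itself you would pass null_safe(tokenize, default=[]) to F.udf in place of tokenize. The rest of the code stays the same.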

solution1
1 2021-02-11 09:35:18
