Error upon converting a pandas dataframe to spark DataFrame
I created a pandas DataFrame from some StackOverflow posts, using lxml.etree to separate the code blocks from the text blocks. The code below shows the basic outline:
import lxml.etree
import nltk
import pandas as pd

# tokensentRDD holds (Id, tokens) pairs: join the tokens, then unescape the angle brackets.
a1 = tokensentRDD.map(lambda kv: (kv[0], ''.join(map(str, kv[1]))))
a2 = a1.map(lambda kv: (kv[0], kv[1].replace("&lt;", "<")))
a3 = a2.map(lambda kv: (kv[0], kv[1].replace("&gt;", ">")))

def parsefunc(x):
    html = lxml.etree.HTML(x)
    code_block = html.xpath('//code/text()')
    # The XPath was garbled in the post ('// /text()'); '//*/text()' is assumed here.
    text_block = html.xpath('//*/text()')
    a4 = code_block
    a5 = len(code_block)
    a6 = text_block
    a7 = len(text_block)
    a8 = ''.join(map(str, text_block)).split(' ')
    a9 = len(a8)
    a10 = nltk.word_tokenize(''.join(map(str, text_block)))
    numOfI = 0
    numOfQue = 0
    numOfExclam = 0
    # Count occurrences of 'I', '?' and '!' among the tokens.
    for token in a10:
        if token == 'I':
            numOfI += 1
        elif token == '?':
            numOfQue += 1
        elif token == '!':
            numOfExclam += 1
    return (a4, a5, a6, a7, a9, numOfI, numOfQue, numOfExclam)

a11 = a3.take(6)
a12 = [(a, parsefunc(b)) for (a, b) in a11]
columns = ['code_block', 'len_code', 'text_block', 'len_text', 'words@text_block', 'numOfI', 'numOfQ', 'numOfExclam']
index = [x[0] for x in a12]
data = [x[1] for x in a12]
df = pd.DataFrame(data=data, columns=columns, index=index)
df.index.name = 'Id'
df
    code_block  len_code  text_block                                         len_text  words@text_block  numOfI  numOfQ  numOfExclam
Id
4   [decimal    3         [I want to use a track-bar to change a form's ...  18        72                5       1       0
6   [div, ]     5         [I have an absolutely positioned , div, conta...   22        96                4       4       0
9   [DateTime]  1         [Given a , DateTime, representing a person's ...   4         21                2       2       0
11  [DateTime]  1         [Given a specific , DateTime, value, how do I...   12        24                2       1       0
I need to create a Spark DataFrame from it so that I can apply machine learning algorithms to the output. I tried:
sqlContext.createDataFrame(df).show()
The error I get is:
TypeError: not supported type: <class 'lxml.etree._ElementStringResult'>
Can someone tell me the correct way to convert a pandas DataFrame into a Spark DataFrame?
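Your problem is not related to pandas. Both code_block (a4) and text_block (a6) contain lxml-specific objects, which cannot be encoded using Spark SQL types. Converting them to strings should be enough: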
a4 = [str(x) for x in code_block]
a6 = [str(x) for x in text_block]
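To illustrate why the str() conversion helps, here is a minimal sketch that runs without lxml or Spark. It assumes the failure comes from a type lookup that matches on the exact Python class, so a string subclass (which is what lxml returns for text() results) is rejected even though it behaves like a string. FakeElementString, SUPPORTED and infer_type are hypothetical stand-ins made up for this example; they are not part of lxml or PySpark.

```python
# Hypothetical stand-in for lxml.etree._ElementStringResult:
# a str subclass, so it behaves like a string but has a different class.
class FakeElementString(str):
    pass


# Hypothetical, simplified lookup table keyed by exact class.
# This is an assumption about how the inference behaves, not Spark's real table.
SUPPORTED = {str: "string", int: "bigint", float: "double"}


def infer_type(value):
    """Look up the exact class of value; subclasses do not match."""
    try:
        return SUPPORTED[type(value)]
    except KeyError:
        raise TypeError("not supported type: %s" % type(value))


def to_plain_strings(values):
    """Same idea as the fix above: rebuild the list from plain str objects."""
    return [str(v) for v in values]


code_block = [FakeElementString("decimal"), FakeElementString("double")]

# The subclass instance is rejected by the exact-class lookup.
try:
    infer_type(code_block[0])
    failed = False
except TypeError:
    failed = True

# After conversion every element is a plain str and the lookup succeeds.
cleaned = to_plain_strings(code_block)
kinds = [infer_type(v) for v in cleaned]
```

The same list comprehension can be applied inside parsefunc before the tuple is returned, so that the pandas DataFrame only ever holds plain Python strings.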