Python Spark take random sample based on column
This is my Python Spark code:
from pyspark import SparkConf, SparkContext

def parseLinesEcf4(line):  # get the fields we need
    fields = line.split('\t')
    id1 = fields[0]
    id2 = fields[1]
    ecfp4 = float(fields[2])
    return (id1, id2, ecfp4)  # return the three fields

conf = SparkConf().setMaster("local").setAppName("Second")
sc = SparkContext(conf=conf)

fileTwo = sc.textFile("PS21_ECFP4.tsv")  # load the data
dataTwo = fileTwo.map(parseLinesEcf4)
My input looks like this:
The size of my file is around 900 GB. What I need is to take the rows whose column-1 values correspond to a 10% sample of the unique values of that column, because one compound has more than one entry.
I tried takeSample() and sampleBy(), but neither returns what I am looking for.

Any help?
You can try to use the pyspark.ml library.
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare training and test data.
data = spark.read.format("libsvm") \
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
https://spark.apache.org/docs/2.1.0/ml-tuning.html#example-model-selection-via-train-validation-split
But be aware: to use it you need to vectorise your data using VectorAssembler.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
https://spark.apache.org/docs/latest/ml-features.html#vectorassembler