How to extract information from PCollection<Row> after a join in Apache Beam?
I have two example streams of data on which I perform an inner join. I would like to extend this example join code and add some logic after the join occurs:
public class JoinExample {
public static void main(String[] args) {
final Pipeline pipeline = Pipeline.create(pipelineOpts);
PCollection<Row> adStream =
pipeline
.apply(From.source("kafka.adStream"))
.apply(Select.fieldNames("ad.id", "ad.name"))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
PCollection<Row> clickStream =
pipeline
.apply(From.source("kafka.clickStream"))
.apply(Select.fieldNames("ad.id", "numClicks"))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
adStream
.apply(Join.<Row, Row>innerJoin(clickStream).using("id"))
.apply(ConsoleOutput.of(Row::toString)); // Instead of this output, I would like to just print the ad name and num clicks after the join
pipeline.run();
}
}
I would like to print just the ad name and the number of clicks after the join, using a DoFn like this:
adStream
.apply(Join.<Row, Row>innerJoin(clickStream).using("id"))
.apply(ParDo.of(new DoFn<Row, Integer>() {
@ProcessElement
public void processElement(ProcessContext c) {
// Since there are two rows after the join, how can I get info from each row?
// Example in:
// ad.id = 1, ad.name = test
// ad.id = 1, numClicks = 1000
// After join
// Row: [Row:[1, test], Row:[1, 1000]]
// I tried this statement but it is incorrect
Row one = c.element.getRow(0); // This API is not available
}
}));
Any ideas on how to extract this info from the joined data?
As you learned, the schema Join transform emulates a SQL join, in which the result of the join is the concatenation of the rows from the joined PCollections. To see which rows went into the inner join, you have to use the CoGroup transform to join the PCollections. This returns a Row object with an individual iterable for each of the PCollections, containing the Rows that match the key. Example:
import org.apache.beam.sdk.schemas.transforms.CoGroup;
import org.apache.beam.sdk.schemas.transforms.CoGroup.By;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
public class JoinExample {
  public static void main(String[] args) {
    final Pipeline pipeline = Pipeline.create(pipelineOpts);
    PCollection<Row> adStream =
        pipeline
            .apply(From.source("kafka.adStream"))
            .apply(Select.fieldNames("ad.id", "ad.name"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
    PCollection<Row> clickStream =
        pipeline
            .apply(From.source("kafka.clickStream"))
            .apply(Select.fieldNames("ad.id", "numClicks"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
    // The names given here for the PCollections can be used to retrieve
    // the rows in the consuming PTransform. See below:
    PCollectionTuple.of("adStream", adStream, "clickStream", clickStream)
        // This selects the common field name in both adStream and clickStream
        // to join on. See the documentation for ways of joining on
        // different keys.
        .apply(CoGroup.join(By.fieldNames("id")))
        .apply(ParDo.of(new DoFn<Row, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Get the key. The "key" field is itself a Row holding the join field(s).
            Object id = c.element().getRow("key").getValue("id");
            // Get rows from the adStream and clickStream PCollections that
            // share the same id.
            Iterable<Row> ads = c.element().getIterable("adStream");
            Iterable<Row> clicks = c.element().getIterable("clickStream");
          }
        }));
    pipeline.run();
  }
}
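To make the shape of the grouped result concrete, here is a minimal plain-Java sketch that models it without any Beam dependency. Everything in it (CoGroupSketch, Ad, Click, Grouped, coGroup, innerJoinLines) is a hypothetical stand-in invented for illustration, not Beam API. It only mirrors the idea described above: each key maps to one list of matching rows per input, and pairing every ad with every click under the same key gives the inner-join output the question asks for (ad name and number of clicks).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoGroupSketch {
    // Hypothetical stand-ins for rows of the two inputs; these are not Beam types.
    static final class Ad {
        final int id;
        final String name;
        Ad(int id, String name) { this.id = id; this.name = name; }
    }

    static final class Click {
        final int id;
        final long numClicks;
        Click(int id, long numClicks) { this.id = id; this.numClicks = numClicks; }
    }

    // Models one grouped element: the rows from each input that share a key.
    static final class Grouped {
        final List<Ad> ads = new ArrayList<>();
        final List<Click> clicks = new ArrayList<>();
    }

    // Groups both inputs by id, keeping a separate list per input under each key.
    static Map<Integer, Grouped> coGroup(List<Ad> ads, List<Click> clicks) {
        Map<Integer, Grouped> byId = new LinkedHashMap<>();
        for (Ad ad : ads) {
            byId.computeIfAbsent(ad.id, k -> new Grouped()).ads.add(ad);
        }
        for (Click click : clicks) {
            byId.computeIfAbsent(click.id, k -> new Grouped()).clicks.add(click);
        }
        return byId;
    }

    // Inner-join semantics: one line per (ad, click) pair under the same key.
    // A key that is missing on either side contributes nothing.
    static List<String> innerJoinLines(Map<Integer, Grouped> grouped) {
        List<String> lines = new ArrayList<>();
        for (Grouped g : grouped.values()) {
            for (Ad ad : g.ads) {
                for (Click click : g.clicks) {
                    lines.add(ad.name + ": " + click.numClicks);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Ad> ads = Arrays.asList(new Ad(1, "test"), new Ad(2, "unclicked"));
        List<Click> clicks = Arrays.asList(new Click(1, 1000L));
        // Prints "test: 1000"; ad 2 has no clicks, so it is dropped.
        innerJoinLines(coGroup(ads, clicks)).forEach(System.out::println);
    }
}
```

In the actual pipeline, the same nested loop would go inside processElement, iterating over the two Iterable<Row> values and reading the ad name and click count from each Row by field name.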