Pig: Re-parsing a String into a Tuple in Java

Question

I'll have a Pig script that ends by storing its contents in a text file.

STORE foo into 'outputLocation';

During a completely different job I want to read the lines of this file and parse them back into Tuples. The data in foo might contain chararrays with characters that Pig uses when saving bags and tuples, such as { } ( ) and the comma. I can read the previously saved file using code like this:

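// Requires java.io.BufferedReader, java.io.InputStreamReader, java.nio.charset.StandardCharsets
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FileStatus[] fileStatuses = fs.listStatus(new Path("outputLocation"));

for (FileStatus fileStatus : fileStatuses) {
    // Only read the part files written by the job
    if (fileStatus.getPath().getName().contains("part")) {
        // DataInputStream.readLine() is deprecated, so wrap the stream in a reader;
        // try-with-resources closes it when the file has been read
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(fileStatus.getPath()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Do stuff
            }
        }
    }
}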

At the point marked // Do stuff, I'd like to parse my String back into a Tuple. Is this possible, and does Pig provide an API for it? The closest I could find is the textToTuple function of the StorageUtil class, but that just produces a Tuple containing a single DataByteArray. I want a Tuple containing the bags, tuples, and chararrays it originally had, so I can easily retrieve the original fields. I can change the StoreFunc I use to save the original file, if that helps.

Answer 1

This is a plain Pig solution that uses neither JSON nor a UDF. I found it the hard way.

import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.newplan.logical.relational.LogicalSchema;
import org.apache.pig.impl.util.Utils;

Let's say the string to be parsed is this:

String tupleString = "(quick,123,{(brown,1.0),(fox,2.5)})";

First, parse your schema string. Note that the schema needs an enclosing tuple (a0 here), because the stored line is itself a tuple.

LogicalSchema schema = Utils.parseSchema("a0:(a1:chararray, a2:long, a3:{(a4:chararray, a5:double)})");

Then parse your tuple string using that schema.

Utf8StorageConverter converter = new Utf8StorageConverter();
ResourceFieldSchema fieldSchema = new ResourceFieldSchema(schema.getField("a0"));
Tuple tuple = converter.bytesToTuple(tupleString.getBytes("UTF-8"), fieldSchema);

Voila! Check your data.

// JUnit's assertEquals takes the expected value first, then the actual value
assertEquals("quick", (String) tuple.get(0));
assertEquals(2L, ((DataBag) tuple.get(2)).size());

Answer 2

I would just output the data in JSON format. Pig has native support for parsing JSON into tuples, which would save you from having to write a UDF.
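
To see why the characters mentioned in the question matter, here is a minimal sketch in plain Java that splits the text form of a tuple into its top-level fields by tracking bracket depth. The class TopLevelFieldSplitter and its method splitTopLevel are hypothetical names for illustration and are not part of Pig. The sketch assumes that chararrays are written without escaping, which is the concern the question raises; it only illustrates the ambiguity and does not replace Pig's own converter.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, NOT part of Pig: splits the text form of a tuple
// into its top-level fields by tracking bracket depth.
class TopLevelFieldSplitter {

    static List<String> splitTopLevel(String tupleText) {
        String text = tupleText.trim();
        if (text.length() < 2 || text.charAt(0) != '('
                || text.charAt(text.length() - 1) != ')') {
            throw new IllegalArgumentException("Expected text of the form (...): " + tupleText);
        }
        // Strip the enclosing parentheses of the outer tuple
        String body = text.substring(1, text.length() - 1);
        List<String> fields = new ArrayList<>();
        if (body.isEmpty()) {
            return fields;
        }
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            // Assumption: ( ) { } [ ] only ever appear as structure, never inside a value
            if (c == '(' || c == '{' || c == '[') {
                depth++;
            } else if (c == ')' || c == '}' || c == ']') {
                depth--;
            }
            if (c == ',' && depth == 0) {
                // A comma outside any nested bag, tuple, or map ends the current field
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitTopLevel("(quick,123,{(brown,1.0),(fox,2.5)})"));
        System.out.println(splitTopLevel("(a,b,1)"));
    }
}
```

With this sketch, "(quick,123,{(brown,1.0),(fox,2.5)})" splits into three fields, but so does "(a,b,1)", whether the original tuple held three fields or the chararray "a,b" followed by 1. A format with escaping rules, such as JSON, avoids that ambiguity.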

Answer 1: score 1, posted 2013-11-04 10:14:27
Answer 2: score 0, accepted, posted 2013-03-28 23:13:48
