A GenericUDF Function to Extract a Field From an Array of Structs
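I am trying to write a GenericUDF function that, for each record, collects all values of a specific struct field within an array and returns them in an array as well.
I wrote the GenericUDF (below), and it seems to work, but:
1) It does not work when I run it on an external table, although it works fine on a managed table. Any ideas why?
2) I am having a hard time testing it. I have attached the test I have so far, but it does not work: it always fails with either "java.util.ArrayList cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector" or "cannot cast String to LazyString". My question is: how do I supply a list of structs to the evaluate method?
Any help will be greatly appreciated.
The table: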
CREATE EXTERNAL TABLE FOO (
  TS string,
  customerId string,
  products array< struct<productCategory:string> >
)
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'some.serde'
WITH SERDEPROPERTIES ('error.ignore'='true')
LOCATION 'some_locations'
;
A row of records holds:
1340321132000, 'some_company', [{"productCategory":"footwear"},{"productCategory":"eyewear"}]
This is my code:
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.lazy.LazyString;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

import java.util.ArrayList;

@Description(name = "extract_product_category",
    value = "_FUNC_( array< struct<productcategory:string> > ) - Collect all product category field values inside an array of struct(s), and return the results in an array<string>",
    extended = "Example:\n SELECT _FUNC_(array_of_structs_with_product_category_field)")
public class GenericUDFExtractProductCategory
        extends GenericUDF
{
    private ArrayList ret;

    private ListObjectInspector listOI;
    private StructObjectInspector structOI;
    private ObjectInspector prodCatOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args)
            throws UDFArgumentException
    {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("The function extract_product_category() requires exactly one argument.");
        }

        if (args[0].getCategory() != Category.LIST) {
            throw new UDFArgumentTypeException(0, "Type array<struct> is expected to be the argument for extract_product_category but " + args[0].getTypeName() + " is found instead");
        }

        listOI = ((ListObjectInspector) args[0]);
        structOI = ((StructObjectInspector) listOI.getListElementObjectInspector());

        if (structOI.getAllStructFieldRefs().size() != 1) {
            throw new UDFArgumentTypeException(0, "Incorrect number of fields in the struct, should be one");
        }

        StructField productCategoryField = structOI.getStructFieldRef("productCategory");
        //If not, throw exception
        if (productCategoryField == null) {
            throw new UDFArgumentTypeException(0, "NO \"productCategory\" field in input structure");
        }

        //Are they of the correct types?
        //We store these object inspectors for use in the evaluate() method
        prodCatOI = productCategoryField.getFieldObjectInspector();

        //First are they primitives
        if (prodCatOI.getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0, "productCategory field must be of string type");
        }

        //Are they of the correct primitives?
        if (((PrimitiveObjectInspector)prodCatOI).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentTypeException(0, "productCategory field must be of string type");
        }

        ret = new ArrayList();

        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public ArrayList evaluate(DeferredObject[] arguments)
            throws HiveException
    {
        ret.clear();

        if (arguments.length != 1) {
            return null;
        }

        if (arguments[0].get() == null) {
            return null;
        }

        int numElements = listOI.getListLength(arguments[0].get());

        for (int i = 0; i < numElements; i++) {
            LazyString prodCatDataObject = (LazyString) (structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), i), structOI.getStructFieldRef("productCategory")));
            Text productCategoryValue = ((StringObjectInspector) prodCatOI).getPrimitiveWritableObject(prodCatDataObject);
            ret.add(productCategoryValue);
        }
        return ret;
    }

    @Override
    public String getDisplayString(String[] strings)
    {
        assert (strings.length > 0);
        StringBuilder sb = new StringBuilder();
        sb.append("extract_product_category(");
        sb.append(strings[0]);
        sb.append(")");
        return sb.toString();
    }
}
My test:
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.testng.annotations.Test;

import java.util.ArrayList;
import java.util.List;

public class TestGenericUDFExtractShas
{
    ArrayList<String> fieldNames = new ArrayList<String>();
    ArrayList<ObjectInspector> fieldObjectInspectors = new ArrayList<ObjectInspector>();

    @Test
    public void simpleTest()
            throws Exception
    {
        ListObjectInspector firstInspector = new MyListObjectInspector();

        ArrayList test = new ArrayList();
        test.add("test");

        ArrayList test2 = new ArrayList();
        test2.add(test);

        StructObjectInspector soi = ObjectInspectorFactory.getStandardStructObjectInspector(test, test2);

        fieldNames.add("productCategory");
        fieldObjectInspectors.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);

        GenericUDF.DeferredObject firstDeferredObject = new MyDeferredObject(test2);

        GenericUDF extract_product_category = new GenericUDFExtractProductCategory();

        extract_product_category.initialize(new ObjectInspector[]{firstInspector});

        extract_product_category.evaluate(new DeferredObject[]{firstDeferredObject});
    }

    public class MyDeferredObject implements DeferredObject
    {
        private Object value;

        public MyDeferredObject(Object value) {
            this.value = value;
        }

        @Override
        public Object get() throws HiveException
        {
            return value;
        }
    }

    private class MyListObjectInspector implements ListObjectInspector
    {
        @Override
        public ObjectInspector getListElementObjectInspector()
        {
            return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldObjectInspectors);
        }

        @Override
        public Object getListElement(Object data, int index)
        {
            List myList = (List) data;
            // Guard both ends: index == size() is already out of range.
            if (myList == null || index < 0 || index >= myList.size()) {
                return null;
            }
            return myList.get(index);
        }

        @Override
        public int getListLength(Object data)
        {
            if (data == null) {
                return -1;
            }
            return ((List) data).size();
        }

        @Override
        public List<?> getList(Object data)
        {
            return (List) data;
        }

        @Override
        public String getTypeName()
        {
            return null; // Not implemented; only used in the UDF's error message.
        }

        @Override
        public Category getCategory()
        {
            return Category.LIST;
        }
    }
}
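The two exceptions quoted in the question appear to have separate causes. The call getStandardStructObjectInspector(test, test2) receives a list containing an ArrayList where a list of ObjectInspectors is expected, and evaluate casts each field to LazyString although the test supplies plain Java objects. As for the input shape: with Hive's standard object inspectors, an array value is a java.util.List and a struct value is a List (or Object[]) of its field values in declaration order. The sketch below builds that shape in plain Java. StructListSketch and its helper methods are hypothetical names, not Hive classes, and it uses String field values, which would pair with PrimitiveObjectInspectorFactory.javaStringObjectInspector (the writable inspector would expect org.apache.hadoop.io.Text instead).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: shapes test input
// as a value of type array<struct<productCategory:string>>.
class StructListSketch {

    // One struct = one List holding the field values in declaration order.
    static List<Object> struct(Object... fieldValues) {
        return new ArrayList<>(Arrays.asList(fieldValues));
    }

    // One array<struct<...>> value = a List of such structs.
    static List<List<Object>> products(String... categories) {
        List<List<Object>> rows = new ArrayList<>();
        for (String category : categories) {
            rows.add(struct(category));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Mirrors the sample row:
        // [{"productCategory":"footwear"},{"productCategory":"eyewear"}]
        List<List<Object>> value = products("footwear", "eyewear");
        System.out.println(value); // prints [[footwear], [eyewear]]
    }
}
```

A list shaped like this is what the DeferredObject passed to evaluate would return from get(), provided the inspectors handed to initialize describe the same layout.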
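I can't speak to the testing, but with a caveat discussed below I believe I have a solution for the issue with external tables.
In adapting your code to my needs, I changed string to long in the evaluate method:
Your code: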
LazyString prodCatDataObject = (LazyString) (structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), i), structOI.getStructFieldRef("productCategory")));
Text productCategoryValue = ((StringObjectInspector) prodCatOI).getPrimitiveWritableObject(prodCatDataObject);
My old code:
LazyLong indDataObject = (LazyLong) (structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), i), structOI.getStructFieldRef(indexName)));
LongWritable indValue = ((LazyLongObjectInspector) indOI).getPrimitiveWritableObject(indDataObject);
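As you can see, they are the same logic with different data types.
This worked for me on a non-external table. It did not work on an external table.
I was able to solve this by replacing my old code with the following: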
long indValue = (Long) (structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), i), structOI.getStructFieldRef(indexName)));
In another version, I was returning a Text.
You can probably do something similar, i.e. cast to Text/String in the first step.
You may also need to change public Text evaluate(DeferredObject[] arguments) to public Object evaluate(DeferredObject[] arguments).
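A possible reason the LazyString and LazyLong casts fail on some tables is that the concrete class of the field object depends on the SerDe that produced the row, so a cast that matches one table's representation may not match another's. That explanation is an assumption, not something confirmed here. The sketch below illustrates the idea in plain Java with hypothetical stand-in types (LazyText and TextInspector are not Hive classes): the extraction asks an inspector-like object to interpret the raw value instead of casting it. In a Hive UDF, the analogous step would be along the lines of calling ((StringObjectInspector) prodCatOI).getPrimitiveJavaObject(fieldData) on the uncast field object.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration; every type here is a hypothetical stand-in, not a Hive class.
class InspectorSketch {

    // Stand-in for a lazily decoded string value.
    static final class LazyText {
        private final byte[] bytes;

        LazyText(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        String decode() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Stand-in for an object inspector: knows how to read one representation.
    interface TextInspector {
        String read(Object raw);
    }

    static final TextInspector LAZY = raw -> ((LazyText) raw).decode();
    static final TextInspector PLAIN = raw -> (String) raw;

    // Fragile: hard-codes one representation, like the (LazyString) cast.
    static List<String> extractWithCast(List<Object> values) {
        List<String> out = new ArrayList<>();
        for (Object v : values) {
            out.add(((LazyText) v).decode());
        }
        return out;
    }

    // More robust: the inspector that came with the data interprets it.
    static List<String> extractWithInspector(List<Object> values, TextInspector inspector) {
        List<String> out = new ArrayList<>();
        for (Object v : values) {
            out.add(inspector.read(v));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> lazyRow = Arrays.asList(new LazyText("footwear"), new LazyText("eyewear"));
        List<Object> plainRow = Arrays.asList("footwear", "eyewear");

        System.out.println(extractWithInspector(lazyRow, LAZY));
        System.out.println(extractWithInspector(plainRow, PLAIN));
        try {
            extractWithCast(plainRow);
        } catch (ClassCastException e) {
            System.out.println("hard cast failed on the plain representation");
        }
    }
}
```

The same extraction loop handles both representations once the reading step is delegated, which is the role the stored field inspector would play in evaluate.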
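The source code for some working UDFs that handle arrays is available here.
Now for the caveat: this does not appear to work with tables stored as ORC. (Note that the original code did not work either.) I will probably open a question about this. I'm not sure what the issue is.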