How to read in an RCFile

I am trying to read a small RCFile (~200 rows of data) into a HashMap to do a map-side join, but I am having a lot of trouble getting the data in the file into a usable state.

Here is what I have so far, most of which is lifted from this example:

    public void configure(JobConf job)                                                                                                   
    {   
        try
        {                                                                                                                                
            FileSystem fs = FileSystem.get(job);                                                                                         
            RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("/path/to/file"), job);          
            int counter = 1;   
            while (rcFileReader.next(new LongWritable(counter)))
            {
                System.out.println("Fetching data for row " + counter);                                                  
                BytesRefArrayWritable dataRead = new BytesRefArrayWritable();                                                            
                rcFileReader.getCurrentRow(dataRead);                                                                                    
                System.out.println("dataRead: " + dataRead + " dataRead.size(): " + dataRead.size());
                for (int i = 0; i < dataRead.size(); i++)                                                                                
                {
                    BytesRefWritable bytesRefRead = dataRead.get(i);                               
                    byte b1[] = bytesRefRead.getData();                                                                                  
                    Text returnData = new Text(b1);
                    System.out.println("READ-DATA = " + returnData.toString());                                                          
                }                                                        
                counter++;
            } 
        }
        catch (IOException e)
        {             
            throw new Error(e);
        }             
    }   

However, the output I get has all of the data in each column concatenated together in the first row, and no data in any of the other rows.

Fetching data for row 1
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@7f26d3df dataRead.size(): 5
READ-DATA = 191606656066860670
READ-DATA = United StatesAmerican SamoaGuamNorthern Mariana Islands
READ-DATA = USASGUMP
READ-DATA = USSouth PacificSouth PacificSouth Pacific
READ-DATA = 19888
Fetching data for row 2
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@1cb1a4e2 dataRead.size(): 0
Fetching data for row 3
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@52c00025 dataRead.size(): 0
Fetching data for row 4
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@3b49a794 dataRead.size(): 0

How do I read this data properly so that I have access to one row at a time, e.g.

(191, United States, US, US, 19)?

After some more digging, I've found a solution. The key here is not to use RCFile.Reader but to use RCFileRecordReader.

Here is what I ended up with, adapted to open multiple files as well:

try                                                                                                                                     
{                                                                     
    FileSystem fs = FileSystem.get(job);                                                                                         
    FileStatus [] fileStatuses = fs.listStatus(new Path("/path/to/dir/"));                               
    LongWritable key = new LongWritable();                                                                                       
    BytesRefArrayWritable value = new BytesRefArrayWritable();                                                                   
    int counter = 1;                                                                                                             
    for (int i = 0; i < fileStatuses.length; i++)                                                                                
    {                                                                                                                            
        FileStatus fileStatus = fileStatuses[i];                                                                                 
        if (!fileStatus.isDir())                                                                                                 
        {                                                                                                                        
            System.out.println("File: " + fileStatus);                                                                           
            FileSplit split = new FileSplit(fileStatus.getPath(), 0, fileStatus.getLen(), job);                                  
            RCFileRecordReader reader = new RCFileRecordReader(job, split);                                                      
            while (reader.next(key, value))                                                                                      
            {                                                                                                                    
                System.out.println("Getting row " + counter);                                                                    
                AllCountriesRow acr = AllCountriesRow.valueOf(value);                                                            
                System.out.println("ROW: " + acr);                                                                                                                                                        
                counter++;                                                                                                       
            }                                                                                                                    
        }                                                                                                                        
    }                                                                                                                                                                                                                                                         
}                                                                                                                                
catch (IOException e)                                                                                                            
{                                                                                                                                
    throw new Error(e);                                                                                                          
}

And AllCountriesRow.valueOf:

(Note that Column is an enum of the columns in the order that they appear in each row, and serDe is an instance of ColumnarSerDe.)

public static AllCountriesRow valueOf(BytesRefArrayWritable braw) throws IOException                                                     
{   
    try                                                                                                                                  
    {
        StructObjectInspector soi = (StructObjectInspector) serDe.getObjectInspector();                                                  
        Object row = serDe.deserialize(braw);                                                                                                                                                                                 
        List<? extends StructField> fieldRefs = soi.getAllStructFieldRefs();                                                                                                                                              

        Object fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.ID.ordinal()));                                                                  
        ObjectInspector oi = fieldRefs.get(Column.ID.ordinal()).getFieldObjectInspector();                                               
        int id = ((IntObjectInspector)oi).get(fieldData);                                                                                

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.NAME.ordinal()));                                                   
        oi = fieldRefs.get(Column.NAME.ordinal()).getFieldObjectInspector();                                                             
        String name = ((StringObjectInspector)oi).getPrimitiveJavaObject(fieldData);                                                     

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CODE.ordinal()));                                                   
        oi = fieldRefs.get(Column.CODE.ordinal()).getFieldObjectInspector();
        String code = ((StringObjectInspector)oi).getPrimitiveJavaObject(fieldData);                                                     

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.REGION_NAME.ordinal()));                                            
        oi = fieldRefs.get(Column.REGION_NAME.ordinal()).getFieldObjectInspector();                                                      
        String regionName = ((StringObjectInspector)oi).getPrimitiveJavaObject(fieldData);                                               

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CONTINENT_ID.ordinal()));                                           
        oi = fieldRefs.get(Column.CONTINENT_ID.ordinal()).getFieldObjectInspector();                                                     
        int continentId = ((IntObjectInspector)oi).get(fieldData);                                                                       

        return new AllCountriesRow(id, name, code, regionName, continentId);                                                             
    }               
    catch (SerDeException e)
    {               
        throw new IOException(e);                                                                                                        
    }                   
}                       

This ends up with an AllCountriesRow object that holds all the information for the relevant row.
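To connect this back to the map-side join from the question, here is a minimal sketch of how decoded rows could be indexed in a HashMap. It uses only the standard library: the `Row` class is a hypothetical stand-in for AllCountriesRow (its field names are assumed from the constructor call above), `indexById` is an illustrative helper, and the input is a list of already-decoded strings rather than a BytesRefArrayWritable.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CountryLookupSketch {

    /** Column order, mirroring the Column enum described above. */
    enum Column { ID, NAME, CODE, REGION_NAME, CONTINENT_ID }

    /** Hypothetical stand-in for AllCountriesRow; the field names are assumed. */
    static final class Row {
        final int id;
        final String name;
        final String code;
        final String regionName;
        final int continentId;

        Row(int id, String name, String code, String regionName, int continentId) {
            this.id = id;
            this.name = name;
            this.code = code;
            this.regionName = regionName;
            this.continentId = continentId;
        }
    }

    /** Builds a Row from already-decoded field values, indexed by Column.ordinal(). */
    static Row fromFields(List<String> fields) {
        return new Row(
            Integer.parseInt(fields.get(Column.ID.ordinal())),
            fields.get(Column.NAME.ordinal()),
            fields.get(Column.CODE.ordinal()),
            fields.get(Column.REGION_NAME.ordinal()),
            Integer.parseInt(fields.get(Column.CONTINENT_ID.ordinal())));
    }

    /** Indexes rows by id so that each map() call can do a constant-time lookup. */
    static Map<Integer, Row> indexById(List<List<String>> records) {
        Map<Integer, Row> byId = new HashMap<>();
        for (List<String> record : records) {
            Row row = fromFields(record);
            byId.put(row.id, row);
        }
        return byId;
    }

    public static void main(String[] args) {
        // The sample row is the one quoted in the question.
        List<List<String>> records = Arrays.asList(
            Arrays.asList("191", "United States", "US", "US", "19"));
        Map<Integer, Row> byId = indexById(records);
        System.out.println(byId.get(191).name);
    }
}
```

In a real job the map would presumably be filled once in configure(), inside the reader loop shown earlier, and then consulted from map() for each input record.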

Due to the columnar nature of RCFiles, the row-wise read path is significantly different from the write path. We can still use the RCFile.Reader class to read an RCFile row-wise (RCFileRecordReader is not needed), but in addition we need a ColumnarSerDe to convert the columnar data to row-wise data.
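As a rough illustration of what "columnar" means here, the following standard-library sketch keeps each column's values back to back in one buffer and rebuilds a row by slicing every column at the same row index. It is a simplified model, not the actual RCFile on-disk format: `ColumnBuffer`, `valueAt`, `wholeBuffer` and `rowAt` are hypothetical names, and the sample values are an assumed split of two columns from the question's output.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnarLayoutSketch {

    /** One column: every value stored back to back, plus each value's start and length. */
    static final class ColumnBuffer {
        final byte[] data;
        final int[] starts;
        final int[] lengths;

        ColumnBuffer(String... values) {
            starts = new int[values.length];
            lengths = new int[values.length];
            byte[][] encoded = new byte[values.length][];
            int total = 0;
            for (int i = 0; i < values.length; i++) {
                encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
                starts[i] = total;
                lengths[i] = encoded[i].length;
                total += encoded[i].length;
            }
            data = new byte[total];
            for (int i = 0; i < values.length; i++) {
                System.arraycopy(encoded[i], 0, data, starts[i], lengths[i]);
            }
        }

        /** Decodes only the slice that belongs to the given row. */
        String valueAt(int row) {
            return new String(data, starts[row], lengths[row], StandardCharsets.UTF_8);
        }

        /** Decodes the whole buffer, ignoring row boundaries. */
        String wholeBuffer() {
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    /** Rebuilds one row by taking the same row index from every column. */
    static List<String> rowAt(List<ColumnBuffer> columns, int row) {
        List<String> fields = new ArrayList<>();
        for (ColumnBuffer column : columns) {
            fields.add(column.valueAt(row));
        }
        return fields;
    }

    public static void main(String[] args) {
        ColumnBuffer names = new ColumnBuffer(
            "United States", "American Samoa", "Guam", "Northern Mariana Islands");
        ColumnBuffer codes = new ColumnBuffer("US", "AS", "GU", "MP");
        System.out.println(names.wholeBuffer());
        System.out.println(rowAt(Arrays.asList(names, codes), 0));
    }
}
```

Decoding a whole column buffer in this model gives concatenated text that looks like the READ-DATA lines in the question, while slicing by row index gives one value per column. This is the job the SerDe takes over in the real read path.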

The following is the simplest code we could arrive at for reading an RCFile row-wise. Please refer to the inline code comments for more details.

private static void readRCFileByRow(String pathStr)
  throws IOException, SerDeException {

  final Configuration conf = new Configuration();

  final Properties tbl = new Properties();

  /*
   * Set the column names and types using delimited strings (commas for the
   * names, colons for the types here).
   * The actual names of the columns are not important, as long as the
   * column count is correct.
   *
   * For types, this example uses strings. A byte[] can be stored as a string
   * by encoding the bytes to ASCII (for example as a hex string or Base64).
   *
   * The number of columns and the number of types must match exactly.
   */
  tbl.setProperty("columns", "col1,col2,col3,col4,col5");
  tbl.setProperty("columns.types", "string:string:string:string:string");

  /*
   * We need a ColumnarSerDe to de-serialize the columnar data to row-wise 
   * data 
   */
  ColumnarSerDe serDe = new ColumnarSerDe();
  serDe.initialize(conf, tbl);

  Path path = new Path(pathStr);
  FileSystem fs = FileSystem.get(conf);
  final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);

  final LongWritable key = new LongWritable();
  final BytesRefArrayWritable cols = new BytesRefArrayWritable();

  while (reader.next(key)) {
    System.out.println("Getting next row.");

    /*
     * IMPORTANT: Pass the same cols object to getCurrentRow on every
     * iteration; do not create a new BytesRefArrayWritable() each time.
     * One call to getCurrentRow(cols) can potentially read more than one
     * column value, which the SerDe below then reads one by one.
     */
    reader.getCurrentRow(cols);

    final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
    final ArrayList<Object> objects = row.getFieldsAsList();
    for (final Object object : objects) {
      // Lazy decompression happens here
      final String payload = 
        ((LazyString) object).getWritableObject().toString();
      System.out.println("Value:" + payload);
    }
  }
  // Release the underlying file stream once all rows have been read
  reader.close();
}

In this code, getCurrentRow still reads the data column-wise, and we need to use a SerDe to convert it to rows. Also, calling getCurrentRow() does not mean that all the fields in the row have been decompressed. In fact, with lazy decompression, a column is not decompressed until one of its fields is deserialized. For this, we used ColumnarStruct.getFieldsAsList() to get a list of references to the lazy objects. The actual reading happens in the getWritableObject() call on the LazyString reference.
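The general idea behind lazy decoding can be sketched with the standard library alone. The `LazyField` class below is a hypothetical illustration and not Hive's LazyString implementation: it holds a decoder and runs it only on the first `get()` call, then returns the cached value.

```java
import java.util.function.Supplier;

class LazyFieldSketch {

    /** Hypothetical lazy holder: the decoder runs on first access only. */
    static final class LazyField<T> {
        private final Supplier<T> decoder;
        private T value;
        private boolean decoded;

        LazyField(Supplier<T> decoder) {
            this.decoder = decoder;
        }

        /** Reports whether the decoder has already run. */
        boolean isDecoded() {
            return decoded;
        }

        /** Decodes on the first call and returns the cached value afterwards. */
        T get() {
            if (!decoded) {
                value = decoder.get();
                decoded = true;
            }
            return value;
        }
    }

    public static void main(String[] args) {
        int[] decodeCalls = {0};
        LazyField<String> name = new LazyField<>(() -> {
            decodeCalls[0]++; // stands in for the expensive decompression step
            return "United States";
        });
        System.out.println("decoded before get(): " + name.isDecoded());
        System.out.println("value: " + name.get());
        name.get(); // second access reuses the cached value
        System.out.println("decoder calls: " + decodeCalls[0]);
    }
}
```

Holding a reference to such a field costs nothing until it is read, which matches the behaviour described above, where work is deferred until a field is actually accessed.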

Another way of achieving the same thing would be to use StructObjectInspector and the copyToStandardObject API, but I find the above method simpler.
