Convert Dataframe Dataset<Row> to JSON Format of String Data Type for particular Columns and convert the JSON String back to Dataframe

I have a dataframe. I need to call a REST API for each record.

Let's say the dataframe looks like:

|----|-------------|-----|---------|
|UUID|PID          |DEVID|FIRSTNAME|
|----|-------------|-----|---------|
|1111|1234567891011|ABC11|JOHN     |
|2222|9876543256827|ABC22|HARRY    |
|----|-------------|-----|---------|

The JSON request string for the first row should look like this (note: the JSON is built from 2 columns, not all of them), because the REST API to be called requires its input in this format:

{"applicationInfo": {"appId": "ec78fef4-92b9-3b1b-a68d-c45376b6977a"}, "requestData": [{"secureData": "JOHN", "secureDataType": "FIRSTNAME", "index": 1 }, {"secureData": "1234567891011", "secureDataType": "PID", "index": 2 } ] }

The value of the index key has to be generated on the fly, using an incremental counter for each row.
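To make the counter concrete, here is a minimal plain-Java sketch of how the requestData entries for one row could be numbered. It deliberately avoids Spark and Jackson so it runs on its own; `RequestEntrySketch`, `Entry`, and `entriesFor` are hypothetical names, and the row is represented as a simple map of column name to value, which is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RequestEntrySketch {

    /** One element of the "requestData" array (hypothetical holder class). */
    static final class Entry {
        final String secureData;
        final String secureDataType;
        final int index;

        Entry(String secureData, String secureDataType, int index) {
            this.secureData = secureData;
            this.secureDataType = secureDataType;
            this.index = index;
        }
    }

    /** Creates one entry per secured column, numbering them 1, 2, ... in the order given. */
    static List<Entry> entriesFor(Map<String, String> row, List<String> securedColumns) {
        List<Entry> entries = new ArrayList<>();
        int counter = 0;
        for (String column : securedColumns) {
            counter++; // incremental counter, generated on the fly
            entries.add(new Entry(row.get(column), column, counter));
        }
        return entries;
    }

    public static void main(String[] args) {
        // First row of the sample dataframe, as column name -> value
        Map<String, String> row = new LinkedHashMap<>();
        row.put("UUID", "1111");
        row.put("PID", "1234567891011");
        row.put("DEVID", "ABC11");
        row.put("FIRSTNAME", "JOHN");

        // Only FIRSTNAME and PID are sent to the API
        for (Entry e : entriesFor(row, Arrays.asList("FIRSTNAME", "PID"))) {
            System.out.println(e.index + " " + e.secureDataType + " " + e.secureData);
        }
    }
}
```

The columns that are not listed (UUID and DEVID here) never produce an entry, which matches the requirement that the JSON is built from two columns only.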

Then I need to call the REST API, sending the above JSON as a string parameter.

The response from the API after encryption will look like:

{"responseData":[{"resultCode":"00","secureData":"63ygdydshbhgvdyw3et7edgu","secureDataType":"FIRSTNAME","index":1},{"resultCode":"00","secureData":"HKJJBJHVHG66456456FXXFFCGF","secureDataType":"PID","index":2}],"responseCode":"00","responseMessage":"SUCCESS","resultCounts":{"totalCount":2,"successCount":2,"failedCount":0}}

Then I need to read the above response and create a dataframe, which should look like:

|----|--------------------------|-----|------------------------|
|UUID|PID                       |DEVID|FIRSTNAME               |
|----|--------------------------|-----|------------------------|
|1111|HKJJBJHVHG66456456FXXFFCGF|ABC11|63ygdydshbhgvdyw3et7edgu|
|----|--------------------------|-----|------------------------|

If I convert the initial input dataframe with toJSON().collectAsList(), it looks like:

[{"UUID":"1111","PID":"1234567891011","DEVID":"ABC11","FIRSTNAME":"JOHN"}, {"UUID":"2222","PID":"9876543256827","DEVID":"ABC22","FIRSTNAME":"HARRY"}]

But this doesn't work, because the REST API requires its input in the specific format mentioned above. Please help.

For the above, I assume that the data set has been partitioned across the Spark workers and that it is a generic Dataset of Row (a data frame). In that case, the mechanism below can be used.

  1. Define a class with the required attributes as a data container
  2. Take the data set content as a List (the takeAsList method, if it is a Dataset)
  3. Create and populate the objects of your data container, and store them in a way that lets you identify them later: you will have to repopulate them with the data returned by the API
  4. Serialize the list into a JSON array with Jackson. Steps 4 and 5 can be combined by using a Jackson custom serializer
  5. Make the REST call and repopulate the data container objects (after deserializing the response with Jackson)
  6. Create a data frame
  7. Process the data frame (dataset of rows)
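Step 3 asks for the request objects to be stored so that they can be identified later. One way to read that, sketched below in plain Java, is a small registry that hands out a unique index per cell and remembers which row and column it came from. `CellRegistrySketch`, `CellRef`, `register`, and `lookup` are hypothetical names, and the sketch assumes the API echoes back whatever integer index it receives, as the sample response suggests.

```java
import java.util.HashMap;
import java.util.Map;

final class CellRegistrySketch {

    /** Identifies where a request entry came from: row position plus column name. */
    static final class CellRef {
        final int rowPosition;
        final String column;

        CellRef(int rowPosition, String column) {
            this.rowPosition = rowPosition;
            this.column = column;
        }
    }

    private final Map<Integer, CellRef> byIndex = new HashMap<>();
    private int nextIndex = 1;

    /** Registers a cell and returns the index to send in the request for it. */
    int register(int rowPosition, String column) {
        int index = nextIndex++;
        byIndex.put(index, new CellRef(rowPosition, column));
        return index;
    }

    /** Returns the cell a response element belongs to, or null if the index is unknown. */
    CellRef lookup(int index) {
        return byIndex.get(index);
    }

    public static void main(String[] args) {
        CellRegistrySketch registry = new CellRegistrySketch();
        String[] securedColumns = {"FIRSTNAME", "PID"};

        // Register the secured cells of two rows
        for (int row = 0; row < 2; row++) {
            for (String column : securedColumns) {
                int index = registry.register(row, column);
                System.out.println("index " + index + " -> row " + row + ", column " + column);
            }
        }

        // Later, a response element carrying index 3 can be traced back to its cell
        CellRef ref = registry.lookup(3);
        System.out.println("index 3 belongs to row " + ref.rowPosition + ", column " + ref.column);
    }
}
```

With such a lookup in place, the response elements do not have to arrive in request order: each one carries the index needed to find its row and column again.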

NOTE: The JSON structure you have provided does not seem to be correct; a JSON array has the form [{},{},{}].


In your case, given the format of the request JSON, direct conversion of rows will not work. As mentioned in point 1, make a set of model classes; you could consider the model classes below.

package org.test.json;

import java.util.List;

public class RequestModel {

protected ApplicationInfo applicationInfo;
protected List<RequestData> requestData;

public ApplicationInfo getApplicationInfo() {return applicationInfo;}
public void setApplicationInfo(ApplicationInfo applicationInfo) {this.applicationInfo = applicationInfo;}

public List<RequestData> getRequestData() {return requestData;}
public void setRequestData(List<RequestData> requestData) {this.requestData = requestData;}

}//class closing




package org.test.json;

public class ApplicationInfo {

protected String appId;

public String getAppId() {return appId;}
public void setAppId(String appId) {this.appId = appId;}

}//class closing




package org.test.json;

public class RequestData {

protected String secureData;
protected String secureDataType;
protected int index;

public String getSecureData() {return secureData;}
public void setSecureData(String secureData) {this.secureData = secureData;}

public String getSecureDataType() {return secureDataType;}
public void setSecureDataType(String secureDataType) {this.secureDataType = secureDataType;}

public int getIndex() {return index;}
public void setIndex(int index) {this.index = index;}

}//class closing

Process the list obtained from the data frame, populate the model classes, and then convert them with Jackson to get the request JSON.


The code below should do what you are looking for. Don't run it directly: the data set is null.

    // Example only: ds is a null placeholder, so running this as-is throws a NullPointerException.
    // Needs java.util.ArrayList, java.util.List, org.apache.spark.sql.Dataset and org.apache.spark.sql.Row.
    Dataset<Row> ds = null;
    List<Row> rows = ds.collectAsList();

    RequestModel request = new RequestModel();

    // Set application id
    ApplicationInfo appInfo = new ApplicationInfo();
    appInfo.setAppId("some id");
    request.setApplicationInfo(appInfo);

    List<RequestData> reqData = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {

        // Incrementally generated for each row
        int index = i;

        Row r = rows.get(i);
        int rowLength = r.size();

        for (int j = 0; j < rowLength; j++) {

            RequestData dataElement = new RequestData();
            dataElement.setIndex(index);

            // Only column 1 (PID) and column 3 (FIRSTNAME) go into the request;
            // the element is added inside the case so other columns add nothing.
            switch (j) {

                case 1: { dataElement.setSecureData(r.getString(j)); dataElement.setSecureDataType("PID"); reqData.add(dataElement); break; }
                case 3: { dataElement.setSecureData(r.getString(j)); dataElement.setSecureDataType("FIRSTNAME"); reqData.add(dataElement); break; }
                default: { break; }

            }//switch closing

        }//for closing

    }//for closing

    // Attach the collected entries to the request
    request.setRequestData(reqData);

I updated my code to correct the for loop. Now it's giving the correct result.

But how do I flatten the response string and extract the PID and FIRSTNAME values from the ResponseModel object?

        List<Row> list = df.collectAsList();
        List<Row> responseList = new ArrayList<>();

        // One mapper is enough; it does not need to be recreated per row
        ObjectMapper objectMapper = new ObjectMapper();
        objectMapper.enable(SerializationFeature.INDENT_OUTPUT);

        for (Row r : list) {
            String responseStr = "{\"responseData\":[{\"resultCode\":\"00\",\"secureData\":\"63ygdydshbhgvdyw3et7edgu\",\"secureDataType\":\"FIRSTNAME\",\"index\":1},{\"resultCode\":\"00\",\"secureData\":\"HKJJBJHVHG66456456FXXFFCGF\",\"secureDataType\":\"PID\",\"index\":2}],\"responseCode\":\"00\",\"responseMessage\":\"SUCCESS\",\"resultCounts\":{\"totalCount\":2,\"successCount\":2,\"failedCount\":0}}";
            ResponseModel responseModel = objectMapper.readValue(responseStr, ResponseModel.class);
            responseList.add(RowFactory.create((String) r.getAs("UUID"), (String) r.getAs("DEVID")));
        }

        // Build the dataframe once, after all rows have been collected
        Dataset<Row> test = spark.createDataFrame(responseList, schema);

For testing purposes, I have hardcoded the response string inside the loop.

How do I extract the values of PID and FIRSTNAME and add them to the above responseList to create a dataframe (UUID, PID, DEVID, FIRSTNAME)? Here:

ResponseModel class has: ResultCounts, List<ResponseData>, String responseCode, String responseMessage
ResponseData class has: String resultCode, String secureData, String secureDataType, int index

Please help.
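One possible way to do the extraction, sketched in plain Java: index the response elements by secureDataType, then assemble the values in UUID, PID, DEVID, FIRSTNAME order. The `ResponseData` class here is a stand-in that mirrors the fields listed above so the sketch runs without Jackson, and `ResponseMergeSketch` and `mergedRow` are hypothetical names. In the Spark code, the resulting array could presumably be passed to RowFactory.create(...), with the list coming from a getter on ResponseModel; that getter is assumed, since the class itself is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ResponseMergeSketch {

    /** Stand-in mirroring the ResponseData fields listed in the question. */
    static final class ResponseData {
        final String resultCode;
        final String secureData;
        final String secureDataType;
        final int index;

        ResponseData(String resultCode, String secureData, String secureDataType, int index) {
            this.resultCode = resultCode;
            this.secureData = secureData;
            this.secureDataType = secureDataType;
            this.index = index;
        }
    }

    /** Returns the row values in UUID, PID, DEVID, FIRSTNAME order. */
    static Object[] mergedRow(String uuid, String devId, List<ResponseData> responseData) {
        // Index the encrypted values by their type ("PID", "FIRSTNAME")
        Map<String, String> byType = new HashMap<>();
        for (ResponseData d : responseData) {
            byType.put(d.secureDataType, d.secureData);
        }
        // A type missing from the response yields null in that position
        return new Object[] {uuid, byType.get("PID"), devId, byType.get("FIRSTNAME")};
    }

    public static void main(String[] args) {
        // Same values as the sample response in the question
        List<ResponseData> responseData = Arrays.asList(
                new ResponseData("00", "63ygdydshbhgvdyw3et7edgu", "FIRSTNAME", 1),
                new ResponseData("00", "HKJJBJHVHG66456456FXXFFCGF", "PID", 2));

        Object[] values = mergedRow("1111", "ABC11", responseData);
        System.out.println(Arrays.toString(values));
    }
}
```

Keying on secureDataType works when each request covers a single row. If several rows are sent in one request, the index field would be needed as well to tell the rows apart.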
