
How to construct JSON file using Pandas

I am trying to take a CSV file and construct a JSON file with the values. The JSON file needs to be in a very specific format to be imported into Azure.

I am very new to Python, in fact this is the first time I'm using Python properly.

I have started using Pandas to convert the CSV into a dataframe, then doing a small amount of formatting before converting to JSON. This is a good start, but it's not quite formatted properly. Please see below.

import pandas as pd

df = pd.read_csv("C:\\Users***Required_data.csv")

filtered = df['Work Item Type'].str.contains('Task')

dftest = df[filtered]
dftest = dftest.rename(columns={
    "Work Item Type": "System.WorkItemType",
    "Title": "System.Title",
    "AssignedTo": "System.AssignedTo",
    "State": "System.State",
    "Tags": "System.Tags",
    "Description": "System.Description",
})
dftest["System.AreaPath"] = "**********"

dftest.to_json(r"C:\\Users****\\Required_datatest.json", indent=4, orient="records")

This gives me the following JSON format: an array of objects.

Source data: (CSV screenshot omitted)

My attempt's result:

[
    {
        "ID":15898,
        "System.WorkItemType":"Task",
        "System.Title":"TK 1.2.1 -  Example data",
        "System.AssignedTo":null,
        "System.State":"New",
        "System.Tags":null,
        "Parent":15887,
        "System.Description":"Example data",
        "System.AreaPath":"Example data"
    }
]

However, I'm trying to build the following structure:

Target data in JSON format:

{
      "count": 36,
      "value": [
        {
          "id": 487,
          "rev": 1,
          "fields": {
            "System.AreaPath": "Example data",
            "System.TeamProject": "Example data",
            "System.IterationPath": "Example data",
            "System.WorkItemType": "Task",
            "System.State": "New",
            "System.Reason": "New",
            "System.CreatedDate": "2021-02-22T19:13:24.81Z",
            "System.CreatedBy": "Example data",
            "System.ChangedDate": "2021-02-22T19:13:24.81Z",
            "System.ChangedBy": "Example data",
            "System.Title": "Example data",
            "Microsoft.VSTS.Scheduling.Effort": 0.0,
            "System.Description": "Example data",
            "System.AssignedTo": null,
            "Microsoft.VSTS.Scheduling.RemainingWork": 0.0,
            "Microsoft.VSTS.Common.Priority": 2.0,
            "System.BoardLane": null,
            "System.Tags": null,
            "Microsoft.VSTS.TCM.Steps": null,
            "Microsoft.VSTS.TCM.Parameters": null,
            "Microsoft.VSTS.TCM.LocalDataSource": null,
            "Microsoft.VSTS.TCM.AutomationStatus": null,
            "System.History": null
          },
          "relations": [
            {
              "rel": "System.LinkTypes.Hierarchy-Reverse",
              "url": "Example data",
              "attributes": {
                "isLocked": "false",
                "name": "Parent"
              }
            }
          ],
          "url": "Example data"
        }
    ]
    }

As you can see, the array is wrapped inside another object that has 'count' and 'value' keys, and each of my dataframe's records then needs to sit inside the 'fields' object of an element of 'value'.

Can anyone offer guidance here? I'm a bit stuck. If Pandas is not the right tool, please let me know. Please also suggest the simplest solution, as I'm still learning and would like to understand it.

Thank you in advance.

You could use a function like transform_data below to do the additional transformation you need.

import pandas as pd


REVISION = 1


def exclude_keys(to_exclude: dict, *excluded_keys) -> dict:
    def predicate(key_val):
        key, val = key_val
        return key not in excluded_keys
    return dict(filter(predicate, to_exclude.items()))


def transform_data(to_transform: pd.DataFrame) -> dict:
    records = to_transform.to_dict("records")
    values = [
        {
            "id": record["ID"],
            "rev": REVISION,
            "fields": exclude_keys(record, "ID")
        }
        for record in records
    ]
    return {
        "count": len(records),
        "value": values
    }

Calling transform_data should have this result:

>>> transform_data(dftest)
{'count': 1,
 'value': [{'id': 15898,
   'rev': 1,
   'fields': {'System.WorkItemType': 'Task',
    'System.Title': 'TK 1.2.1 -  Example data',
    'System.AssignedTo': None,
    'System.State': 'New',
    'System.Tags': None,
    'Parent': 15887,
    'System.Description': 'Example data',
    'System.AreaPath': 'Example data'}}]}
>>> import json
>>> with open("~/path/to/output.json", "w") as fd: json.dump(transform_data(dftest), fd)
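One caveat, assuming your CSV has empty cells (the screenshots aren't included, so this is an assumption): to_dict("records") represents missing values as float NaN rather than None, and json.dump then writes a bare NaN token, which is not valid JSON. A small cleanup pass before dumping avoids that:

```python
import json

import pandas as pd


def clean_record(record: dict) -> dict:
    # Replace pandas' NaN placeholders with None so that json.dump
    # emits a valid JSON null instead of the invalid token NaN.
    return {key: (None if pd.isna(val) else val)
            for key, val in record.items()}


# Hypothetical frame with one missing value, mirroring the shape above:
df = pd.DataFrame({"System.Title": ["TK 1.2.1"],
                   "System.Tags": [float("nan")]})
records = [clean_record(rec) for rec in df.to_dict("records")]
print(json.dumps(records))  # the missing tag is serialized as null
```

You could apply clean_record to each record inside transform_data before building "fields".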

You should be able to adapt that code for any transformations or additional information you need to add to the data.

It might be possible to do in Pandas what I've done here in plain Python, but as far as I know Pandas is best suited to flat tabular data rather than the nested structure you need. Marshmallow might also be worth a look, as it handles nested JSON well: https://marshmallow.readthedocs.io/en/stable/quickstart.html
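For comparison, since the transformation is mostly plain-Python dict handling anyway, here is a rough end-to-end sketch using only the standard library's csv and json modules. The column names, the Task filter, and the hard-coded area path are taken from the question's snippets; the file names in the commented-out usage are placeholders:

```python
import csv
import json

# Mapping from the CSV's column names to the Azure field names
# (taken from the rename() call in the question).
RENAME = {
    "Work Item Type": "System.WorkItemType",
    "Title": "System.Title",
    "AssignedTo": "System.AssignedTo",
    "State": "System.State",
    "Tags": "System.Tags",
    "Description": "System.Description",
}


def build_payload(csv_path: str) -> dict:
    # Read the CSV and keep only the Task rows, as in the question.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.DictReader(handle)
                if row.get("Work Item Type") == "Task"]

    value = []
    for row in rows:
        # Rename columns, drop the ID (it becomes the top-level "id"),
        # and turn empty strings into None so they serialize as null.
        fields = {RENAME.get(key, key): (val or None)
                  for key, val in row.items() if key != "ID"}
        fields["System.AreaPath"] = "**********"  # same hard-coded value as the question
        value.append({"id": int(row["ID"]), "rev": 1, "fields": fields})

    return {"count": len(value), "value": value}


# payload = build_payload("Required_data.csv")
# with open("Required_datatest.json", "w", encoding="utf-8") as out:
#     json.dump(payload, out, indent=4)
```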
