data science, pandas, python,

How to work with JSON in Pandas

Posted on Dec 12, 2019 · 6 mins read
Share this

JSON is widely used format for storing the data and exchanging. Many of the API’s response are JSON and being light weight it’s used almost everywhere

In this post we will learn how to import a JSON File, JSON String, JSON API Response and import it to Pandas dataframe and work with it.

Pandas has built-in function read_json to import the JSON Strings and Files into pandas dataframe and json_normalize function works with nested json but it’s little hard to understand how to use it. We will understand that hard part in a simpler way in this post

Pandas Read_JSON

You can read a JSON string and convert it into a pandas dataframe using read_json() function. Here is a json string stored in variable data

data = '''[{
      "_id": "26",
        "Description": "10 YEAR",
        "TypeLevel1": "INTEREST",
        "Currency": "USD",
        "Operational": true,
        "TypeLevel2": "LONG",
      "Settlement": "90",
          "SCSP": "Rajna",
        "BBT": "CITITYM9",
        "TCK": "ZN",
          "SMCP": "01",
          "SMCP2": "02"
    },
    {
      "_id": "27",
        "Description": "5 YEAR",
        "TypeLevel1": "PRINCIPAL",
        "Currency": "GNP",
        "Operational": false,
        "TypeLevel2": "LONG",
      "Settlement": "40",
          "SCSP": "Paus",
        "BBT": "CITITYM10",
        "TCK": "PY",
          "SMCP": "05",
          "SMCP2": "09"
    }]'''

We’ve imported the json string in data variable and specified the orient parameter as columns. it basically tells what is the format of the expected json. We will see how to use the orient columns while reading the json data in next section

pd.read_json(data,orient='columns')

orient parameter in read_json

orient parameter is set to define the format of the input JSON. As per the official documentation the orient parameter can be any of the following values

'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

'records' : list like [{column -> value}, ... , {column -> value}]

'index' : dict like {index -> {column -> value}}

'columns' : dict like {column -> {index -> value}}

'values' : just the values array

Let’s understand what are these orient value means

Let’s create a dataframe with index r1, r2 and column c1 and c2

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['r1', 'r2'],
                   columns=['c1', 'c2'])

orient index

We will change the dataframe to a JSON string and orient set to index

df.to_json(orient='index')

The output shows a json string with dataframe indexes as keys i.e. r1 & r2

Output: ‘{“r1”:{“c1”:1,”c2”:2},”r2”:{“c1”:3,”c2”:4}}’

orient split

This time we will set the orient value as split for the same dataframe

df.to_json(orient='split')

The output is a JSON String with keys as columns, index and data so basically it splitted the above dataframe into these three key and their corresponding values

**Output:** '{"columns":["c1","c2"],"index":["r1","r2"],"data":[[1,2],[3,4]]}'

orient records

if we set orient as records then it will give list of dictionary and each dictionary will contains the row values.

df.to_json(orient='records')

So here in the output column c1,c2 for row r1,r2 is displapyed inside the list

**Output:** '[{"c1":1,"c2":2},{"c1":3,"c2":4}]'

read json with orient

As shown above if we set the orient as any of the above strings then it will import the data accordingly

Going back to our read_json function above we have seen that setting the parameter orient index imports all the column values row wise

Similarly in this statement the json string values are imported as columns and the index is r1,r2 because the ouput above was ‘{“r1”:{“c1”:1,”c2”:2},”r2”:{“c1”:3,”c2”:4}}’

pd.read_json(_, orient='index')

Here the index changes to 0,1 because we have set the orient as records and all the values are imported as columns for each row because the output above for orient record is list of column values as dictionary ‘[{“c1”:1,”c2”:2},{“c1”:3,”c2”:4}]’

pd.read_json(_, orient='records')

JSON_Normalize function

Import nested JSON API Response using json_normalize

We are using openweather api to get the climate details for the next 30 days for the city of Mountain View in US

Use your api-key by registering to their site and convert the api response to python object

import requests
weather_json = requests.get("http://pro.openweathermap.org/data/2.5/climate/month?zip=94040,us&appid=<your_api_key>")
weather_api_data=json.loads(weather_json.text)

API Response

{'city': {'coord': {'lat': 37.3855, 'lon': -122.088},
  'country': 'US',
  'id': 0,
  'name': 'Mountain View'},
 'cod': '200',
 'list': [{'dt': 1576108800,
   'humidity': 76.85,
   'pressure': 1019.38,
   'temp': {'average': 282.94,
    'average_max': 288.46,
    'average_min': 280.05,
    'record_max': 292.32,
    'record_min': 272.48},
   'wind_speed': 1.61}],
 'message': 0.587190406}

Use json_normalize() function to convert the api response into dataframe. In the next section we will understand how the record path and meta parameters are used to convert the nested json to dataframe


df=json_normalize(weather_api_data,['list'],meta=['cod',['city','country'],['city','name'],['city','id'],['city','coord','lat'],
                                                 ['city','coord','lon']])
df.head()

Final Dataframe

how json_normalize works for nested JSON

record_path

We have to specify the Path in each object to list of records. In the above json “list” is the json object that contains list of json object which we want to import in the dataframe, basically list is the nested object in the entire json. so we specify this path under records_path

df=json_normalize(weather_api_data,record_path = ['list'])

meta

This contains the list of paths. So in our case we want all other objects which are outside “list” to be imported as dataframe column. So we have to specify all those column name as list i.e.

meta=['cod',['city','country'],['city','name'],['city','id'],['city','coord','lat'],
  ['city','coord','lon']]

You must be wondering why there are list inside that meta list like [‘city’,’country’] , [‘city’,’name’] etc.

If you check the above JSON, city is a json object which has coordinate, country, id and name fields inside it

So in order to import all those fields in the dataframe we have to specify that as a list inside the meta list like [‘city’,’coord’,’lat’] , [‘city’,’name’] etc.

In the dataframe those columns are shown as city.coord.lat and city.name

Another Example of nested json response using json_normalize

Let’s understand this using another example

Here is the nested JSON we want to import in a dataframe

data = {
  "results": [
    {
      "_id": "25",
      "Product": {
        "Description": "3 YEAR",
        "TypeLevel1": "INTEREST",
        "TypeLevel2": "LONG"
      },
      "Settlement": {},
      "Xref": {
        "SCSP": "96",
          "BBT":"81"
      },
      "ProductSMCP": [
        {
          "SMCP": "01"
        }
      ]
    },
    {
      "_id": "26",
      "Product": {
        "Description": "10 YEAR",
        "TypeLevel1": "INTEREST",
        "Currency": "USD",
        "Operational": True,
        "TypeLevel2": "LONG"
      },
      "Settlement": {},
      "Xref": {
          "SCSP": "Rajna",
        "BBT": "CITITYM9",
        "TCK": "ZN"
      },
      "ProductSMCP": [
        {
          "SMCP": "01"
        },
        {
          "SMCP2": "02"
        }
      ]
    }
  ]
}

Our data is stored in results field, so we will use data[‘results’] as our dictionary item to be imported into dataframe

In this json we have field ProductSMCP which is a json array and we will pass that in record_path parameter

In the meta parameter we will pass other fields which we want to import in the dataframe i.e. meta = [‘Settlement’,[‘Xref’,’SCSP’],[‘Xref’,’BBT’],’_id’,[‘Product’,’Description’]]

Here is the complete line of code for importing this json into dataframe

json_normalize(data['results'],record_path=['ProductSMCP'],meta=['Settlement',['Xref','SCSP'],['Xref','BBT'],'_id',['Product','Description']])

You cannot see other fields of Product like Typelevel1, Typelevel2 etc. because for this example I have never passed that inside the meta list

Conclusion:

So we have come to an end of this long post and we have seen different ways to import the regular and nested JSON into pandas dataframe using read_json() and json_normalize()

We have also seen how to import Json data from api response and json string directly into a pandas dataframe

if you have any comments or suggestions please feel free to drop a note in the comments section below