I am trying to import the full, up-to-date COVID-19 dataset in JSON format into a pandas dataframe. I believe it should be possible using json_normalize, but I can't make it work.
First, I initialize and import the json data:
import pandas as pd
import numpy as np
import requests

url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.json'
r = requests.get(url)
json = r.json()  # note: reusing the name `json` shadows the standard json module, which is not needed again here
The resulting object is a dict.
When I inspect it, I see a complicated, nested structure: at the top level there is one dict per location, keyed by "iso_code". Each of these dicts holds some overall statistics as key/value pairs plus a "data" key containing the time series as a list of dicts, one per date. I hope I got that right...
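To double-check that reading, I probed a hand-made stand-in for the payload (the key names are copied from the snippets in this post):

```python
# Stand-in for the downloaded payload; key names copied from the real data
sample = {
    "AFG": {
        "continent": "Asia",
        "location": "Afghanistan",
        "data": [{"date": "2020-02-24", "total_cases": 5.0}],
    }
}

top_keys = list(sample.keys())   # iso codes, e.g. ['AFG']
series = sample["AFG"]["data"]   # the time series: a list of per-date dicts
print(top_keys, type(series).__name__, series[0]["date"])
```

So each top-level value is a dict of scalars plus one list-valued "data" entry.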
The raw output is one long line; this is the very beginning of it:
{"AFG":{"continent":"Asia","location":"Afghanistan","population":39835428.0,"population_density":54.422,"median_age":18.6,"aged_65_older":2.581,"aged_70_older":1.337,"gdp_per_capita":1803.987,"cardiovasc_death_rate":597.029,"diabetes_prevalence":9.59,"handwashing_facilities":37.746,"hospital_beds_per_thousand":0.5,"life_expectancy":64.83,"human_development_index":0.511,"data":[{"date":"2020-02-24","total_cases":5.0,"new_cases":5.0,"total_cases_per_million":0.126,"new_cases_per_million":0.126,"stringency_index":8.33},{"date":"2020-02-25",
For readability, here is a snippet formatted with line breaks and indentation:
{
"AFG":{
"continent":"Asia",
"location":"Afghanistan",
"population":39835428.0,
"population_density":54.422,
"median_age":18.6,
"aged_65_older":2.581,
"aged_70_older":1.337,
"gdp_per_capita":1803.987,
"cardiovasc_death_rate":597.029,
"diabetes_prevalence":9.59,
"handwashing_facilities":37.746,
"hospital_beds_per_thousand":0.5,
"life_expectancy":64.83,
"human_development_index":0.511,
"data":[
{
"date":"2020-02-24",
"total_cases":5.0,
"new_cases":5.0,
"total_cases_per_million":0.126,
"new_cases_per_million":0.126,
"stringency_index":8.33
},
...
{
"date":"2021-12-27",
"total_cases":157967.0,
"new_cases":16.0,
"new_cases_smoothed":24.286,
"total_deaths":7354.0,
"new_deaths":0.0,
"new_deaths_smoothed":2.714,
"total_cases_per_million":3965.49,
"new_cases_per_million":0.402,
"new_cases_smoothed_per_million":0.61,
"total_deaths_per_million":184.61,
"new_deaths_per_million":0.0,
"new_deaths_smoothed_per_million":0.068
}
]
},
...
"ZWE":{
"continent":"Africa",
"location":"Zimbabwe",
"population":14862927.0,
"population_density":42.729,
"median_age":19.6,
"aged_65_older":2.822,
"aged_70_older":1.882,
"gdp_per_capita":1899.775,
"extreme_poverty":21.4,
"cardiovasc_death_rate":307.846,
"diabetes_prevalence":1.82,
"female_smokers":1.6,
"male_smokers":30.7,
"handwashing_facilities":36.791,
"hospital_beds_per_thousand":1.7,
"life_expectancy":61.49,
"human_development_index":0.571,
"data":[
{
"date":"2020-03-23",
"total_cases":3.0,
"new_cases":0.0,
"total_deaths":1.0,
"new_deaths":1.0,
"total_cases_per_million":0.202,
"new_cases_per_million":0.0,
"total_deaths_per_million":0.067,
"new_deaths_per_million":0.067,
"stringency_index":45.37
},
...
{
"date":"2021-12-28",
"total_cases":207548.0,
"new_cases":2099.0,
"new_cases_smoothed":1397.143,
"total_deaths":4940.0,
"new_deaths":32.0,
"new_deaths_smoothed":17.286,
"total_cases_per_million":13752.031,
"new_cases_per_million":139.079,
"new_cases_smoothed_per_million":92.574,
"total_deaths_per_million":327.322,
"new_deaths_per_million":2.12,
"new_deaths_smoothed_per_million":1.145,
"total_vaccinations":7222341.0,
"people_vaccinated":4107151.0,
"people_fully_vaccinated":3115190.0,
"new_vaccinations":3903.0,
"new_vaccinations_smoothed":10056.0,
"total_vaccinations_per_hundred":47.85,
"people_vaccinated_per_hundred":27.21,
"people_fully_vaccinated_per_hundred":20.64,
"new_vaccinations_smoothed_per_million":666.0,
"new_people_vaccinated_smoothed":5407.0,
"new_people_vaccinated_smoothed_per_hundred":0.036
}
]
}
}
My goal is in the first place to be able to access the time series for e.g. "new_cases_per_million" for each location. Further on, I could imagine to relate e.g. "total_cases_per_million" or "new_vaccinations_smoothed_per_million" to "gdp_per_capita" and so on.
As the actual time series are nested under the "data" key, I tried something like this:
data = pd.json_normalize(json,record_path=['data'],errors='ignore')
which gives the error:
KeyError Traceback (most recent call last)
<ipython-input-20-602e9c355c71> in <module>
----> 1 data = pd.json_normalize(json,record_path=['data'],errors='ignore')
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
334 records.extend(recs)
335
--> 336 _recursive_extract(data, record_path, {}, level=0)
337
338 result = DataFrame(records)
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
307 else:
308 for obj in data:
--> 309 recs = _pull_records(obj, path[0])
310 recs = [
311 nested_to_record(r, sep=sep, max_level=max_level)
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _pull_records(js, spec)
246 if has non iterable value.
247 """
--> 248 result = _pull_field(js, spec)
249
250 # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _pull_field(js, spec)
237 result = result[field]
238 else:
--> 239 result = result[spec]
240 return result
241
KeyError: 'data'
I can run the command without setting the record_path like this:
data = pd.json_normalize(json)
but this, of course, does not expand the nested data.
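To illustrate what I mean, here is a sketch on a miniature stand-in for the payload (key names taken from the snippets above): without record_path, the whole dict is treated as one record, nested dicts become dot-separated columns, and the "data" lists are left unexpanded.

```python
import pandas as pd

# Miniature stand-in for the real payload
sample = {
    "AFG": {"continent": "Asia",
            "data": [{"date": "2020-02-24", "new_cases": 5.0}]},
    "ZWE": {"continent": "Africa",
            "data": [{"date": "2020-03-23", "new_cases": 0.0}]},
}

flat = pd.json_normalize(sample)
# A single row with columns such as 'AFG.continent' and 'AFG.data';
# the 'data' time series stay as raw Python lists inside the cells
print(flat.columns.tolist())
```

That is clearly not a usable shape for time-series analysis.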
In another pandas project, I import vaccination data as JSON from OWID. That JSON is structured slightly differently.
Formatted with line breaks and indentation, it looks like this:
[
{
"country": "Afghanistan",
"iso_code": "AFG",
"data": [
{
"date": "2021-02-22",
"total_vaccinations": 0,
"people_vaccinated": 0,
"total_vaccinations_per_hundred": 0.0,
"people_vaccinated_per_hundred": 0.0
},
...
{
"date": "2021-12-30",
"total_vaccinations": 305216136,
"people_vaccinated": 193716017,
"people_fully_vaccinated": 125428809,
"total_boosters": 860635,
"daily_vaccinations_raw": 1947912,
"daily_vaccinations": 1267521,
"total_vaccinations_per_hundred": 22.22,
"people_vaccinated_per_hundred": 14.1,
"people_fully_vaccinated_per_hundred": 9.13,
"total_boosters_per_hundred": 0.06,
"daily_vaccinations_per_million": 923,
"daily_people_vaccinated": 688914,
"daily_people_vaccinated_per_hundred": 0.05
}
]
},
...
{
"country": "Zimbabwe",
"iso_code": "ZWE",
"data": [
{
"date": "2021-02-18",
"total_vaccinations": 39,
"people_vaccinated": 39,
"total_vaccinations_per_hundred": 0.0,
"people_vaccinated_per_hundred": 0.0
},
...
{
"date": "2021-12-29",
"total_vaccinations": 7234293,
"people_vaccinated": 4112517,
"people_fully_vaccinated": 3121776,
"daily_vaccinations_raw": 11952,
"daily_vaccinations": 9526,
"total_vaccinations_per_hundred": 47.93,
"people_vaccinated_per_hundred": 27.25,
"people_fully_vaccinated_per_hundred": 20.68,
"daily_vaccinations_per_million": 631,
"daily_people_vaccinated": 4796,
"daily_people_vaccinated_per_hundred": 0.032
}
]
}
]
I can use
url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.json"
r = requests.get(url)
j = r.json()
df = pd.json_normalize(
j,
record_path = ['data'],
errors = 'ignore',
meta = ['iso_code','country']
)
The resulting dataframe works as expected here.
With some minor edits of column names, I can run/get this:
df[(df['iso_code']=='DNK')][['total_pct','fully_pct','boost_pct']]
total_pct fully_pct boost_pct
date
2020-12-01 0.00 NaN NaN
2020-12-02 NaN NaN NaN
2020-12-03 NaN NaN NaN
2020-12-04 NaN NaN NaN
2020-12-05 NaN NaN NaN
... ... ... ...
2021-12-24 82.29 78.03 42.34
2021-12-25 82.30 78.03 42.65
2021-12-26 82.33 78.06 43.12
2021-12-27 82.44 78.26 44.70
2021-12-28 82.52 78.43 46.15
I can't figure out why the full dataset can't be normalized when the vaccination dataset can. Obviously, the structure is a bit different: the full dataset has a dict at the top level, while the vaccination dataset has a list at the top level.
Is that the explanation, and if so, how do I overcome it?
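One idea I have considered, but not yet verified against the full dataset, is to rebuild the top level as a list of records (mirroring the vaccination dataset's layout) before normalizing. A sketch on a hand-made stand-in, with key names copied from the snippets above:

```python
import pandas as pd

# Hand-made stand-in for the full dataset
sample = {
    "AFG": {"location": "Afghanistan",
            "data": [{"date": "2020-02-24", "new_cases_per_million": 0.126}]},
    "ZWE": {"location": "Zimbabwe",
            "data": [{"date": "2020-03-23", "new_cases_per_million": 0.0}]},
}

# Turn {iso_code: {...}} into [{'iso_code': ..., ...}, ...] so that
# record_path/meta work exactly as they do for the vaccination dataset
records = [{"iso_code": code, **body} for code, body in sample.items()]
df = pd.json_normalize(records, record_path="data",
                       meta=["iso_code", "location"], errors="ignore")
print(df)
```

Is this restructuring the right approach, or is there a way to pass the dict-keyed form to json_normalize directly?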