I am trying to import the full, up-to-date COVID-19 dataset in JSON format into a pandas dataframe. I believe it should be possible using json_normalize, but I can't make it work.
First, I initialize and import the json data:
import pandas as pd
import numpy as np
import requests

url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.json'
r = requests.get(url)
json = r.json()  # note: reusing the name `json` shadows the standard json module, which is not needed again here
The resulting object is a dict.
When I inspect it, I see a complicated, nested structure: at the top level there is one dict per location, keyed by "iso_code". Each of these dicts holds some overall statistics as key/value pairs plus a "data" key containing the time series as a list of dicts, one per date. I hope I got that right...
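To double-check that reading, I probed a hand-made stand-in for the payload (the key names are copied from the snippets in this post):

```python
# Stand-in for the downloaded payload; key names copied from the real data
sample = {
    "AFG": {
        "continent": "Asia",
        "location": "Afghanistan",
        "data": [{"date": "2020-02-24", "total_cases": 5.0}],
    }
}

top_keys = list(sample.keys())   # iso codes, e.g. ['AFG']
series = sample["AFG"]["data"]   # the time series: a list of per-date dicts
print(top_keys, type(series).__name__, series[0]["date"])
```

So each top-level value is a dict of scalars plus one list-valued "data" entry.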
The raw output is one long line; this is the very beginning of it:
{"AFG":{"continent":"Asia","location":"Afghanistan","population":39835428.0,"population_density":54.422,"median_age":18.6,"aged_65_older":2.581,"aged_70_older":1.337,"gdp_per_capita":1803.987,"cardiovasc_death_rate":597.029,"diabetes_prevalence":9.59,"handwashing_facilities":37.746,"hospital_beds_per_thousand":0.5,"life_expectancy":64.83,"human_development_index":0.511,"data":[{"date":"2020-02-24","total_cases":5.0,"new_cases":5.0,"total_cases_per_million":0.126,"new_cases_per_million":0.126,"stringency_index":8.33},{"date":"2020-02-25",
For readability, here is a snippet formatted with line breaks and indentation:
{
"AFG":{
"continent":"Asia",
"location":"Afghanistan",
"population":39835428.0,
"population_density":54.422,
"median_age":18.6,
"aged_65_older":2.581,
"aged_70_older":1.337,
"gdp_per_capita":1803.987,
"cardiovasc_death_rate":597.029,
"diabetes_prevalence":9.59,
"handwashing_facilities":37.746,
"hospital_beds_per_thousand":0.5,
"life_expectancy":64.83,
"human_development_index":0.511,
"data":[
{
"date":"2020-02-24",
"total_cases":5.0,
"new_cases":5.0,
"total_cases_per_million":0.126,
"new_cases_per_million":0.126,
"stringency_index":8.33
},
...
{
"date":"2021-12-27",
"total_cases":157967.0,
"new_cases":16.0,
"new_cases_smoothed":24.286,
"total_deaths":7354.0,
"new_deaths":0.0,
"new_deaths_smoothed":2.714,
"total_cases_per_million":3965.49,
"new_cases_per_million":0.402,
"new_cases_smoothed_per_million":0.61,
"total_deaths_per_million":184.61,
"new_deaths_per_million":0.0,
"new_deaths_smoothed_per_million":0.068
}
]
},
...
"ZWE":{
"continent":"Africa",
"location":"Zimbabwe",
"population":14862927.0,
"population_density":42.729,
"median_age":19.6,
"aged_65_older":2.822,
"aged_70_older":1.882,
"gdp_per_capita":1899.775,
"extreme_poverty":21.4,
"cardiovasc_death_rate":307.846,
"diabetes_prevalence":1.82,
"female_smokers":1.6,
"male_smokers":30.7,
"handwashing_facilities":36.791,
"hospital_beds_per_thousand":1.7,
"life_expectancy":61.49,
"human_development_index":0.571,
"data":[
{
"date":"2020-03-23",
"total_cases":3.0,
"new_cases":0.0,
"total_deaths":1.0,
"new_deaths":1.0,
"total_cases_per_million":0.202,
"new_cases_per_million":0.0,
"total_deaths_per_million":0.067,
"new_deaths_per_million":0.067,
"stringency_index":45.37
},
...
{
"date":"2021-12-28",
"total_cases":207548.0,
"new_cases":2099.0,
"new_cases_smoothed":1397.143,
"total_deaths":4940.0,
"new_deaths":32.0,
"new_deaths_smoothed":17.286,
"total_cases_per_million":13752.031,
"new_cases_per_million":139.079,
"new_cases_smoothed_per_million":92.574,
"total_deaths_per_million":327.322,
"new_deaths_per_million":2.12,
"new_deaths_smoothed_per_million":1.145,
"total_vaccinations":7222341.0,
"people_vaccinated":4107151.0,
"people_fully_vaccinated":3115190.0,
"new_vaccinations":3903.0,
"new_vaccinations_smoothed":10056.0,
"total_vaccinations_per_hundred":47.85,
"people_vaccinated_per_hundred":27.21,
"people_fully_vaccinated_per_hundred":20.64,
"new_vaccinations_smoothed_per_million":666.0,
"new_people_vaccinated_smoothed":5407.0,
"new_people_vaccinated_smoothed_per_hundred":0.036
}
]
}
}
My goal is in the first place to be able to access the time series for e.g. "new_cases_per_million" for each location. Further on, I could imagine to relate e.g. "total_cases_per_million" or "new_vaccinations_smoothed_per_million" to "gdp_per_capita" and so on.
As the actual time series are nested under the "data" key, I tried something like this:
data = pd.json_normalize(json,record_path=['data'],errors='ignore')
which gives the error:
KeyError Traceback (most recent call last)
<ipython-input-20-602e9c355c71> in <module>
----> 1 data = pd.json_normalize(json,record_path=['data'],errors='ignore')
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
334 records.extend(recs)
335
--> 336 _recursive_extract(data, record_path, {}, level=0)
337
338 result = DataFrame(records)
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
307 else:
308 for obj in data:
--> 309 recs = _pull_records(obj, path[0])
310 recs = [
311 nested_to_record(r, sep=sep, max_level=max_level)
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _pull_records(js, spec)
246 if has non iterable value.
247 """
--> 248 result = _pull_field(js, spec)
249
250 # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _pull_field(js, spec)
237 result = result[field]
238 else:
--> 239 result = result[spec]
240 return result
241
KeyError: 'data'
I can run the command without setting the record_path like this:
data = pd.json_normalize(json)
but this, of course, does not expand the nested data.
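To illustrate what I mean, here is a sketch on a miniature stand-in for the payload (key names taken from the snippets above): without record_path, the whole dict is treated as one record, nested dicts become dot-separated columns, and the "data" lists are left unexpanded.

```python
import pandas as pd

# Miniature stand-in for the real payload
sample = {
    "AFG": {"continent": "Asia",
            "data": [{"date": "2020-02-24", "new_cases": 5.0}]},
    "ZWE": {"continent": "Africa",
            "data": [{"date": "2020-03-23", "new_cases": 0.0}]},
}

flat = pd.json_normalize(sample)
# A single row with columns such as 'AFG.continent' and 'AFG.data';
# the 'data' time series stay as raw Python lists inside the cells
print(flat.columns.tolist())
```

That is clearly not a usable shape for time-series analysis.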
In another pandas project, I import vaccination data as JSON from OWID. That JSON is structured slightly differently.
Formatted with line breaks and indentation, it looks like this:
[
{
"country": "Afghanistan",
"iso_code": "AFG",
"data": [
{
"date": "2021-02-22",
"total_vaccinations": 0,
"people_vaccinated": 0,
"total_vaccinations_per_hundred": 0.0,
"people_vaccinated_per_hundred": 0.0
},
...
{
"date": "2021-12-30",
"total_vaccinations": 305216136,
"people_vaccinated": 193716017,
"people_fully_vaccinated": 125428809,
"total_boosters": 860635,
"daily_vaccinations_raw": 1947912,
"daily_vaccinations": 1267521,
"total_vaccinations_per_hundred": 22.22,
"people_vaccinated_per_hundred": 14.1,
"people_fully_vaccinated_per_hundred": 9.13,
"total_boosters_per_hundred": 0.06,
"daily_vaccinations_per_million": 923,
"daily_people_vaccinated": 688914,
"daily_people_vaccinated_per_hundred": 0.05
}
]
},
...
{
"country": "Zimbabwe",
"iso_code": "ZWE",
"data": [
{
"date": "2021-02-18",
"total_vaccinations": 39,
"people_vaccinated": 39,
"total_vaccinations_per_hundred": 0.0,
"people_vaccinated_per_hundred": 0.0
},
...
{
"date": "2021-12-29",
"total_vaccinations": 7234293,
"people_vaccinated": 4112517,
"people_fully_vaccinated": 3121776,
"daily_vaccinations_raw": 11952,
"daily_vaccinations": 9526,
"total_vaccinations_per_hundred": 47.93,
"people_vaccinated_per_hundred": 27.25,
"people_fully_vaccinated_per_hundred": 20.68,
"daily_vaccinations_per_million": 631,
"daily_people_vaccinated": 4796,
"daily_people_vaccinated_per_hundred": 0.032
}
]
}
]
I can use
url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.json"
r = requests.get(url)
j = r.json()
df = pd.json_normalize(
j,
record_path = ['data'],
errors = 'ignore',
meta = ['iso_code','country']
)
The resulting dataframe works as expected here.
With some minor edits of column names, I can run/get this:
df[(df['iso_code']=='DNK')][['total_pct','fully_pct','boost_pct']]
total_pct fully_pct boost_pct
date
2020-12-01 0.00 NaN NaN
2020-12-02 NaN NaN NaN
2020-12-03 NaN NaN NaN
2020-12-04 NaN NaN NaN
2020-12-05 NaN NaN NaN
... ... ... ...
2021-12-24 82.29 78.03 42.34
2021-12-25 82.30 78.03 42.65
2021-12-26 82.33 78.06 43.12
2021-12-27 82.44 78.26 44.70
2021-12-28 82.52 78.43 46.15
I can't figure out why the full dataset can't be normalized when the vaccination dataset can. Obviously, the structure is a bit different: the full dataset has a dict at the top level, while the vaccination dataset has a list at the top level.
Is that the explanation, and if so, how do I overcome it?
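One idea I have considered, but not yet verified against the full dataset, is to rebuild the top level as a list of records (mirroring the vaccination dataset's layout) before normalizing. A sketch on a hand-made stand-in, with key names copied from the snippets above:

```python
import pandas as pd

# Hand-made stand-in for the full dataset
sample = {
    "AFG": {"location": "Afghanistan",
            "data": [{"date": "2020-02-24", "new_cases_per_million": 0.126}]},
    "ZWE": {"location": "Zimbabwe",
            "data": [{"date": "2020-03-23", "new_cases_per_million": 0.0}]},
}

# Turn {iso_code: {...}} into [{'iso_code': ..., ...}, ...] so that
# record_path/meta work exactly as they do for the vaccination dataset
records = [{"iso_code": code, **body} for code, body in sample.items()]
df = pd.json_normalize(records, record_path="data",
                       meta=["iso_code", "location"], errors="ignore")
print(df)
```

Is this restructuring the right approach, or is there a way to pass the dict-keyed form to json_normalize directly?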