Python and the JSON dict

written by Henrik Blidh on 2016-10-15

I use the JSON format almost every day, favouring it over XML and other data representations every time. I love its simplicity when serializing, and I love that JSON can be represented by the dict data structure in Python and still retain its JSON feel.

Recently, I wrote some small scripts for parsing large XML files into dictionaries and eventually into JSON. I had heard from someone that parsing an entire XML document with the xml standard library methods leads to extreme memory overhead, and I wanted to explore this. It eventually led to the creation of the xmlr package, and while writing and profiling it I found several interesting aspects of memory usage in the XML/JSON/dict representations.

Consider the following JSON document (obtained from json.org):

doc = """
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
"""

If we load this into memory using the json package in Python, we get it represented as nested dictionaries. Let's do that and perform some memory size measurements:

import os
import sys
import json

as_dict = json.loads(doc)               # nested dicts
as_minified_json = json.dumps(as_dict)  # minified JSON string

# Size of the nested-dict representation in memory.
d_1 = calculate_document_size_in_memory(as_dict)
# Size of the minified JSON string in memory.
d_2 = sys.getsizeof(as_minified_json)
# Size of the same document written to disc as JSON.
d_3_tmp_file = '/tmp/d_3_file.json'
with open(d_3_tmp_file, 'w') as f:
    json.dump(as_dict, f)
d_3 = os.path.getsize(d_3_tmp_file)
os.remove(d_3_tmp_file)

print("Size in memory as dict:              {0:>6d} B".format(d_1))
print("Size in memory as json.dumps str:    {0:>6d} B".format(d_2))
print("Size on disc as json.dump in file:   {0:>6d} B".format(d_3))

The method calculate_document_size_in_memory is a small helper I have written that iterates over a nested dictionary of data parsed from JSON and estimates its collected size in memory.
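A minimal sketch of what such a helper can look like, recursing with sys.getsizeof over dicts, lists and leaf values (the exact implementation used for the numbers below is in the gist linked at the end):

import sys

def calculate_document_size_in_memory(obj):
    # Sum sys.getsizeof for the object itself and, recursively,
    # for all keys and values it contains.
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(
            calculate_document_size_in_memory(k) + calculate_document_size_in_memory(v)
            for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(calculate_document_size_in_memory(x) for x in obj)
    return size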

Running the code above yields the following output:

Size in memory as dict:                4180 B
Size in memory as json.dumps str:       422 B
Size on disc as json.dump in file:      385 B

The data as a dict takes ten times as much space in memory as its string representation! This is because each dict has an overhead of 280 bytes (run sys.getsizeof({}) in a Python terminal to see this; the number also differs a bit between Python 2 and 3, and the overhead is only about half as large on 32-bit installations), and since a JSON document is very often composed of many nested dictionaries with little content in each, these small sums add up quite rapidly.
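One can see this overhead directly by inspecting a few small objects (the exact numbers depend on the Python version and on 32- vs 64-bit builds):

import sys

# An empty dict reports 280 B on the 64-bit interpreter used for this post;
# the figures differ between Python versions and 32-/64-bit builds.
print(sys.getsizeof({}))      # overhead of an empty dict
print(sys.getsizeof(""))      # even an empty string carries a fixed overhead
print(sys.getsizeof("SGML"))  # every short key and value pays that cost

Let's look at a larger document!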

The U.S. copyright renewal records, available for download, provide an XML document that is ~370 MB in size. Using the xmlr package to parse the XML and running the following code:

import os
import sys
import json

from xmlr import xmlparse

filepath = '/home/hbldh/Downloads/google-renewals-all-20080624.xml'

doc = xmlparse(filepath)
as_minified_json = json.dumps(doc)

d_0 = os.path.getsize(filepath)
d_1 = calculate_document_size_in_memory(doc)
d_2 = sys.getsizeof(as_minified_json)
d_3_tmp_file = '/tmp/d_3_file.json'
with open(d_3_tmp_file, 'w') as f:
    json.dump(doc, f)
d_3 = os.path.getsize(d_3_tmp_file)
os.remove(d_3_tmp_file)

print("Size on disc as xml:                  {0:>10d} B".format(d_0))
print("Size in memory as dict:               {0:>10d} B".format(d_1))
print("Size in memory as json.dumps str:     {0:>10d} B".format(d_2))
print("Size on disc as json.dump in file :   {0:>10d} B".format(d_3))

We get the following output:

Size on disc as xml:                  389648553 B
Size in memory as dict:              2459272532 B
Size in memory as json.dumps str:     315007253 B
Size on disc as json.dump in file:    315007216 B

From 370 MB on disc as an XML document to 2.3 GB in memory, in very large part due to the overhead of the massive number of dicts! There is some overhead for each string as well, and given that all keys are strings and most values are too, it is advisable to be cautious when handling large JSON documents in their Python representation. Reading the XML document into memory using the xml.etree.ElementTree.parse method is also a bad idea; it used all of the available 6 GB of memory on my machine and, I believe, some of the swap as well, since the computer started lagging.

Keeping the document, or at least its subdocuments, as json.dumps strings until the data in them is actually needed might be a good idea when one needs to either minimize memory usage or maximize the size of the document that can be handled. It is a shame that a manageable amount of data (370 MB) should become potentially unmanageable on a moderately priced cloud instance.
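As a rough illustration of the idea, using the small glossary document and the size helper sketched above: a subtree kept as a serialized JSON string is a single object in memory, while the same subtree as nested dicts pays the per-dict overhead several times over.

import sys
import json

# Reuses `as_dict` and `calculate_document_size_in_memory` from the
# first example above.
subtree = as_dict["glossary"]["GlossDiv"]

size_as_nested_dicts = calculate_document_size_in_memory(subtree)
size_as_json_string = sys.getsizeof(json.dumps(subtree))

print(size_as_nested_dicts, size_as_json_string)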

Should I perhaps write a small abstraction layer that provides a dict-like object which keeps the data in JSON string form behind the scenes? Fun challenge!
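A very rough sketch of what such a wrapper could look like (the class name and behaviour here are illustrative assumptions, not an existing library): the steady-state footprint is just the serialized string, at the price of re-parsing on every access.

import json


class LazyJSONDict(object):
    """Hypothetical dict-like wrapper that stores its data as a JSON string
    and only deserializes when an item is actually accessed."""

    def __init__(self, json_str):
        self._raw = json_str

    def keys(self):
        return json.loads(self._raw).keys()

    def __getitem__(self, key):
        # Parse on demand; nested dicts are handed back re-serialized,
        # so they in turn stay as a string until they are needed.
        value = json.loads(self._raw)[key]
        if isinstance(value, dict):
            return LazyJSONDict(json.dumps(value))
        return value


# Usage with the small glossary document from the top of the post:
lazy = LazyJSONDict(doc)
print(lazy["glossary"]["title"])  # parses only on access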

The code used for this blog post can be found as a gist here.