Streaming Large JSON Files


2024-03-17

It has occurred to me multiple times that I need to read a JSON file, for whatever reason, and it’s simply too big to fit in memory. Since reading everything into memory at once sounds like a terrible idea, an intuitive alternative is to “stream” it through an iterator. This is exactly what the ijson package does.

Say we have a gigantic JSON file with the following structure:

{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city", "info": { ... }},
      {"name": "Thames", "type": "river", "info": { ... }},
      // ...
    ],
    "america": [
      {"name": "Texas", "type": "state", "info": { ... }},
      // ...
    ]
  }
}

With the help of ijson we can iterate through the entries under Europe like this:

import ijson

def stream_europe(filename):
    # Lazily yield each element of the array under earth -> europe
    with open(filename, 'r') as f:
        for item in ijson.items(f, 'earth.europe.item'):
            yield item
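
Wrapping the loop in a generator function keeps everything lazy: nothing is parsed until we start iterating. A quick usage sketch, where stream_europe is just the helper defined above and filename.json is assumed to hold the geography data:

for place in stream_europe('filename.json'):
    print(place['name'], place['type'])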

As the example above suggests, the ijson.items function takes two arguments: a file object and a prefix string. The prefix is a dot-separated sequence of JSON keys ending in the literal item, which matches the elements of an array; when the array sits at the top level of the document, the prefix is simply item. For example, say we have a simple list of smaller JSON objects inside a file:

[
  {"name": "Al", "age": 32},
  {"name": "Bob", "age": 25},
  {"name": "Candy", "age": 22},
  {"name": "Dylan", "age": 41},
  // ...
]

We can then use item as the prefix and stream the records one by one without consuming all the available memory:

def stream_people(filename):
    # The array is at the top level, so the prefix is just 'item'
    with open(filename, 'r') as f:
        for item in ijson.items(f, 'item'):
            yield item
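
If you are not sure which prefix to use, ijson also provides a lower-level ijson.parse function that yields (prefix, event, value) triples for every token in the file, which makes it easy to discover the right prefix interactively. A minimal sketch, again assuming the nested geography file from earlier:

import ijson

with open('filename.json', 'r') as f:
    for prefix, event, value in ijson.parse(f):
        print(prefix, event, value)

For the geography file this prints lines such as earth.europe.item.name string Paris, and the prefix column is exactly the kind of string that ijson.items expects.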