Streaming Large JSON Files
2024-03-17
It has occurred to me multiple times that I need to read a JSON file, for whatever reason, and it’s simply too big. Since reading everything into memory sounds like a terrible idea, an intuitive alternative is to “stream” it through an iterator. This is exactly what the ijson package does.
Say we have a gigantic JSON file with the following structure:
{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city", "info": { ... }},
      {"name": "Thames", "type": "river", "info": { ... }},
      // ...
    ],
    "america": [
      {"name": "Texas", "type": "state", "info": { ... }},
      // ...
    ]
  }
}
With the help of ijson, we can iterate through the entries under Europe like this:
import ijson

def stream_europe(filename):
    # Open in binary mode: ijson parses raw bytes
    with open(filename, 'rb') as f:
        for item in ijson.items(f, 'earth.europe.item'):
            yield item
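The function above is a generator, so nothing is parsed until we actually loop over it. A minimal usage sketch (stream_europe and the field names here just follow the example; adapt them to your own data):

for place in stream_europe('filename.json'):
    print(place['name'], place['type'])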
As the example above suggests, the ijson.items function takes two parameters: the first is the file object and the second is the prefix string. The prefix string is a dot-separated sequence of JSON keys ending with the leaf node item; when the top-level element is itself an array, it falls back to a simple item. For example, say we have a simple list of smaller JSONs inside a file:
[
  {"name": "Al", "age": 32},
  {"name": "Bob", "age": 25},
  {"name": "Candy", "age": 22},
  {"name": "Dylan", "age": 41},
  // ...
]
We can use item as the prefix and stream the records one by one without consuming all the available memory:
def stream_people(filename):
    # The top-level element is an array, so the prefix is just 'item'
    with open(filename, 'rb') as f:
        for item in ijson.items(f, 'item'):
            yield item
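As a closing tip: if you are ever unsure which prefix to pass, ijson.parse exposes the underlying event stream, where every event is tagged with the prefix it occurred at. A quick inspection sketch, assuming the same filename.json as above:

import ijson

with open('filename.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        print(prefix, event, value)

Run against the nested example at the top, this prints prefixes like earth, earth.europe and earth.europe.item, the last one being exactly the string we passed to ijson.items.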