

Cache a large array: JSON, serialize or var_export?

Monday 06 July 2009 10:30

When developing software such as our framework, sooner or later you will need to cache a large data array to a file. At that point you have to choose a caching method. In this article I compare three methods: JSON, serialization, and var_export() combined with include().

By Taco van den Broek

Too curious? Jump right to the results!

JSON

The JSON method uses the json_encode and json_decode functions. The JSON-encoded data is stored as-is in a plain text file.

Code example

// Store cache
file_put_contents($cachePath, json_encode($myDataArray));
// Retrieve cache (pass true to get associative arrays back instead of stdClass objects)
$myDataArray = json_decode(file_get_contents($cachePath), true);

pros

  • Pretty easy to read when encoded
  • Can easily be used outside a PHP application

cons

  • Only works with UTF-8 encoded data
  • Will not work with objects other than instances of the stdClass class.
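
Both limitations can be seen in a few lines. The sketch below uses a made-up array (not one of the benchmark data sets) and assumes PHP 5.5 or newer, where json_encode returns false for invalid UTF-8. It also shows that json_decode only gives associative arrays back when its second argument is true.

```php
<?php
// Hypothetical sample data, not taken from the benchmark.
$data = ['name' => 'cache', 'items' => [1, 2, 3]];
$encoded = json_encode($data);

// Without the second argument, associative arrays come back as stdClass objects.
$asObject = json_decode($encoded);
// With true, they come back as PHP arrays again.
$asArray = json_decode($encoded, true);

// A byte sequence that is not valid UTF-8 makes the encoding fail.
$invalid = json_encode(["\xB1\x31"]);
$utf8Error = (json_last_error() === JSON_ERROR_UTF8);
```

Checking the return value of json_encode before writing the cache file avoids storing an empty or broken cache.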

Serialization

The serialization method uses the serialize and unserialize functions. The serialized data is, just like the JSON data, stored as-is in a plain text file.

Code example

// Store cache
file_put_contents($cachePath, serialize($myDataArray));
// Retrieve cache
$myDataArray = unserialize(file_get_contents($cachePath));

pros

  • Does not need the data to be UTF-8 encoded
  • Works with instances of classes other than the stdClass class.

cons

  • Nearly impossible to read when encoded
  • Cannot be used outside a PHP application without writing custom functions
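
To illustrate the point about class instances, here is a minimal sketch with a hypothetical CacheEntry class. The second unserialize call uses the allowed_classes option, which only exists in PHP 7.0 and later; it is worth considering when the cache file is not fully trusted.

```php
<?php
// Hypothetical class, used only for illustration.
class CacheEntry
{
    public $key;
    public $value;

    public function __construct($key, $value)
    {
        $this->key = $key;
        $this->value = $value;
    }
}

$encoded = serialize([new CacheEntry('answer', 42)]);

// Objects are restored as instances of their original class.
$restored = unserialize($encoded);

// PHP 7.0+ only: refuse to instantiate any class.
// Objects then come back as __PHP_Incomplete_Class instances.
$safe = unserialize($encoded, ['allowed_classes' => false]);
```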

Var_export

This method 'encodes' the data using var_export and loads the data using the include statement (no need for file_get_contents!). The encoded data needs to be in a valid PHP file so we wrap the encoded data in the following PHP code:

<?php
return /*var_export output goes here*/;

Code example

// Store cache
file_put_contents($cachePath, "<?php\nreturn " . var_export($myDataArray, true) . ";");
// Retrieve cache
$myDataArray = include($cachePath);

pros

  • No need for UTF-8 encoding
  • Is very readable (assuming you can read PHP code)
  • Retrieving the cache uses one language construct instead of two functions
  • When using an opcode cache your cache file will be stored in the opcode cache. (This is actually a disadvantage, see the cons list).

cons

  • Needs PHP wrapper code.
  • Cannot restore objects of classes that lack a __set_state method: var_export writes a ClassName::__set_state() call, which fails when the file is included.
  • When using an opcode cache your cache file will be stored in the opcode cache. If you do not need a persistent cache this is useless: most opcode caches support storing values in shared memory directly. If you don't mind keeping the cache in memory, use the shared memory without writing the cache to disk first.
  • Another disadvantage is that your stored file has to be valid PHP. If it contains a parse error (which could happen when your script crashes while writing the cache) your application will not work anymore.
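
One way to reduce the risk of a half-written cache file is to write to a temporary file and rename it into place. The sketch below is not part of the benchmark; writeCacheAtomically is a hypothetical helper, and it assumes the temporary file and the cache file live on the same filesystem, so that rename() can replace the target in one step.

```php
<?php
// Hypothetical helper: write the cache to a temporary file first, then rename it
// over the real cache file, so readers never include a half-written file.
function writeCacheAtomically($cachePath, array $data)
{
    $tmpPath = tempnam(dirname($cachePath), 'cache');
    if ($tmpPath === false) {
        return false;
    }
    $code = "<?php\nreturn " . var_export($data, true) . ";\n";
    if (file_put_contents($tmpPath, $code) !== strlen($code)) {
        unlink($tmpPath);
        return false;
    }
    // Assumption: both paths are on the same filesystem (typical on POSIX systems).
    return rename($tmpPath, $cachePath);
}

$cachePath = sys_get_temp_dir() . '/example-cache.php';
writeCacheAtomically($cachePath, ['a' => 1, 'b' => [true, null]]);
$loaded = include $cachePath;
```

If the script crashes during file_put_contents, only the temporary file is affected and the previous cache file stays valid.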

Benchmark

In my benchmark I used 5 data sets of different sizes (measured in memory usage): 904B, ~18kB, ~250kB, ~4.5MB and ~72.5MB. For each of these data sets I ran the following routine for each encoding method:

  1. Encode the data 10 times
  2. Calculate the string length of the encoded data
  3. Decode the encoded data 10 times
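
The routine above could be sketched roughly as follows for the serialize method. This is an illustration, not the original benchmark script: timeRounds is a hypothetical helper and the sample array is made up, since the actual data sets are not included in the article.

```php
<?php
// Hypothetical helper: call $fn $rounds times and return the total time in seconds.
function timeRounds(callable $fn, $rounds = 10)
{
    $start = microtime(true);
    for ($i = 0; $i < $rounds; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Made-up sample data; the original data sets are not published in the article.
$data = array_fill(0, 1000, ['id' => 1, 'label' => 'example']);

// Step 1: encode the data 10 times.
$encodeTime = timeRounds(function () use ($data) {
    serialize($data);
});

// Step 2: calculate the string length of the encoded data.
$encoded = serialize($data);
$length = strlen($encoded);

// Step 3: decode the encoded data 10 times.
$decodeTime = timeRounds(function () use ($encoded) {
    unserialize($encoded);
});

printf("Length %d, encoding %.6f s, decoding %.6f s\n", $length, $encodeTime, $decodeTime);
```

Swapping serialize and unserialize for the JSON or var_export equivalents gives the other two columns.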

Results

Yay, results! In the result tables you see the length of the encoded string, the total time used for encoding and the total time used for decoding (in seconds). The benchmark was done on my laptop: 2.53GHz, 4GB, Ubuntu Linux, PHP 5.3.0RC4.

904 B array
          JSON                Serialization        var_export / include
Length    105                 150                  151
Encoding  0.0000660419464111  0.00004696846008301  0.00014996528625488
Decoding  0.0011160373687744  0.00092697143554688  0.0010221004486084

18.07 kB array
          JSON                Serialization        var_export / include
Length    1965                2790                 3103
Encoding  0.0005040168762207  0.00035905838012695  0.001352071762085
Decoding  0.0017290115356445  0.0011298656463623   0.0056741237640381

290.59 kB array
          JSON                Serialization        var_export / include
Length    31725               45030                58015
Encoding  0.0076849460601807  0.0057480335235596   0.02099609375
Decoding  0.014955997467041   0.010177850723267    0.030472993850708

4.54 MB array
          JSON                Serialization        var_export / include
Length    507885              720870               1059487
Encoding  0.13873195648193    0.11841702461243     0.38376498222351
Decoding  0.29870986938477    0.21590781211853     0.53850317001343

72.67 MB array
          JSON                Serialization        var_export / include
Length    8126445             11534310             19049119
Encoding  2.3055040836334     2.7609040737152      6.2211949825287
Decoding  4.5191099643707     8.351490020752       8.7873070240021

We ran the same benchmark on eight other machines, including Windows and Mac OS machines and some web servers running Debian. Some of these machines had PHP 5.2.9 installed, others had already switched to 5.3.0. All showed the same (relative) results, except for a MacBook on which serialize was faster at encoding the largest data set.

Conclusion

As you can see the var_export (without opcode cache!) method doesn't come out that well and serialize seems to be the overall winner. What bothered me though was the largest dataset in which JSON became faster than serialize. Wondering whether this was a glitch or a trend I fired up my OpenOffice spreadsheet and created some charts:

The charts show the relative speed of each method compared to the fastest method (so 100% is the best a method can do). As you can see both JSON and var_export become relatively faster when the data set gets big (arrays of 70MB and bigger? Maybe you should reconsider the structure of your data set :)). So when using a sane sized data array: use serialize. When you want to go crazy with large data sets: use anything you like, disk i/o will become your bottleneck.
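
As a small worked example of the relative speed described above, the sketch below takes the encoding times of the 18.07 kB array from the results table. The formula (fastest time divided by a method's own time, times 100) is my assumption about how such a percentage would be computed; the article does not spell it out.

```php
<?php
// Encoding times in seconds for the 18.07 kB array, copied from the results table.
$times = [
    'JSON' => 0.0005040168762207,
    'Serialization' => 0.00035905838012695,
    'var_export / include' => 0.001352071762085,
];

// Assumed formula: relative speed = fastest time / own time * 100.
$fastest = min($times);
$relative = [];
foreach ($times as $method => $time) {
    $relative[$method] = $fastest / $time * 100;
}

foreach ($relative as $method => $percent) {
    printf("%-22s %5.1f%%\n", $method, $percent);
}
```

With these numbers serialization scores 100%, JSON roughly 71% and var_export roughly 27%.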

Reactions to "Cache a large array: JSON, serialize or var_export?"

garfix
Placed on: 07-09-2009 16:30
Patrick van Bergen
to be continuum
Good job, Taco. Wish php.net had these kinds of stats.
Geert
Placed on: 08-04-2009 10:16
Very useful benchmarks. Thanks.
Ries van Twisk
Placed on: 08-13-2009 04:52
Do you happen to have any results where you have used an opcode cache?

I can only imagine that with an opcode cache the var_export method is faster. Purely theoretically, this would mean that with an include the data is 'there' and shouldn't have to be parsed anymore.

Ries
Peter Farkas
Placed on: 09-29-2009 16:56
This style is the one I like so much!
Thank you!
Brilliant work!
Vasilis
Placed on: 01-15-2010 10:27
Great info man... I like benchmarks! Thank you
Nice work!
Placed on: 03-24-2010 17:35
Thanks a million - refreshing to see solid content.

Concise and well documented, perfect.
Frank Denis
Placed on: 05-13-2010 20:47
If speed and size matters, igbinary beats all of these hands down: http://opensource.dynamoid.com/
cws1989
Placed on: 08-25-2010 05:13
Thanks for the pros and cons of the different methods and the benchmark, it helps very much. But you've only mentioned the size of the data set; may I ask about the complexity of the tested data (e.g. how many elements are in the array, how deep is it)? Is it proportional to the size of the data set? Thanks!
GDR!
Placed on: 01-18-2011 13:56
Dude, benchmarks should be done in loops. 0.00004696846008301s is a number that's very close to measurement error. The proper way to do this test would be to run unserialize and friends 100000 times in a loop, measure the time of the whole execution and then divide by 100000.

You can see what I'm talking about if you run your benchmark 10 times and probably will get very differing results every time. With looped benchmark it wouldn't happen.
Taco
Placed on: 01-30-2011 10:26
@GDR! Hmm, looks like I didn't mention the number of loops in this test. I don't remember the exact number of loops I used, but it would probably have been around 1000.

So what I did was something like this:

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    for ($j = 0; $j < 10; $j++) {
        $encoded = encode_method($data);
    }
    $len = strlen($encoded);
    for ($j = 0; $j < 10; $j++) {
        $decoded = decode_method($encoded);
    }
}
$end = microtime(true);
$avgDuration = ($end - $start) / 1000;
cachecache
Placed on: 07-03-2011 02:25
If you have to cache a big array ( let's say > 1 Mb serialized), the unserialize process time will grow exponentially.
If possible you can use this hint : serialize each record. The process of unserialization for all records (the entire array) will be way faster...
Mateusz Krzeszowiak
Placed on: 07-26-2011 23:20
Hi! I've read somewhere that fopen is the fastest way to write files and file_get_contents the fastest way to read them. Can you make a test that includes these functions for the serialization methods? :)
