Monday, September 7, 2009

PHP Serialization And Recursion Demystified

Introduction

PHP has different in-core callbacks able to help us with daily deployment, debug, improvements. At the same time, PHP is loads of intrinsic "gotcha", too often hard to understand, hard to explain, or simply hard to manage. One common problem is about debug, caching, or freezing, and the way we would like to debug, cache, or freeze, variables.
For freezing, I mean those procedures able to regenerate a stored variable and its status, in order to reuse that variable, to understand what happened in that moment with that variable, or just to speed up expensive tasks already completed.


The Problem

One of the most common procedures to freeze variables is their serialization, performed in core via a well known serialize function.
Please consider this example:

$person = new Employee('Mr. Lucky Me');
// ... do some useful task
myCompanyFreezer($person);

// the myCompanyFreezer function

function myCompanyFreezer(Employee $p){

$company = Company::getInstance();
// note that this company has exclusive
// control over the employee work (reference)
$company->employees[] = &$p;

// on the other hand employee
// has finally a company to work with
// but no control over the company
$p->company = $company;

// update and freeze the employee status
$company->add(serialize($p));
}

So, while company has an exclusive contract, and each employee is totally under company control, the employee has nothing to do with company decisions, but it can proudly say: "Look at me, I work for Company::getInstance()!".
But being serialization recursive, we will find the company instance present as employee "company" property.
The problem is that the company instance has an "employees" property which contain one or more employees, included the employee Mr. Lucky Me.
And so on and on until infinite recursions, a massive waste of resources and ... ALT!, serialize is clever enough to understand when there are recursions and rather than going on with nested serializations it simply put a reference to the serialized object.
Got headache already?

Two Different Kind Of Recursions

Being serialize main purpose to freeze a variable status, and being PHP still a bit hybrid about references and shadow copies, serialize could produce two kinds of pointer: r and R.
The lowercase "r" will be a recursion, while the uppercase "R" will be a recursion by reference.

// serialized recursion - the ABC
$o = new stdClass;

// recursion
$o->normal = $o;

// recursion by reference
$o->reference = &$o;

echo serialize($o);
// O:8:"stdClass":2:{s:6:"normal";r:1;s:9:"reference";R:1;}

We should focus into r:1; and R:1;.
While the "r", or the "R", means there is a recursion, the unsigned integer indicates the exact object that "caused" that recursion.
When we perform an unserialize operation, the parser cannot obviously de-serialize as we read, because if we have an instance or an array, internal values should be ready to be assigned already "unserialized".
This simply means that the number after the R is not sequential, and there is no relation with the length of the string, but only a relation with de-serialization process.

What Is WrongWith Serialize

First of all, PHP serialization is not human readable as JSON, as example, or an XML is.
If we use this format to debug our application we'll definitively need an extra layer able "to introduce" us the object as is. In few words, what we need is something that is not serialized.
Moreover, serialize and unserialize would like to be as much reliable as possible, and for these reasons these functions are 3 times slower than json_encode or json_decode.
The truth is that JSON, as is, cannot compete with serialize and unserialize, due to protocol simplicity which is unable to store class names, lambdas, or public, protected, and private instances properties.
Last, but not least, JSON PHP parsers are a bit ambiguous, because an array not fully populated is usually converted into an object:

define('MY_WELCOME_STRING', 1);
$a = array();
$a[MY_WELCOME_STRING] = 'Hello World';

echo json_encode($a);
//{"1":"Hello World"}

// in JavaScript would have been
// [null,"Hello World"]
// where square brackets mean Array, and not Object

So again, another serializer is not worth it to freeze variables, what's left for us?

var_export

var_export() gets structured information about the given variable. It is similar to var_dump() with one exception: the returned representation is valid PHP code.

EUREKA! There is a core level function which aims is to serialize PHP into valid PHP, how can we ask something more efficient? I mean: "native performances to serialize and native performances to have back, it must be the solution"!
It's not!

$o = new stdClass;
$o->normal = $o;

echo var_export($o);

Fatal error: Nesting level too deep - recursive dependency?

Nice one! From bogus 39116 and Derick reply:
We can't change this by adding some text when this happens, as that
would not result in valid PHP code in that case (which is the purpose of
this function
).

Let me summarize:
  1. serialize/unserialize understand recursions almost without problems but unserialize is slow and both are PHP dedicated

  2. json_encode is not compatible with recursion, and as general purpose PHP serializer, it looses too much PHP information
  3. var_export would be perfect but in PHP we cannot manually represent a recursion that will be valid and correctly parsed
  4. var_dump is magic but its produced output is not reliable, *RECURSION* won't be recognized as valid PHP value
  5. I had already headache at line 10 of this post, and now I am still here to see there are no solutions?


How To Remove Recursion Without Loosing It

Well, solutions are different, but performances speaking, we do not have too many chances. A first solution could be a maximum nested level limit, where an object cannot serialize its properties "forever" and after N times it has to stop!
This technique has more cons than pros, and reasons are these:
  • it could require a manual parser, slower, and due to the problem nature, not that simple to maintain or debug
  • it could be extremely redundant, causing a lot of wasted resources, due to its artificial stupidity, since a recursion should never be serialized, being indeed a recursion, and in this way a waste of time, references, and resources
  • as mentioned 5 words ago, in this way we are loosing the recursion, so we should stop saying we are serializing ...
Accordingly, there is only another chance to perform this task: understand recursions, and remove them without loosing their meaning.

$o = new stdClass;
$o->n = $o;
$o->r = &$o;

echo serialize($o), '
',
serialize(
remove_recursion($o)
)
;

Produced output:

O:8:"stdClass":2:{s:1:"n";r:1;s:1:"r";R:1;}
O:8:"stdClass":2:{s:1:"n";s:12:"?recursion_1";s:1:"r";s:12:"?Recursion_1";}

Et voilĂ ! problem solved! ... but what is that?
The remove_recursion function has been introduced in latest Formaldehyde Project Version 1.05, and its purpose is to make debuggable any kind of trace, backtrace, or logged information.
The resulting var_export will be something like this:

stdClass::__set_state(array(
'n' => '' . "\0" . 'recursion_1',
'r' => '' . "\0" . 'Recursion_1',
))

The chosen form to store a recursion is exactly the same used by PHP for lambdas

echo serialize(create_function('',''));
//s:10:"?lambda_1";

In PHP a lambda is stored as "protected" string, and the number at the end of the string "lambda_" indicates its reference. Until we restart our webserver, lambda functions will persist in the entire PHP context, that is why it is possible to serialize lambda functions and unserialize them, as long as the environment does not change, or restart.
The additional difference between "r" and "R" in case of recursion is necessary to avoid info about references.
On the other hand, recursions are truly useless to debug or export variables, but they can always be present.
PHP will not understand my chosen syntax, but only and if necessary, we can always use a function like this to recreate correct recursions:

function recreate_recursion($o){
return unserialize(
preg_replace(
'#s:[0-9]+:"\x00(r|R)ecursion_([0-9]+)";#',
'\1:\2;',
serialize($o)
)
);
}


Pros

  1. we can finally forget every kind of recursion problem, letting PHP understand them via serialize, without doing anything
  2. performances and produced size will be better than every other nested based parser, thanks to a simple parser which ... surprise!!! ... it does not use recursion at all!
  3. once we pass a variable through formaldehyde_remove_recursion we can transform that kind of variable in whatever format, included var_export, JSON and XML, forgetting recursions headaches


Cons

  1. being based over serialize and unserialize, the transformation could implicitly call, if present, both __sleep and __wakeup events, it's gonna happen in any case if we use serialize/unserialize, but if we serialize a transformed variable __sleep will be called twice
  2. it could require extra effort to regenerate internal recursions, in any case it is better than loose them forever as most of us have done 'till now
  3. the convertion is assumining that a serialized string will not contain an exact match, such a manual string. This is actually the same assumption PHP developers did about serialized lambdas.


Conclusion

With a lightweight function, and after this post, I hope we can better understand recursion problems, and relative serializations. My suggestion is to give Formaldehyde a try, but as long as the Mit Style License is respected, you can extract its internal formaldehyde_remove_recursion.

Any question? :)

No comments:

Post a Comment