SitePoint PHP blog

How to Expose PHP’s Private Parts
I’ve been tinkering with dumping PHP objects, and have found myself constantly running into a brick wall. The output from print_r and friends is fine in some contexts, but for larger structures, it would be nice to tidy the output up a bit and wrap it in some HTML.
However, these functions are privileged in that they can access private and protected variables. This can’t be done in plain PHP (save through some exotic extensions, which I’d rather not rely on). It seems that 5.3.0 may introduce this functionality in the Reflection API, but nobody knows when 5.3.0 will be out.
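For comparison, the Reflection route did eventually appear: ReflectionProperty::setAccessible() was added in PHP 5.3.0. The following is only a minimal sketch of reading hidden properties that way; the Account class and the readHiddenProperties() helper are made up for illustration.

```php
<?php
// Hypothetical example class; not part of the original article.
class Account
{
    public $name = 'savings';
    protected $owner = 'alice';
    private $balance = 42;
}

// Hypothetical helper: returns the object's properties keyed by name,
// regardless of visibility. Private properties of parent classes are
// not included by getProperties().
function readHiddenProperties($object)
{
    $result = array();
    $reflection = new ReflectionObject($object);
    foreach ($reflection->getProperties() as $property) {
        // Before PHP 8.1 non-public properties must be unlocked explicitly;
        // from 8.1 on they are readable through reflection by default.
        if (PHP_VERSION_ID < 80100) {
            $property->setAccessible(true);
        }
        $result[$property->getName()] = $property->getValue($object);
    }
    return $result;
}

$properties = readHiddenProperties(new Account());
print_r($properties);
```

This needs a PHP version that ships the feature, which is exactly what the serialization approach below avoids.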
So I realised that it is possible to serialize any object to a string, which will include private and protected variables alike. All I had to do then was parse the serialized string, and I’d gain that arcane insight. One would perhaps assume that such a parser already exists. Maybe it does, but I couldn’t find it (I did find a Java implementation though). The format is fairly simple, so I spent half a Sunday afternoon on it. One very nice bonus, compared to print_r, is that the format handles recursion, which is very handy for dumping the output in a presentable form.
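To make the format concrete, here is a small sketch of what serialize() emits for hidden properties and for a self-referencing object. The Point class is hypothetical and exists only for this illustration; NUL bytes are replaced with a visible \0 so the strings can be printed.

```php
<?php
// Hypothetical class used only to show the serialized format.
class Point
{
    public $x = 1;
    protected $y = 2;
    private $z = 3;
}

// Protected names are prefixed with NUL * NUL, private names with
// NUL ClassName NUL. Make the NUL bytes visible before printing.
$printable = str_replace("\0", '\0', serialize(new Point()));
echo $printable, "\n";
// O:5:"Point":3:{s:1:"x";i:1;s:4:"\0*\0y";i:2;s:8:"\0Point\0z";i:3;}

// An object that contains itself is written as a back-reference (r:<id>)
// instead of being expanded endlessly.
$node = new stdClass();
$node->self = $node;
$recursive = serialize($node);
echo $recursive, "\n";
// O:8:"stdClass":1:{s:4:"self";r:1;}
```

The three key prefixes and the r: marker are what a parser has to recognise to recover visibility and recursion.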
So, without further ado, here’s the code:
    <?php
    /**
     * Exports variable information, including private/protected variables and with recursion-protection.
     * Since this is built upon PHP serialization functionality, unserializable objects may cause trouble.
     */
    class XrayVision {
        protected $id;

        function export($object) {
            $this->id = 1;
            list($value, $input) = $this->parse(serialize($object));
            return $value;
        }

        // Parses one value and returns array(parsed value, remaining input).
        protected function parse($input) {
            if (substr($input, 0, 2) === 'N;') {
                return array(array('type' => 'null', 'id' => $this->id++, 'value' => null), substr($input, 2));
            }
            $pos = strpos($input, ':');
            $type = substr($input, 0, $pos);
            $input = substr($input, $pos + 1);
            switch ($type) {
                case 's': return $this->s($input);
                case 'i': return $this->i($input);
                case 'd': return $this->d($input);
                case 'b': return $this->b($input);
                case 'O': return $this->o($input);
                case 'a': return $this->a($input);
                case 'r': return $this->r($input);
            }
            throw new Exception("Unhandled type '$type'");
        }

        // s:<length>:"<bytes>";
        protected function s($input) {
            $pos = strpos($input, ':');
            $length = substr($input, 0, $pos);
            $input = substr($input, $pos + 1);
            $value = substr($input, 1, $length);
            return array(
                array('type' => 'string', 'id' => $this->id++, 'value' => $value),
                substr($input, $length + 3));
        }

        // i:<digits>;
        protected function i($input) {
            $pos = strpos($input, ';');
            $value = (integer) substr($input, 0, $pos);
            return array(
                array('type' => 'integer', 'id' => $this->id++, 'value' => $value),
                substr($input, $pos + 1));
        }

        // d:<number>;
        protected function d($input) {
            $pos = strpos($input, ';');
            $value = (float) substr($input, 0, $pos);
            return array(
                array('type' => 'float', 'id' => $this->id++, 'value' => $value),
                substr($input, $pos + 1));
        }

        // b:<0 or 1>;
        protected function b($input) {
            $pos = strpos($input, ';');
            $value = substr($input, 0, $pos) === '1';
            return array(
                array('type' => 'boolean', 'id' => $this->id++, 'value' => $value),
                substr($input, $pos + 1));
        }

        // r:<id>; points back at a value that was already serialized
        protected function r($input) {
            $pos = strpos($input, ';');
            $value = (integer) substr($input, 0, $pos);
            return array(
                array('type' => 'recursion', 'id' => $this->id++, 'value' => $value),
                substr($input, $pos + 1));
        }

        // O:<name length>:"<class>":<property count>:{<key><value>...}
        protected function o($input) {
            $id = $this->id++;
            $pos = strpos($input, ':');
            $name_length = substr($input, 0, $pos);
            $input = substr($input, $pos + 1);
            $name = substr($input, 1, $name_length);
            $input = substr($input, $name_length + 3);
            $pos = strpos($input, ':');
            $length = (int) substr($input, 0, $pos);
            $input = substr($input, $pos + 2);
            $values = array();
            for ($ii=0; $ii < $length; $ii++) {
                list($key, $input) = $this->parse($input);
                // keys are not numbered by serialize, so undo the increment
                $this->id--;
                list($value, $input) = $this->parse($input);
                if (substr($key['value'], 0, 3) === "\000*\000") {
                    $values['protected:' . substr($key['value'], 3)] = $value;
                } elseif ($pos = strrpos($key['value'], "\000")) {
                    $values['private:' . substr($key['value'], $pos + 1)] = $value;
                } else {
                    $values[str_replace("\000", ':', $key['value'])] = $value;
                }
            }
            return array(
                array('type' => 'object', 'id' => $id, 'class' => $name, 'value' => $values),
                substr($input, 1));
        }

        // a:<count>:{<key><value>...}
        protected function a($input) {
            $id = $this->id++;
            $pos = strpos($input, ':');
            $length = (int) substr($input, 0, $pos);
            $input = substr($input, $pos + 2);
            $values = array();
            for ($ii=0; $ii < $length; $ii++) {
                list($key, $input) = $this->parse($input);
                // keys are not numbered by serialize, so undo the increment
                $this->id--;
                list($value, $input) = $this->parse($input);
                $values[$key['value']] = $value;
            }
            return array(
                array('type' => 'array', 'id' => $id, 'value' => $values),
                substr($input, 1));
        }
    }
    ?>

There are at least two known issues with this technique. The first is that resources are serialized into integers. For a dump, this doesn’t really matter, since a resource is meaningless outside the running process. The other problem is with objects that implement __sleep. Since this function may have side effects, you can potentially mess up your program for objects that use this feature. In my experience, it’s a seldom-used feature anyway, so it doesn’t really bother me that much.
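The __sleep caveat is easy to underestimate, so here is a small sketch of how it can bite. The Connection class is hypothetical; the point is only that serialize() runs __sleep() on the live object, so dumping it this way is not a read-only operation.

```php
<?php
// Hypothetical class whose __sleep() has a side effect.
class Connection
{
    public $open = true;
    private $dsn = 'example';

    public function __sleep()
    {
        // Side effect: the live object is changed as part of serialization.
        $this->open = false;
        // Only the listed properties end up in the serialized string.
        return array('dsn');
    }
}

$connection = new Connection();
$serialized = serialize($connection);

var_dump($connection->open);                     // bool(false)
var_dump(strpos($serialized, 'open') === false); // bool(true)
```

Besides changing the object, __sleep() also hides the properties it doesn't list, so the dump can be incomplete for such classes.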
DOM vs. Template
Fredrik Holmström recently posted a small template engine, based on DOM manipulation. While there are certainly a lot of template engines around, I find this approach interesting. The concept is simple enough: the template is parsed into an object model (DOM), and values can then be assigned to its nodes through PHP code. The main difference from traditional template engines (such as Smarty) is that the template itself doesn’t contain any imperatives. In fact, the template doesn’t even have to be written for the template engine to be used; any markup can serve as a source.
Since the template can’t contain any view-logic, that logic ends up in a separate place (in PHP code). This makes the separation between presentation and logic airtight, which was the main idea of template engines in the first place. Another benefit is that since there is no string-level manipulation, it is virtually impossible to inadvertently introduce injection-type security holes.
The template may be unaware of the view-logic, but the opposite can’t be said. To bind values to the template, the view-logic needs to be aware of the internal structure of the template. This means that if the template changes, so must the view-logic. To decouple this dependency, we need some kind of abstraction.
Luckily, there is a very convenient mechanism for that: element ids can be used to address central nodes in the markup. They do, however, have the rather annoying limitation (for this use) that they must be unique within the document. A better candidate, then, is to use classes (the HTML attribute; I’m not talking about PHP classes) to address elements.
The really nice thing about using classes is that they are very unobtrusive in the markup. One will have to add classes, but since they go on central elements in the markup, they are prime candidates for reuse as anchor points for CSS rules and for JavaScript code. Instead of being superfluous markers in the HTML code, they actively help to write better markup.
That sounds good in theory, so to see how it holds up in reality, I knocked together a small prototype. Even with a very limited API, it is remarkably expressive:
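To illustrate addressing elements by class, here is a rough sketch built on PHP's DOM extension. It is not the prototype's implementation; the bindByClass() helper is hypothetical, and it assumes $class is a plain class name without quotes or spaces.

```php
<?php
// Hypothetical helper, not the prototype's API: sets the text of every
// element carrying the given class and returns the resulting markup.
function bindByClass($html, $class, $text)
{
    $doc = new DOMDocument();
    // Wrap the fragment so it has a single, known root element.
    $doc->loadHTML('<div>' . $html . '</div>');
    $root = $doc->getElementsByTagName('div')->item(0);

    // Match the class as a whole word in a space-separated class attribute.
    $xpath = new DOMXPath($doc);
    $query = sprintf(
        './/*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    foreach ($xpath->query($query, $root) as $node) {
        while ($node->firstChild) {
            $node->removeChild($node->firstChild);
        }
        // A text node is escaped on output, so markup in $text stays inert.
        $node->appendChild($doc->createTextNode($text));
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $output = '';
    foreach ($root->childNodes as $child) {
        $output .= $doc->saveHTML($child);
    }
    return $output;
}

echo bindByClass('<p class="hello"></p>', 'hello', 'Hello World'), "\n";
// <p class="hello">Hello World</p>
```

Because the value is attached as a text node rather than concatenated into a string, the escaping happens in the serializer, which is where the injection-safety comes from.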
Simple variable binding

    $t = new Domling('<p class="hello"></p>');
    $t->capture('hello')->bind("Hello World");
    echo $t->render();

Output:

    <p class="hello">Hello World</p>

Switching a block out

    $t = new Domling('<p>Lorem Ipsum</p><p class="message">Hidden message</p>');
    $t->capture('message');
    echo $t->render();

Output:

    <p>Lorem Ipsum</p>

Putting it back in

    $t = new Domling('<p>Lorem Ipsum</p><p class="message">Hidden message</p>');
    $block = $t->capture('message');
    $block->bind();
    echo $t->render();

Output:

    <p>Lorem Ipsum</p>
    <p class="message">Hidden message</p>

And looping over a block

    $t = new Domling('<ul class="links"><li class="link"><a class="anchor" href="#">title</a></li></ul>');
    $links = array(
        'Sitepoint' => 'http://www.sitepoint.com',
        'Example' => 'http://www.example.org?foo=bar&ding=dong');
    foreach ($links as $title => $link) {
        $t->sequence('link', 'links')->bind(array('anchor:href' => $link, 'anchor' => $title));
    }
    echo $t->render();

Output:

    <ul class="links">
    <li class="link"><a class="anchor" href="http://www.sitepoint.com">Sitepoint</a></li>
    <li class="link"><a class="anchor" href="http://www.example.org?foo=bar&ding=dong">Example</a></li>
    </ul>

If you’re curious, you can get the full source code for the above examples from here: http://php.pastebin.com/f76ba8d70
But please bear in mind that this is just a proof of concept; there are probably a few quirks that should be ironed out before this could be used in production.
Character Encoding: Issues with Cultural Integration
I’ve run into a classic problem with charsets, in an application I’m currently working on. As is the standard for PHP, all strings are treated as latin1, but we now need to allow a wider range of charsets in a few places.
The gold standard solution is to convert everything to utf-8. Since utf-8 covers the entire unicode range, it is capable of representing any character that latin1 can. Unfortunately, that’s a lot easier to do from the outset, than with a big, running application. And even then, there may be third party code and extensions, which assume latin1. I’d much rather continue with latin1 being the default, and only jump through hoops at the few places where I actually need full utf-8 capacity.
So after some thinking, another solution dawned on me. To be fair, hack is probably more descriptive than solution, but nonetheless. The idea goes as follows:
- Use latin1, but serve pages in utf-8, encoding it at output.
- Embed utf-8 strings within latin1, and somehow don’t encode it (But still encode everything else).
Simple, eh?
Latin1 on the inside, utf-8 on the outside.

When rendering HTML pages, it is trivial to capture the output with an output buffer and pipe it through utf8_encode. The page is thus served in utf-8, even though everything internally is latin1. There’s not much gain in that by itself, since it still restricts us to the range of characters covered by latin1.
We are actually already doing this, simply to reduce the number of problems for external services communicating with our system. In particular, XMLHttpRequest defaults to utf-8, regardless of the page’s encoding.
In essence, the following snippet shows the idea:

    // declare that the output will be in utf-8
    header("Content-Type: text/html; charset=utf-8");
    // open an output buffer, capturing all output
    ob_start('output_handler');
    // when the script ends, the buffer is piped through this function, encoding it from latin1 to utf-8
    function output_handler($buffer) {
        return utf8_encode($buffer);
    }

Embed utf-8 within latin1.

This is the tricky part. Instead of simply piping the entire buffer through utf8_encode, the string can be parsed so anything between a set of special tags (e.g. [[charset:utf8]] ... [[/charset:utf8]]) is left as-is, while the rest is assumed to be latin1 and encoded with utf8_encode as before. This ensures full backwards compatibility, while allowing real utf-8.
Let’s modify our output-handler from before:
    header("Content-Type: text/html; charset=utf-8");
    ob_start('output_handler');

    function output_handler($buffer) {
        // encode everything, then undo the encoding for the tagged spans;
        // the "s" modifier lets a tagged span contain line breaks
        return preg_replace_callback(
            '~\[\[charset:utf8\]\](.*?)\[\[/charset:utf8\]\]~s',
            'utf8_decode_first',
            utf8_encode($buffer));
    }

    function utf8_decode_first($match) {
        return utf8_decode($match[1]);
    }

And that’s it. We can now embed full utf-8 strings within our otherwise latin1-encoded application, by wrapping them in [[charset:utf8]] tags. To make things a bit more readable, I added a helper function:

    function utf8($utf8_encoded_byte_stream) {
        return '[[charset:utf8]]' . $utf8_encoded_byte_stream . '[[/charset:utf8]]';
    }

And we can now construct a string as simply as:
    echo utf8("blÃ¥bÃ¦r") . "grød";

To produce the output: blåbærgrød
Note: As pointed out by Kore, it would be a problem if the delimiter itself (e.g. [[charset:utf8]]) is part of the data. To remedy this, it would be safer to use a more distinctive delimiter. You could simply replace charset:utf8 with something that is unlikely to ever occur. It’s still not completely bulletproof, but it’s good enough for most practical uses.

Handling input.

You may or may not know this, but when submitting a form, browsers send back data in the same encoding as the page was served in. Since our application is predominantly latin1, we need user input to be latin1, to keep BC. So all input must be decoded from utf-8 to latin1. This is simple enough; we just have to pipe all user input ($_GET, $_POST etc.) through utf8_decode. Since we already run with the latin1-on-the-inside-utf-8-on-the-outside scheme, this was already in place in our case.
This does, however, pose a problem when the user needs to submit utf-8, as our users do when replying to mails. So in these places, we have to explicitly access the “raw” string through an alternate mechanism. In our case, we needed to modify our http-request wrapper, but since this merely extends the API, there are no BC problems.
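As a rough sketch of that alternate mechanism, a wrapper could expose both views of the same parameter. The RequestWrapper class and its method names are hypothetical, not our actual wrapper, and iconv() stands in for utf8_decode() here. Returning null for text that doesn't fit in latin1 is a choice made for this sketch, as is passing non-string values (such as nested arrays) through untouched.

```php
<?php
// Hypothetical request wrapper: get() returns latin1 for existing code,
// getRaw() returns the bytes exactly as the browser sent them (utf-8).
class RequestWrapper
{
    private $params;

    public function __construct(array $params)
    {
        $this->params = $params;
    }

    public function getRaw($name)
    {
        return isset($this->params[$name]) ? $this->params[$name] : null;
    }

    public function get($name)
    {
        $raw = $this->getRaw($name);
        if (!is_string($raw)) {
            return $raw;
        }
        // iconv() returns false when a character has no latin1 equivalent;
        // callers that need such text should use getRaw() instead.
        $latin1 = @iconv('UTF-8', 'ISO-8859-1', $raw);
        return $latin1 === false ? null : $latin1;
    }
}

// "grød" as the browser would submit it from a utf-8 page.
$request = new RequestWrapper(array('subject' => "gr\xC3\xB8d"));
$decoded = $request->get('subject');    // latin1 bytes: "gr\xF8d"
$raw     = $request->getRaw('subject'); // untouched utf-8 bytes
```

Existing code keeps calling get() and never notices the change, while the mail-reply code opts in to getRaw() and wraps the result with utf8() before output.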
With the advent of PHP6, perhaps such hacks won’t be necessary in the future, but for now this gives a working, unobtrusive solution.
Rasmus Lerdorf: PHP Frameworks? Think Again.
This is the first time I have heard Rasmus Lerdorf speak and it was entertaining, to say the least. Refreshing would be another way to describe it. I enjoy hearing real opinions from someone who isn’t holding back, and Rasmus doesn’t hold back.
A little background: Rasmus Lerdorf is the creator of PHP and continues as a core developer on the PHP project.
PHP frameworks
In his address he chose to highlight PHP frameworks (Drupal was not spared) and how poorly they perform. Not only are they slow, but their "jack-of-all-trades" attitude leads developers down the wrong path by not using what is best for the job. He went on to state that PHP developers really need to think about performance, not only for scalability reasons but for green reasons. If programs were more efficient, fewer data centres would be needed and energy use would drop as a result. In our newly emerging age of energy awareness this does become an important aspect and I am glad that he is raising awareness.
Back to frameworks: he started by discussing a database-heavy Twitter mashup that he created. It makes a lot of database calls and does a lot of behind-the-scenes work. By hand-tuning it he was able to get on the order of 280 req/sec. By comparison, a simple HTML page with nothing but "Hello World" served by Apache manages just over 600 req/sec. Okay, the stage is set (by the way, this was tested on his local machine).
Hello World
How do PHP frameworks score on the "Hello World" test? No database calls, just the framework being used in its native tongue to output Hello World. The results were not too good: one of the fastest got just over 120 req/sec, and the slowest got 8 req/sec. This is a dramatic difference and of course highlights his argument for performance. Where did Drupal score? Right above 50 req/sec. So not the greatest, but he did make the point that Drupal is not really a framework in the traditional sense. It is a web content management system that can be quickly extended.
So, are there any frameworks that don’t suck? Rasmus did mention that he liked CodeIgniter because it is faster, lighter and the least like a framework.
How to make PHP fast
"Well, you can’t" was his quick answer. PHP is simply not fast enough to scale to Yahoo levels. PHP was never meant for those sorts of tasks. "Any script based language is simply not fast enough". To get the speed that is necessary for truly massive web systems, you have to use compiled C++ extensions to get a truly scalable architecture. That is what Yahoo does, and so do many other PHP heavyweights.
RDF, Semantic Web and the Monkey
RDF in Drupal. Rasmus made a special point of highlighting the importance of embedding structured metadata into the page. RDFa allows you to embed data into your web pages and also lets you create custom vocabularies, or even better, reuse existing vocabularies. Why would you want to do this? Searchmonkey will go out and index this content and open up a rich search API to allow you to do intelligent queries. Well beyond what is possible with traditional search.
Along with rich search you also get enhanced search results. I have blogged about this previously so take a look. It is really cool stuff and I will be discussing it in much more detail over the course of the conference.
Pitching the Semantic Web
What if all Drupal sites had embedded RDFa tags? Well, for one, Yahoo would be very happy. It would play directly into the strengths of Yahoo’s new Semantic Web strategy. They are trying to do interesting things with semantic data but of course they need data — the classic chicken and egg thing.
Rasmus mentioned that Yahoo’s semantic data store can scale to the size of the web so the invitation is open.
The future of Drupal
This is where my focus at Drupalcon is, driving the adoption of semantic technologies within Drupal — I feel that the momentum here will make that a reality. There is a lot of interest, a Semantic Web BoF session was stacked with people with some cool ideas…
More to come.
