How to add new (syntactic) features to PHP

Note: This article was originally published at Planet PHP on 21 April 2400.

Several people have recently asked me where you should start if you want to add some new (syntactic) feature to PHP. As I'm not aware of any existing tutorials on that matter, I'll try to illustrate the whole process in the following. At the same time this is a general introduction to the workings of the Zend Engine. So upfront: I apologize for this overly long post.

This post assumes that you already have some basic knowledge of C and also know the fundamental concepts of the PHP implementation (like zvals). If not, you should read up on them beforehand.

As an example I'll use the addition of an in operator which you might already know from other languages like Python. It works as follows:

$words = ['hello', 'world', 'foo', 'bar'];
var_dump('hello' in $words); // true
var_dump('foo' in $words);   // true
var_dump('blub' in $words);  // false

$string = 'PHP is fun!';
var_dump('PHP' in $string);    // true
var_dump('Python' in $string); // false

So basically, for arrays the in operator is the same as the in_array function (but without the needle/haystack problem) and for strings it's like doing a false !== strpos($str2, $str1).
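Since the operator will eventually be implemented in C, it may help to see the intended semantics as a small standalone sketch first. The function names in_string and in_string_array are made up for illustration and have nothing to do with the Zend Engine API. The sketch only handles C strings and arrays of C strings, not zvals.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* String case: true if needle occurs anywhere in haystack.
 * Assumption: an empty needle counts as found, because that is
 * how strstr behaves. */
bool in_string(const char *needle, const char *haystack)
{
    return strstr(haystack, needle) != NULL;
}

/* Array case: true if any element equals needle (linear search,
 * similar in spirit to in_array for an array of strings). */
bool in_string_array(const char *needle,
                     const char *const *haystack, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(haystack[i], needle) == 0) {
            return true;
        }
    }
    return false;
}
```

Note that the needle always comes first here, mirroring the operand order of 'needle' in $haystack.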

Prerequisites

Before we can get going, you'll have to first check out and compile PHP. To do so, you need a few tools. Most of them are probably already installed on your system, but you may need to install "re2c" and "bison" using the package manager of your choice. On Ubuntu you'd do this:

$ sudo apt-get install re2c
$ sudo apt-get install bison

Next, clone php-src from git and compile it:

# get source code
$ git clone http://git.php.net/repository/php-src.git
$ cd php-src

# create new branch for in operator
$ git checkout -b addInOperator

# build ./configure script
$ ./buildconf

# configure PHP in debug mode and with thread safety
$ ./configure --disable-all --enable-debug --enable-maintainer-zts

# compile (4 is the number of cores you have)
$ make -j4

The PHP binary should now be available in sapi/cli/php. You can try to do a few things:

$ sapi/cli/php -v
$ sapi/cli/php -r 'echo "Hallo World!";'

Now that you have a (hopefully) working PHP compile, we'll take a look at what PHP actually does when it runs a script.

The life of a PHP script

To run a script PHP goes through three main phases:

  1. Tokenization
  2. Parsing & Compilation
  3. Execution

In the following I'll explain what exactly is done in each phase, how it is implemented and what we need to change in order to get the in operator working.

Tokenization

In the first phase PHP reads in the source code and breaks it down into smaller units called "tokens". For example the PHP code would be broken down to the following tokens:

T_OPEN_TAG (

As you can see, the raw source code was broken down into semantically meaningful tokens. This process is referred to as tokenization, lexing or scanning, and it is implemented in the zend_language_scanner.l file of the Zend/ directory.

If you open the file and scroll down a bit (to somewhere around line 1000), you'll find a large number of token definitions that look like this:

<ST_IN_SCRIPTING>"exit" { return T_EXIT; }

The meaning should be rather obvious: If exit is encountered in the source code, the lexer should tag it as T_EXIT. The content between < and > is the state that the text should be matched in. ST_IN_SCRIPTING is the normal state for PHP code. Some examples of other states are ST_DOUBLE_QUOTE (in double quoted string), ST_HEREDOC (in heredoc string), etc.
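To make the idea of lexer states more concrete, here is a tiny hand-written sketch in plain C. It is not how the re2c-generated scanner works internally. The function count_exit_tokens and the two-value enum are hypothetical, and string escapes and word boundaries are deliberately ignored. It only shows that a keyword rule bound to ST_IN_SCRIPTING does not fire while the lexer is inside a double quoted string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced set of states; the real scanner has more. */
enum lexer_state { ST_IN_SCRIPTING, ST_DOUBLE_QUOTE };

/* Counts how often "exit" would be tagged as T_EXIT in src.
 * The keyword rule is only active in ST_IN_SCRIPTING. */
size_t count_exit_tokens(const char *src)
{
    enum lexer_state state = ST_IN_SCRIPTING;
    size_t count = 0;
    size_t i = 0;

    while (src[i] != '\0') {
        if (src[i] == '"') {
            /* A double quote switches between the two states. */
            state = (state == ST_IN_SCRIPTING) ? ST_DOUBLE_QUOTE
                                               : ST_IN_SCRIPTING;
            i++;
        } else if (state == ST_IN_SCRIPTING
                   && strncmp(src + i, "exit", 4) == 0) {
            /* Rule matched in the scripting state: emit T_EXIT. */
            count++;
            i += 4;
        } else {
            /* Anything else is skipped in this sketch. */
            i++;
        }
    }
    return count;
}
```

For the input echo "exit"; exit; only the second exit is counted, because the first one is seen while the state is ST_DOUBLE_QUOTE.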

Another thing that can be done in the scanning routines is specifying a "semantic" value (also called "lower value" or "lval" for short). Here is an example:

{LABEL} {
    zend_copy_value(zendlval, yytext, yyleng);
    zendlval->type = IS_STRING;
    return T_STRING;
}

{LABEL} matches a PHP identifier (it is defined as [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
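The LABEL pattern can also be read as a small hand-written check. The following sketch spells out the two character classes in plain C. The function names are invented for this example and are not part of the scanner, which lets re2c generate the matching code instead.

```c
#include <stdbool.h>

/* First character class: [a-zA-Z_\x7f-\xff] */
bool is_label_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || c == '_' || c >= 0x7f;
}

/* Class for all following characters: [a-zA-Z0-9_\x7f-\xff] */
bool is_label_char(unsigned char c)
{
    return is_label_start(c) || (c >= '0' && c <= '9');
}

/* True if the whole NUL-terminated string s matches the LABEL pattern. */
bool is_label(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    /* Rejects the empty string too, since NUL is not a start character. */
    if (!is_label_start(*p)) {
        return false;
    }
    for (p++; *p != '\0'; p++) {
        if (!is_label_char(*p)) {
            return false;
        }
    }
    return true;
}
```

Because every byte from 0x7f to 0xff is accepted, multi-byte UTF-8 characters pass this check as well, while a leading digit or a character such as - does not.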
