News Archive
PhpRiot Newsletter
Your Email Address:

More information

Understanding PHP's internal array implementation (PHP's Source Code for PHP Developers - Part 4)

Note: This article was originally published at Planet PHP on 28 March 2012.
Planet PHP

Welcome back to the fourth part of the aoPHP's Source Code for PHP Developersa series, in which we'll cover how PHP arrays are internally represented and used throughout the code base.

In case you missed them, here are the previous parts of this series:

Everything is a hash table!

Basically, everything in PHP is a hash table. Not only are hash tables used in the underlying implementation of PHP arrays, they are also used to store object properties and methods, functions, variables and pretty much everything else.

And because the hash table is so fundamental to PHP, it is worth having a deeper look into how it works.

So, what is a hash table?

Remember that in C arrays are basically chunks of memory, which you can access by index. Thus arrays in C only have integer keys and have to be continuous (i.e. you can't have a key 0 and the next key is 1332423442). There is no such thing as an associative array.

And this is where hash tables come in: They convert string keys into normal integer keys using a hash function. The result can then be used as an index into a normal C array (aka chunk of memory). The problem here obviously is that the hash function can have collisions, i.e. multiple string keys can yield the same hash. For example in a PHP array with up to 64 elements the strings "foo" and "oof" would have the same hash.

This problem is solved by not storing the value directly at the generated index, but storing a linked list of possible values instead.

HashTable and Bucket

So, after the basic concept of hash tables is clear, let's have a look at the structures actually used in PHP's hash table implementation:

The first one is the HashTable:

typedef struct _hashtable { uint nTableSize; uint nTableMask; uint nNumOfElements; ulong nNextFreeElement; Bucket *pInternalPointer; Bucket *pListHead; Bucket *pListTail; Bucket **arBuckets; dtor_func_t pDestructor; zend_bool persistent; unsigned char nApplyCount; zend_bool bApplyProtection; #if ZEND_DEBUG int inconsistent; #endif } HashTable;

Let's quickly go through it:

  • nNumOfElements specifies how many values are currently stored in the array. This is also the number that count($array) returns.

  • nTableSize specifies the size of the internal C array. It is always the next power of 2 greater or equal to nNumOfElements. E.g. if an array stores 32 elements, the internal C array also has a size of 32. But if one more element is added, i.e. the array then contains 33 elements, the internal C array is resized to 64 elements.

    This is done to always keep the hash table efficient in space and time. It is clear that if the internal array is too small there will be many collisions and the performance will degrade. If the internal array is too big on the other hand, we'd be wasting memory. The power-of-2 size is a good compromise.

  • nTableMask is the table size minus one. This mask is used to adjust the generated hashes for the current table size. For example the actual hash for "foo" (through the DJBX33A hashing function) is 193491849. If we currently have a table size of 64, we obviously can't use that as an index into the array. Instead we only take the lower bits of the hash by applying the table mask:

    hash | 193491849 | 0b1011100010000111001110001001 & mask | & 63 | & 0b0000000000000000000000111111 --------------------------------------------------------- = index | = 9 | = 0b0000000000000000000000001001
  • nNextFreeElement is the next free integer key, which is used when you append to an array using $array[] = xyz.

  • pInternalPointer stores the current position in the array. This is used for foreach iteration and can be ac

Truncated by Planet PHP, read more at the original (another 13700 bytes)