A few years ago, Damien Seguy started collecting information about what version of PHP people were using and how PHP's usage compared to competing technologies such as ASP.NET. As a release manager for PHP 5.1 and 5.2, I found it particularly interesting, because the monthly stats showed the adoption trends of PHP 5.2 and served as a good gauge of how quickly people were migrating. I was also actively involved in the development of FUDforum, and this data helped determine what new PHP features I could rely on and whether support for older versions of PHP could be discontinued.
Unfortunately, sometime in 2008, the process of gathering these stats petered out, and the PHP community was left without them. About a month ago, after talking to Damien, I decided to restart the process and eventually expand it from 11 million domains to about 120 million. I want to share some of the data and conclusions that can be derived from my initial run.
Before we jump right into the statistics, let's take a moment to review how the data was gathered. The first step of the process was to write a tool that would be able to gather the data. To keep things simple, I decided to use pecl_http, because it allows multiple parallel requests, which I would need in order to do millions of requests within a reasonable timeframe. Given that the test was running on a fairly powerful server, I assumed that the bottleneck was likely to be network based, and writing the tool in C would not yield any substantial benefits. At this point, the goal was to get a gauge for the speed of data retrieval to see how practical it would be to generate the data on a monthly basis.
To minimize bandwidth usage and to speed things up, I used HEAD requests with a 3-second timeout. I arbitrarily decided that if a site does not respond within 3 seconds, it's not worth testing, because I would certainly not wait more than 3 seconds for a page to start loading.
My first test run, with 25 parallel requests yielding 10 requests per second, was way too slow for my purposes. Increasing the number of parallel requests, surprisingly, did nothing to improve speed. Looking at the CPU and network utilization also did not expose any issues; the load was negligible, and there was very little traffic on the network. A bit of a WTF moment. Fortunately, the problem was quick to identify, although it did leave me a bit disappointed in libcurl, which pecl_http relies on. While libcurl can certainly process a large number of parallel requests, when it comes to resolving the domain name to an IP address, the process is actually sequential and not parallel! Surprised? Yeah, I was, too. So, how do you resolve 12 million domains quickly? Well, if Ken Thompson is to be believed, when in doubt, use brute force, and by brute force, I mean C. And thus, resolv.c came to be, a 150-line multi-process resolver. Using 50 forked children, it blew through 12 million domains in just about 30 hours, and in the process, made named use about 3.4 gigabytes of memory.
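The idea behind the resolver, splitting the lookups across a fixed number of concurrent workers so that no single slow lookup stalls the rest, can be sketched in a few lines. This is not resolv.c: it is a minimal Python sketch that swaps the forked C children for a thread pool. The function names (resolve_one, resolve_all) and the injectable resolver parameter are hypothetical; only the worker count of 50 comes from the text above.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_one(domain):
    """Return one IPv4 address for `domain`, or None if the lookup fails."""
    try:
        return socket.gethostbyname(domain)
    except (OSError, UnicodeError):
        # OSError covers socket.gaierror/herror; UnicodeError covers
        # names that cannot be encoded as valid DNS labels.
        return None


def resolve_all(domains, workers=50, resolver=resolve_one):
    """Resolve `domains` concurrently and return {domain: ip_or_None}.

    Assumption: a thread pool stands in for the 50 forked children
    described in the article. `resolver` is injectable so the fan-out
    logic can be exercised without touching the network.
    """
    domains = list(domains)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each domain
        # with its own result.
        return dict(zip(domains, pool.map(resolver, domains)))
```

For scale, 12 million domains in about 30 hours works out to roughly 110 lookups per second (12,000,000 / 108,000 seconds), so each of the 50 workers only needs to complete a little over two lookups per second.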
After resolving, I tweaked the original PHP code to make connections to the IPs directly and send a Host header containing the corresponding domain. With this in place, it was just a matter of determining how many requests would saturate the 10 MB pipe. This magic number ended up being 400 parallel requests, which kept the requests going at an average speed of 150 requests per second. After another day of operation, I had 10.8 million successfully resolved and completed requests from the initial 12.3 million data sample.
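For readers who want to see what "connect to the IP, send the domain in the Host header" looks like in practice, here is a minimal sketch. It uses Python's standard http.client rather than the PHP and pecl_http code the tool was actually written in, and the function name head_by_ip is hypothetical. The 3-second timeout and the HEAD method are the ones described earlier; note that in this sketch the timeout applies to each socket operation, not to the request as a whole.

```python
import http.client


def head_by_ip(ip, domain, port=80, timeout=3.0):
    """Send `HEAD /` to `ip`, presenting `domain` in the Host header.

    Returns (status, headers) on success, or None if the server does
    not answer in time or the exchange fails for any other reason.
    """
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # Supplying our own Host header stops http.client from adding
        # one based on the IP address we connected to.
        conn.request("HEAD", "/", headers={"Host": domain})
        resp = conn.getresponse()
        # dict() keeps only the last value of a repeated header, which
        # is enough for a sketch.
        return resp.status, dict(resp.getheaders())
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()
```

Because DNS is already out of the way, a call like head_by_ip("192.0.2.10", "example.com") (a documentation-range address, used here purely for illustration) spends its time only on the connection and the response headers, which is what makes a large number of parallel requests worthwhile.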
To minimize the overhead during the request processing, the actual data analysis was left until the end. If you are curious how many gigabytes it takes to store 10.8 million headers, the answer is around 4.9. For my purposes, I focused on three headers: X-Powered-By, Server, and Cookie.
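The analysis step boils down to classifying each stored set of headers and counting the results. The sketch below shows one way that could look in Python. The substring rules in RULES are purely an assumption made for illustration (the actual matching rules are not described here), and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical substring -> language rules, checked in order.
RULES = [
    ("php", "PHP"),
    ("asp.net", "ASP.NET"),
    ("perl", "Perl"),
    ("python", "Python"),
    ("ruby", "Ruby"),
    ("servlet", "Java"),
    ("jsp", "Java"),
]


def detect_language(headers):
    """Return a language name, or None when no rule matches."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    haystack = " ".join(
        lowered.get(name, "") for name in ("x-powered-by", "server")
    ).lower()
    for needle, language in RULES:
        if needle in haystack:
            return language
    return None


def tally(all_headers):
    """Count identifiable languages across an iterable of header dicts."""
    counts = Counter()
    for headers in all_headers:
        language = detect_language(headers)
        if language is not None:
            counts[language] += 1
    return counts


def shares(counts):
    """Convert raw counts into whole-number percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total) for lang, n in counts.items()}
```

Only identified domains go into the total, so the percentages describe the share among sites whose language could be determined, not among all sites that responded.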
The results!

Language   Domains   Share
PHP        3998425   59%
ASP.NET    2294166   34%
Perl        259931    4%
Python      159475    2%
Ruby         16539    0%
Java         18065    0%
The above chart shows the breakdown of the 6 major, identifiable languages from 6.7 million domains where the language could be determined. One of the surprising things to me was the popularity of ASP.NET. The next chart, showing the web server popularity, will explain this anomaly.