An introduction to Hadoop
Note: This article was originally published at Planet PHP on 21 July 2010.
What is Hadoop?

Apache Hadoop is a Java framework for large-scale distributed batch processing that runs on commodity hardware. Its biggest advantage is the ability to scale out to hundreds or thousands of machines, efficiently distributing large amounts of work across the cluster. By very large amounts of data I mean hundreds of gigabytes, terabytes, or even petabytes: far more than fits on a single node. Hadoop therefore includes its own storage layer, the "Hadoop Distributed File System" (HDFS), which splits data into many smaller blocks and stores each block redundantly across multiple nodes. Hadoop can also handle smaller amounts of data and is even able to run on a single computer, but in that case it is not particularly efficient because of the resulting overhead; there are better alternatives, such as Gearman, for such workloads.
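To make the HDFS idea concrete, here is a minimal sketch of the concept: split a byte array into fixed-size blocks and assign each block to several distinct nodes. All class names, block sizes, and the round-robin placement below are simplifications invented for illustration; real HDFS uses far larger blocks (64 MB by default at the time of writing) and a rack-aware replica placement policy.

```java
import java.util.*;

// Toy model of HDFS-style storage: split data into fixed-size blocks,
// then replicate each block across several distinct nodes.
// Names and sizes are hypothetical, chosen only to show the principle.
public class BlockPlacement {
    static final int BLOCK_SIZE = 4;   // bytes per block (tiny, for illustration)
    static final int REPLICATION = 3;  // copies kept of each block

    // Split the payload into chunks of at most BLOCK_SIZE bytes.
    static List<byte[]> split(byte[] data) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            blocks.add(Arrays.copyOfRange(data, off,
                    Math.min(off + BLOCK_SIZE, data.length)));
        }
        return blocks;
    }

    // Assign each block index to REPLICATION distinct nodes, round-robin.
    static Map<Integer, List<String>> place(int blockCount, List<String> nodes) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < blockCount; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        byte[] data = "hello distributed world".getBytes();
        List<byte[]> blocks = split(data);
        Map<Integer, List<String>> placement =
                place(blocks.size(), List.of("node1", "node2", "node3", "node4"));
        System.out.println(blocks.size() + " blocks placed: " + placement);
    }
}
```

Because every block lives on three nodes, the loss of any single node leaves at least two intact copies of each block, which is the property the cluster relies on for fault tolerance.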
A Hadoop cluster often consists of thousands of nodes, on which hardware failures are a daily occurrence. Hadoop is therefore built with a high degree of fault tolerance, so that failures are detected and corrected as quickly and as automatically as possible.