The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, which makes it well suited to applications with large data sets.

1 HDFS

HDFS has many similarities with other distributed file systems, but is different in several respects. One noticeable difference is HDFS's write-once-read-many model that relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access.

Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space. HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written.
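
As a minimal sketch of this single-writer, append-only model (assuming a Hadoop version and cluster configuration with append support enabled, and a hypothetical existing file logs/events.log), a writer simply reopens the file and adds bytes at the end of the stream:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MyAppendSketch {
	public static void main(String[] args) throws Exception {
		FileSystem fs = FileSystem.get(new Configuration());

		// logs/events.log is a placeholder path; the file must already exist
		Path path = new Path("logs/events.log");

		// Only one writer may have the file open at a time, and bytes are
		// always added at the end of the stream, never in the middle
		FSDataOutputStream out = fs.append(path);
		out.writeBytes("another record\n");
		out.close();
	}
}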

HDFS has many goals. Here are some of the most notable:

  • Fault tolerance by detecting faults and applying quick, automatic recovery
  • Data access via MapReduce streaming
  • Simple and robust coherency model
  • Processing logic close to the data, rather than the data close to the processing logic
  • Portability across heterogeneous commodity hardware and operating systems
  • Scalability to reliably store and process large amounts of data
  • Economy by distributing data and processing across clusters of commodity personal computers
  • Efficiency by distributing data and logic to process it in parallel on nodes where data is located
  • Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failures

HDFS provides interfaces that let applications move their processing closer to where the data is located, as described in the following section.

1.1 Application interfaces into HDFS

You can access HDFS in many different ways. HDFS provides a native Java™ application programming interface (API) and a native C-language wrapper for the Java API. In addition, you can use a command line shell or a web browser to browse HDFS files.
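
For example, the command line shell exposes familiar file system operations; the paths below are placeholders:

hadoop fs -mkdir /user/sample                # create a directory in HDFS
hadoop fs -put localfile.txt /user/sample    # copy a local file into HDFS
hadoop fs -ls /user/sample                   # list the directory contents
hadoop fs -cat /user/sample/localfile.txt    # print the file to standard output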

1.2 HDFS architecture

HDFS consists of interconnected clusters of nodes on which files and directories reside. An HDFS cluster contains a single node, known as the NameNode, that manages the file system namespace and regulates client access to files, together with data nodes (DataNodes) that store the contents of files as blocks.

1.2.1 Name nodes and data nodes

Within HDFS, a given name node manages file system namespace operations like opening, closing, and renaming files and directories. A name node also maps data blocks to data nodes, which handle read and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node.
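
The name node's mapping of blocks to data nodes is also visible through the Java API. The following sketch (the file path and class name are placeholders) asks the name node which hosts hold each block of a file:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MyBlockLocationSketch {
	public static void main(String[] args) throws Exception {
		FileSystem fs = FileSystem.get(new Configuration());

		// data/input.txt is a placeholder for an existing HDFS file
		FileStatus status = fs.getFileStatus(new Path("data/input.txt"));

		// Ask the name node which data nodes hold each block of the file
		BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
		for (BlockLocation block : blocks) {
			System.out.println("offset " + block.getOffset()
					+ ", length " + block.getLength()
					+ ", hosts " + Arrays.toString(block.getHosts()));
		}
	}
}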

1.3 HDFS Java client

////////////////////////////////////////////////////////////////////////////////////////
//
// How to compile:
//	export CLASSPATH=$CLASSPATH:.:/usr/lib/crunch/lib/hadoop-common.jar:/usr/lib/crunch/lib/hadoop-annotations.jar
// 	javac MyHadoopIOTest.java 
// 	jar cvf MyHadoopIOTest.jar MyHadoopIOTest.class
//	/usr/bin/hadoop jar MyHadoopIOTest.jar MyHadoopIOTest
//
// Reference:
// 	http://free-hadoop-tutorials.blogspot.com/2011/04/using-hdfs-programmatically.html
//
////////////////////////////////////////////////////////////////////////////////////////
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
public class MyHadoopIOTest {
	public static void main(String[] args){
		try {
			System.out.println("Writing a file .... ");
			// Write a text file; the relative path resolves under the
			// user's HDFS home directory (for example, /user/<username>)
			Path path = new Path("hello.txt");
			FileSystem fs = FileSystem.get(new Configuration());
			FSDataOutputStream fso = fs.create(path);
			fso.writeUTF("hello world");
			fso.close();
		
			// Read the text file
			FSDataInputStream fsi = fs.open(path);
			String greeting = fsi.readUTF();
			fsi.close();
			System.out.println(greeting);

		} catch (Exception ex){
			System.out.println("Some error: " + ex.toString());
		}
	}
}

1.4 The Small Files Problem

Small files are a big problem in Hadoop — or, at least, they are if the number of questions on the user list on this topic is anything to go by.

A small file is one that is significantly smaller than the HDFS block size (64 MB by default). If you're storing small files, then you probably have lots of them (otherwise you wouldn't turn to Hadoop), and the problem is that HDFS cannot handle large numbers of files.

Every file, directory, and block in HDFS is represented as an object in the namenode's memory, and each object occupies roughly 150 bytes as a rule of thumb. So 10 million files, each small enough to fit in a single block, account for about 20 million objects (one file object and one block object apiece), or roughly 3 gigabytes of namenode memory. Scaling up much beyond this level is a problem with current hardware; a billion files is certainly not feasible.
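
A back-of-the-envelope sketch of that arithmetic (the 150-byte figure is the rule of thumb quoted above, not an exact measurement):

public class NamenodeMemoryEstimate {
	public static void main(String[] args) {
		long files = 10000000L;        // number of small files
		long objectsPerFile = 2;       // one file object plus one block object per file
		long bytesPerObject = 150;     // rule-of-thumb namenode memory per object

		long heapBytes = files * objectsPerFile * bytesPerObject;
		// 10,000,000 * 2 * 150 = 3,000,000,000 bytes, roughly 3 gigabytes
		System.out.println("Estimated namenode heap: " + heapBytes + " bytes");
	}
}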

Furthermore, HDFS is not geared up for efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each one, which is an inefficient data access pattern.