Software Development Metrics that Matter

As an industry, we do a surprisingly poor job of measuring the work that we do and how well we do it. Outside of a relatively small number of organizations that bought into expensive, heavyweight models like CMMI, TSP/PSP (which is all about measuring at a micro-level) or Six Sigma, most of us don't measure enough, don't measure the right things, or don't understand what to do with the things that we do measure. We can't agree on something as basic as
how to measure programmer productivity, or even on consistent measures of system size: should we count lines of code (LOC, SLOC, NCLOC, ELOC), IFPUG function points, object-oriented function points, weighted micro function points, COSMIC Full Function Points, the number of classes, or the number of something else…?
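To see why these size measures disagree even on trivial inputs, here is a hypothetical sketch (class and method names are mine) of two of the simplest metrics mentioned above: raw LOC counts every physical line, while NCLOC drops blank lines and comments.

```java
// Toy sketch of two size metrics: the same source yields different "sizes"
// depending on which definition you pick.
public class SizeMetrics {

    // LOC: every physical line counts.
    public static long loc(String source) {
        return source.lines().count();
    }

    // NCLOC: blank lines and '//' comment lines are excluded.
    public static long ncloc(String source) {
        return source.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("//"))
                .count();
    }
}
```

Even this toy version gives two different answers for the same snippet, which is exactly the disagreement the paragraph above describes, before function points enter the picture at all.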



Apache Camel Tutorial – Introduction to EIP, Routes, Components, Testing and other Concepts

Data exchanges between companies are increasing, and so is the number of applications that must be integrated. The interfaces use different technologies, protocols, and data formats. Nevertheless, the integration of these applications should be modeled in a standardized way, realized efficiently, and supported by automatic tests. Such a standard exists in the Enterprise Integration Patterns (EIP) [1], which have become the industry standard for describing, documenting and implementing integration problems. Apache Camel [2] implements the EIPs and offers a standardized, internal domain-specific language (DSL) [3] to integrate applications. This article gives an introduction to Apache Camel, including several code examples.

Enterprise Integration Patterns

EIPs can be used to split integration problems into smaller pieces and model them using standardized graphics. Everybody can understand these models easily. Besides, there is no need to reinvent the wheel every time for each integration problem.

Using EIPs, Apache Camel closes a gap between modeling and implementation. There is almost a one-to-one relation between EIP models and the DSL of Apache Camel. This article explains the relation of EIPs and Apache Camel using an online shop example.

Use Case: Handling Orders in an Online Shop

The main concepts of Apache Camel are introduced by implementing a small use case. Starting your own project should be really easy after reading this article. The easiest way to get started is to use a Maven archetype [4]; this way, you can rebuild the following example within minutes. Of course, you can also download the whole example at once [5].

Figure 1 shows the example from an EIP perspective. The task is to process the orders of an online shop. Orders arrive in CSV format. First, the orders have to be transformed to the internal format. The order items of each order must be split, because the shop only sells DVDs and CDs; all other order items are forwarded to a partner.


Figure 1: EIP Perspective of the Integration Problem

This example shows the advantages of EIPs: the integration problem is split into several small, recurring subproblems. These subproblems are easy to understand and are solved the same way each time. After describing the use case, we will now look at the basic concepts of Apache Camel.

Basic Concepts

Apache Camel runs on the Java Virtual Machine (JVM). Most components are realized in Java, though this is not a requirement for new components; for instance, the camel-scala component is written in Scala. The Spring framework is used in some parts, e.g. for transaction support, but Spring dependencies were reduced to a minimum in release 2.9 [6]. The core of Apache Camel is very small and contains only commonly used components (i.e. connectors to several technologies and APIs) such as Log, File, Mock or Timer.

Further components can be added easily due to the modular structure of Apache Camel. Maven is recommended for dependency management, because most technologies require additional libraries, though libraries can, of course, also be downloaded manually and added to the classpath.

The core functionality of Apache Camel is its routing engine: it dispatches messages based on the defined routes. A route contains flow and integration logic and is implemented using EIPs and a specific DSL. Each message contains a body, several headers and optional attachments. Messages are sent from a provider to a consumer; in between, they may be processed, e.g. filtered or transformed. Figure 1 shows how the messages can change within a route.

Messages between a provider and a consumer are managed by a message exchange container, which contains a unique message id, exception information, the incoming and outgoing messages (i.e. request and response), and the message exchange pattern (MEP) in use. The "In Only" MEP is used for one-way messages such as JMS, whereas the "In Out" MEP executes request-response communication, such as a client-side HTTP request and its response from the server side.

After shortly explaining the basic concepts of Apache Camel, the following sections will give more details and code examples. Let’s begin with the architecture of Apache Camel.


Figure 2 shows the architecture of Apache Camel. A CamelContext provides the runtime system. Processors handle the things that happen between endpoints, such as routing or transformation. Endpoints connect the technologies to be integrated. Apache Camel offers different DSLs to realize the integration problems.

Figure 2: Architecture of Apache Camel


The CamelContext is the runtime system of Apache Camel and connects its different concepts such as routes, components or endpoints. The following code snippet shows a Java main method which starts the CamelContext and stops it after 30 seconds. Usually, the CamelContext is started when loading the application and stopped at shutdown.

public class CamelStarter {

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new IntegrationRoute());
        context.start();
        Thread.sleep(30000);
        context.stop();
    }
}
The runtime system can be included anywhere in the JVM environment, including web container (e.g. Tomcat), JEE application server (e.g. IBM WebSphere AS), OSGi container, or even in the cloud.

Domain Specific Languages

DSLs facilitate the realization of complex projects by providing a higher abstraction level. Apache Camel offers several different DSLs. The Java, Groovy and Scala DSLs use object-oriented concepts and offer a specific method for most EIPs. The Spring XML DSL, on the other hand, is based on the Spring framework and uses XML configuration. In addition, OSGi blueprint XML is available for OSGi integration.

The Java DSL has the best IDE support. The Groovy and Scala DSLs are similar to the Java DSL but additionally offer typical features of modern JVM languages such as concise code and closures. Contrary to these programming languages, the Spring XML DSL requires a lot of XML; on the other hand, it offers a very powerful Spring-based dependency injection mechanism and nice abstractions to simplify configuration (such as JDBC or JMS connections). In most use cases, the choice is purely a matter of taste, and even a combination is possible: many developers use Spring XML for configuration whilst routes are realized in Java, Groovy or Scala.

Routes are a crucial part of Apache Camel: the flow and logic of an integration are specified here. The following example shows a route using the Java DSL:

public class IntegrationRoute extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        from("file:target/inbox")
            .process(new LoggingProcessor())
            .bean(new TransformationBean(), "makeUpperCase")
            .to("file:target/outbox");
    }
}
The DSL is easy to use. Everybody should be able to understand the above example without even knowing Apache Camel. The route realizes a part of the described use case. Orders are put in a file directory from an external source. The orders are processed and finally moved to the target directory.

Routes have to extend the "RouteBuilder" class and override the "configure" method. The route itself begins with a "from" endpoint and finishes at one or more "to" endpoints. In between, all necessary process logic is implemented. Any number of routes can be implemented within one "configure" method.

The following snippet shows the same route realized via Spring XML DSL:

<beans … >
  <bean class="mwea.TransformationBean" id="transformationBean"/>
  <bean class="mwea.LoggingProcessor" id="loggingProcessor"/>

  <camelContext xmlns="http://camel.apache.org/schema/spring">
    <route>
      <from uri="file:target/inbox"/>
      <process ref="loggingProcessor"/>
      <bean ref="transformationBean"/>
      <to uri="file:target/outbox"/>
    </route>
  </camelContext>
</beans>
Besides routes, another important concept of Apache Camel is its components. They offer integration points for almost every technology.


In the meantime, over 100 components are available. Besides widespread technologies such as HTTP, FTP, JMS or JDBC, many more technologies are supported, including cloud services from Amazon, Google, GoGrid, and others. New components are added in each release, and the community often builds new custom components, because doing so is very easy.

The most amazing feature of Apache Camel is its uniformity. All components use the same syntax and concepts. Every integration, and even its automatic unit tests, looks the same. Thus, complexity is reduced a lot. Consider changing the above example: if orders should be sent to a JMS queue instead of a file directory, just change the "to" endpoint from "file:target/outbox" to "jms:queue:orders". That's it! (Of course, JMS must be configured once within the application beforehand.)
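The reason a one-word URI change is enough can be sketched in plain Java (this is a toy illustration, not Camel code, and all names here are mine): the scheme prefix of the URI selects the component, so the rest of the route never touches the underlying technology.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Toy endpoint registry: the scheme prefix of a URI picks the "component",
// which is why swapping "file:target/outbox" for "jms:queue:orders" is a
// single change in a route.
public class EndpointRegistry {
    private final Map<String, Consumer<String>> components = new HashMap<>();

    public void register(String scheme, Consumer<String> component) {
        components.put(scheme, component);
    }

    public void send(String uri, String message) {
        String scheme = uri.substring(0, uri.indexOf(':'));
        Consumer<String> component = components.get(scheme);
        if (component == null) {
            throw new IllegalArgumentException("No component for scheme: " + scheme);
        }
        component.accept(message); // the handler behind the scheme does the work
    }
}
```

Registering one handler per scheme and varying only the URI string mirrors, in miniature, how Camel resolves an endpoint from its URI.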

While components offer the interface to technologies, Processors and Beans can be used to add custom integration logic to a route.

Processors and Beans

Besides using EIPs, you often have to add individual integration logic. This is very easy and again always uses the same concepts: Processors or Beans. Both were used in the route example above.

Processor is a simple Java interface with one single method: "process". Inside this method, you can do whatever you need to solve your integration problem, e.g. transform the incoming message, call other services, and so on.

public class LoggingProcessor implements Processor {

    public void process(Exchange exchange) throws Exception {
        System.out.println("Received Order: " +
                exchange.getIn().getBody(String.class));
    }
}
The "exchange" parameter contains the Message Exchange with the incoming message, the outgoing message, and other information. Because you implement the Processor interface, you have a dependency on the Camel API. This can sometimes be a problem: maybe you already have existing integration code which cannot be changed (i.e. you cannot implement the Processor interface)? In this case, you can use Beans, also called POJOs (Plain Old Java Objects). You get the incoming message as the parameter of the method and return an outgoing message, as shown in the following snippet:

public class TransformationBean {

    public String makeUpperCase(String body) {
        String transformedBody = body.toUpperCase();
        return transformedBody;
    }
}
The above bean receives a String, transforms it, and finally sends it to the next endpoint. Look at the route above again: the incoming message is a File. You may wonder why this works. Apache Camel offers another powerful feature here: more than 150 automatic type converters are included out of the box, e.g. FileToString, CollectionToObject[] or URLtoInputStream. By the way, further type converters can be created and added to the CamelContext easily [7].

If a Bean contains only one single method, its name can even be omitted in the route. The above call therefore could also be .bean(new TransformationBean()) instead of .bean(new TransformationBean(), "makeUpperCase").

Adding some more Enterprise Integration Patterns

The above route transforms incoming orders using the Translator EIP before processing them. Besides this transformation, additional work is required to realize the whole use case. Therefore, further EIPs are used in the following example:

public class IntegrationRoute extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        from("file:target/inbox")
            .process(new LoggingProcessor())
            .bean(new TransformationBean())
            .unmarshal().csv()
            .split(body())
            .choice()
                .when(body().contains("dvd"))
                    .to("file:target/outbox/dvd")
                .when(body().contains("cd"))
                    .to("activemq:queue:cd")
                .otherwise()
                    .to("mock:others");
    }
}
Each CSV file represents one single order containing one or more order items. The camel-csv component is used to convert the CSV message. Afterwards, the Splitter EIP separates each order item of the message body. In this case, the default separator (a comma) is used, though complex regular expressions or scripting languages such as XPath, XQuery or SQL can also be used to split the message.

Each order item has to be sent to a specific processing unit (remember: there are DVD orders, CD orders, and other orders which are sent to a partner). The Content-Based Router EIP solves this problem without any individual coding effort: DVD orders are processed via a file directory whilst CD orders are sent to a JMS queue.

ActiveMQ is used as the JMS implementation in this example. To add ActiveMQ support to a Camel application, you only have to add the related Maven dependency for the camel-activemq component, or add the JARs to the classpath manually. That's it. Some other components need a little more one-time configuration; for instance, if you want to use WebSphere MQ or another JMS implementation instead of ActiveMQ, you have to configure the JMS provider.

All other order items besides DVDs and CDs are sent to a partner. Unfortunately, this interface is not available yet, so the Mock component is used to simulate it for the time being.

The above example shows impressively how different interfaces (in this case File, JMS, and Mock) can be used within one route. You always apply the same syntax and concepts, despite the very different technologies.

Automatic Unit and Integration Tests

Automatic tests are crucial, yet they are usually neglected in integration projects. The reason is the large effort and high complexity caused by the several different technologies involved.

Apache Camel solves this problem: it offers test support via JUnit extensions. The test class must extend CamelTestSupport to use Camel's powerful testing capabilities. Besides additional assertions, mocks are supported implicitly; no other mock framework such as EasyMock or Mockito is required. You can even simulate sending messages to a route, or receiving messages from it, via a producer or consumer template respectively. All routes can be tested automatically using this test kit. Once again, the syntax and concepts are the same for every technology.

The following code snippet shows a unit test for our example route:

public class IntegrationTest extends CamelTestSupport {

    @Before
    public void setup() throws Exception {
        super.setUp();
        context.addRoutes(new IntegrationRoute());
    }

    @Test
    public void testIntegrationRoute() throws Exception {
        // Body of test message containing several order items
        String bodyOfMessage = "Harry Potter / dvd, Metallica / cd, Claus Ibsen – Camel in Action / book ";

        // Initialize the mock and set expected results
        MockEndpoint mock = context.getEndpoint("mock:others", MockEndpoint.class);
        mock.expectedMessageCount(1);

        // Only the book order item is sent to the mock
        // (because it is not a cd or dvd)
        String bookBody = "Claus Ibsen – Camel in Action / book".toUpperCase();
        mock.expectedBodiesReceived(bookBody);

        // ProducerTemplate sends a message (i.e. a File) to the inbox directory
        template.sendBodyAndHeader("file://target/inbox", bodyOfMessage, Exchange.FILE_NAME, "order.csv");

        // Give the asynchronous file route a moment to process the message
        Thread.sleep(3000);

        // Was the file moved to the outbox directory?
        File target = new File("target/outbox/dvd/order.csv");
        assertTrue("File not moved!", target.exists());

        // Was the file transformed correctly (i.e. to uppercase)?
        String content = context.getTypeConverter().convertTo(String.class, target);
        String dvdbody = "Harry Potter / dvd".toUpperCase();
        assertEquals(dvdbody, content);

        // Was the book order (i.e. "Camel in Action", which is not a cd or dvd) sent to the mock?
        mock.assertIsSatisfied();
    }
}
The setup method creates an instance of CamelContext (and does some additional stuff). Afterwards, the route is added so that it can be tested. The test itself creates a mock and sets its expectations. Then, the producer template sends a message to the "from" endpoint of the route. Finally, some assertions validate the results. The test can be run the same way as any other JUnit test: directly within the IDE or inside a build script. Even agile test-driven development (TDD) is possible: first the Camel test is written, then the corresponding route is implemented.

If you want to learn more about Apache Camel, the first address should be the book "Camel in Action" [8], which describes all the basics and many advanced features in detail, including working code examples for each chapter. After whetting your appetite, let's now discuss when to use Apache Camel…

Alternatives for Systems Integration

Figure 3 shows three alternatives for integrating applications:

  • Own custom solution: Implement an individual solution that works for your problem without separating the problem into little pieces. This works and is probably the fastest alternative for small use cases, but you have to code everything yourself.
  • Integration framework: Use a framework which helps you integrate applications in a standardized way using several integration patterns. It reduces effort a lot: every developer will easily understand what you did, and you do not have to reinvent the wheel each time.
  • Enterprise Service Bus (ESB): Use an ESB to integrate your applications. Under the hood, the ESB often uses an integration framework as well, but it offers much more functionality, such as business process management, a registry or business activity monitoring. You can usually configure routing and the like within a graphical user interface (you have to decide on your own whether that reduces complexity and effort). An ESB is usually a complex product, and the learning curve is much steeper than for a lightweight integration framework. In return, you get a very powerful tool which should fulfill all your requirements in large integration projects.

If you decide to use an integration framework, you still have three good alternatives in the JVM environment: Spring Integration [9], Mule [10], and Apache Camel. They are all lightweight, easy to use and implement the EIPs. Therefore, they offer a standardized way to integrate applications and can be used even in very complex integration projects. A more detailed comparison of these three integration frameworks can be found at [11].

My personal favorite is Apache Camel, due to its awesome Java, Groovy and Scala DSLs combined with its many supported technologies. Spring Integration and Mule only offer XML configuration. I would only use Mule if I needed one of its awesome unique connectors to proprietary products (such as SAP, Tibco Rendezvous, Oracle Siebel CRM, Paypal or IBM's CICS Transaction Gateway). I would only use Spring Integration in an existing Spring project where only widespread technologies such as FTP, HTTP or JMS need to be integrated. In all other cases, I would use Apache Camel.

Nevertheless, no matter which of these lightweight integration frameworks you choose, you will have much fun realizing complex integration projects easily and with low effort. Remember: a fat ESB often has too much functionality, and therefore too much unnecessary complexity and effort. Use the right tool for the right job!

Apache Camel is ready for Enterprise Integration Projects

Apache Camel already celebrated its fourth birthday in July 2011 [12] and is a very mature and stable open source project. It supports all requirements for use in enterprise projects, such as error handling, transactions, scalability, and monitoring. Commercial support is also available.

Its most important gains are its DSLs, its many components for almost every thinkable technology, and the fact that the same syntax and concepts can always be used, even for automatic tests, no matter which technologies have to be integrated. Therefore, Apache Camel should always be evaluated as a lightweight alternative to heavyweight ESBs. Get started by downloading the example of this article. If you need any help or further information, there is a great community and a well-written book available.


Apache Camel Tutorial – Introduction to EIP, Routes, Components, Testing, and other Concepts from our
JCG partner Kai Wahner at the
Blog about Java EE / SOA / Cloud Computing blog.


Mocks And Stubs – Understanding Test Doubles With Mockito


A common thing I come across is that teams using a mocking framework assume they are mocking.

They are not aware that Mocks are just one of a number of 'Test Doubles' which Gerard Meszaros has categorised.

It's important to realise that each type of test double has a different role to play in testing. In the same way that you need to learn different patterns or refactorings, you need to understand the primitive roles of each type of test double. These can then be combined to achieve your testing needs.

I’ll cover a very brief history of how this classification came about, and how each of the types differs.

I’ll do this using some short, simple examples in Mockito.

A Very Brief History

For years people have been writing lightweight versions of system components to help with testing. In general this was called stubbing. In 2000, the article 'Endo-Testing: Unit Testing with Mock Objects' introduced the concept of a Mock Object. Since then Stubs, Mocks and a number of other types of test objects have been classified by Meszaros as Test Doubles.

This terminology has been referenced by Martin Fowler in 'Mocks Aren't Stubs' and is being adopted within the Microsoft community, as shown in 'Exploring The Continuum of Test Doubles'.

A link to each of these important papers is given in the reference section.

Categories of test doubles

The diagram above shows the commonly used types of test double. A good cross reference to each of the patterns and their features, as well as alternative terminology, can be found online.


Mockito is a test spy framework and is very simple to learn. Notably, with Mockito the expectations of mock objects are not defined before the test, as they sometimes are in other mocking frameworks. This leads to a more natural style (IMHO) when beginning mocking.

The following examples are here purely to give a simple demonstration of using Mockito to implement the different types of test doubles.

There are a much larger number of specific examples of how to use Mockito on the website.

Test Doubles with Mockito

Below are some basic examples using Mockito to show the role of each test double as defined by Meszaros.

I’ve included a link to the main definition for each so you can get more examples and a complete definition.

Dummy Object

This is the simplest of all of the test doubles: an object that has no implementation and is used purely to populate arguments of method calls which are irrelevant to your test.

For example, the code below uses a lot of code to create a customer which is not important to the test.

The test couldn’t care less which customer is added, as long as the customer count comes back as one.

public Customer createDummyCustomer() {
 County county = new County("Essex");
 City city = new City("Romford", county);
 Address address = new Address("1234 Bank Street", city);
 Customer customer = new Customer("john", "dobie", address);
 return customer;
}

@Test
public void addCustomerTest() {
 Customer dummy = createDummyCustomer();
 AddressBook addressBook = new AddressBook();
 addressBook.addCustomer(dummy);
 assertEquals(1, addressBook.getNumberOfCustomers());
}

We actually don't care about the contents of the customer object, but it is required. We can try a null value, but if the code is correct you would expect some kind of exception to be thrown.

@Test(expected = Exception.class)
public void addNullCustomerTest() {
 Customer dummy = null;
 AddressBook addressBook = new AddressBook();
 addressBook.addCustomer(dummy);
}
To avoid this we can use a simple Mockito dummy to get the desired behaviour.

@Test
public void addCustomerWithDummyTest() {
 Customer dummy = mock(Customer.class);
 AddressBook addressBook = new AddressBook();
 addressBook.addCustomer(dummy);
 assertEquals(1, addressBook.getNumberOfCustomers());
}

It is this simple code which creates a dummy object to be passed into the call.

Customer dummy = mock(Customer.class);

Don’t be fooled by the mock syntax – the role being played here is that of a dummy, not a mock.

It’s the role of the test double that sets it apart, not the syntax used to create one.

This class works as a simple substitute for the customer class and makes the test very easy to read.
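To underline that the role, not the tool, defines the double, here is a hand-rolled sketch with no framework at all (the MailService collaborator and its implementations are hypothetical names of mine, not from this article):

```java
// Hand-rolled test doubles for a hypothetical MailService collaborator.
// The dummy merely fills a parameter slot and is never expected to matter;
// the recording implementation plays the mock's role by counting calls so
// the test can verify behaviour afterwards.
interface MailService {
    void send(String message);
}

class DummyMailService implements MailService {
    public void send(String message) {
        // intentionally empty: a dummy should never be exercised by the test
    }
}

class RecordingMailService implements MailService {
    int sentCount = 0;

    public void send(String message) {
        sentCount++; // record the call for later verification
    }
}
```

A `mock(Customer.class)` in Mockito playing the dummy role is doing no more than `DummyMailService` does here; the framework just saves you the boilerplate.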

Test stub

The role of the test stub is to return controlled values to the object being tested. These are described as indirect inputs to the test. Hopefully an example will clarify what this means.

Take the following code

public class SimplePricingService implements PricingService {

 PricingRepository repository;

 public SimplePricingService(PricingRepository pricingRepository) {
  this.repository = pricingRepository;
 }

 public Price priceTrade(Trade trade) {
  return repository.getPriceForTrade(trade);
 }

 public Price getTotalPriceForTrades(Collection<Trade> trades) {
  Price totalPrice = new Price();
  for (Trade trade : trades) {
   Price tradePrice = repository.getPriceForTrade(trade);
   totalPrice = totalPrice.add(tradePrice);
  }
  return totalPrice;
 }
}

The SimplePricingService has one collaborating object, the pricing repository. The repository provides trade prices to the pricing service through the getPriceForTrade method.

For us to test the business logic in the SimplePricingService, we need to control these indirect inputs,

i.e. inputs we never passed into the test.

This is shown below.

In the following example we stub the PricingRepository to return known values which can be used to test the business logic of the SimplePricingService.

@Test
public void testGetHighestPricedTrade() throws Exception {
  Price price1 = new Price(10);
  Price price2 = new Price(15);
  Price price3 = new Price(25);

  PricingRepository pricingRepository = mock(PricingRepository.class);
  when(pricingRepository.getPriceForTrade(any(Trade.class)))
    .thenReturn(price1, price2, price3);

  PricingService service = new SimplePricingService(pricingRepository);
  Price highestPrice = service.getHighestPricedTrade(getTrades());

  assertEquals(price3.getAmount(), highestPrice.getAmount());
}
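The same indirect-input idea works without any framework at all. Here is a hand-rolled stub sketch (the interface and class names are illustrative, not the article's classes): the stub feeds a fixed, controlled price into the logic under test, so the arithmetic can be verified deterministically.

```java
import java.util.List;

// A hand-rolled stub: priceFor always returns the same controlled value,
// the "indirect input" that the test never passes in directly.
interface PriceSource {
    int priceFor(String trade);
}

class FixedPriceSource implements PriceSource {
    private final int price;

    FixedPriceSource(int price) {
        this.price = price;
    }

    public int priceFor(String trade) {
        return price; // controlled value, independent of the trade
    }
}

// The code under test: sums the prices its collaborator reports.
class TotalPricer {
    private final PriceSource source;

    TotalPricer(PriceSource source) {
        this.source = source;
    }

    int totalFor(List<String> trades) {
        return trades.stream().mapToInt(source::priceFor).sum();
    }
}
```

Mockito's when/thenReturn generates this kind of class for you at runtime; the role being played is identical.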

Saboteur Example

There are two common variants of Test Stubs: Responders and Saboteurs.

Responders are used to test the happy path, as in the previous example.

A saboteur is used to test exceptional behaviour, as below.

@Test(expected = TradeNotFoundException.class)
public void testInvalidTrade() throws Exception {

  Trade trade = new FixtureHelper().getTrade();

  TradeRepository tradeRepository = mock(TradeRepository.class);
  when(tradeRepository.getTradeById(anyLong()))
    .thenThrow(new TradeNotFoundException());

  TradingService tradingService = new SimpleTradingService(tradeRepository);
  tradingService.getTradeById(trade.getId());
}
Mock Object

Mock objects are used to verify object behaviour during a test. By object behaviour I mean we check that the correct methods and paths are exercised on the object when the test is run.

This is very different to the supporting role of a stub which is used to provide results to whatever you are testing.

In a stub we use the pattern of defining a return value for a method:

when(pricingRepository.getPriceForTrade(any(Trade.class))).thenReturn(price1);

In a mock we check the behaviour of the object using the following form:

verify(auditService).logNewTrade(trade);
Here is a simple example where we want to test that a new trade is audited correctly.

Here is the main code.

public class SimpleTradingService implements TradingService {

  TradeRepository tradeRepository;
  AuditService auditService;

  public SimpleTradingService(TradeRepository tradeRepository,
                              AuditService auditService) {
    this.tradeRepository = tradeRepository;
    this.auditService = auditService;
  }

  public Long createTrade(Trade trade) throws CreateTradeException {
    Long id = tradeRepository.createTrade(trade);
    auditService.logNewTrade(trade);
    return id;
  }
}
The test below creates a stub for the trade repository and a mock for the AuditService. We then call verify on the mocked AuditService to make sure that the TradingService calls its logNewTrade method correctly.

@Mock
TradeRepository tradeRepository;

@Mock
AuditService auditService;

@Test
public void testAuditLogEntryMadeForNewTrade() throws Exception {
  Trade trade = new Trade("Ref 1", "Description 1");

  TradingService tradingService
    = new SimpleTradingService(tradeRepository, auditService);
  tradingService.createTrade(trade);

  verify(auditService).logNewTrade(trade);
}
The following line does the checking on the mocked AuditService:

verify(auditService).logNewTrade(trade);
This test allows us to show that the audit service behaves correctly when creating a trade.

Test Spy

It’s worth having a look at the above link for the strict definition of a Test Spy.

However, in Mockito I like to use it to allow you to wrap a real object and then verify or modify its behaviour to support your testing.

Here is an example where we check the standard behaviour of a List. Note that we can both verify that the add method is called and also assert that the item was added to the list.

@Spy
List<String> listSpy = new ArrayList<String>();

@Test
public void testSpyReturnsRealValues() throws Exception {
 String s = "dobie";
 listSpy.add(new String(s));

 verify(listSpy).add(s);
 assertEquals(1, listSpy.size());
}

Compare this with using a mock object where only the method call can be validated. Because we only mock the behaviour of the list, it does not record that the item has been added and returns the default value of zero when we call the size() method.

@Mock
List<String> listMock;

@Test
public void testMockReturnsZero() throws Exception {
 String s = "dobie";
 listMock.add(new String(s));

 verify(listMock).add(s);
 assertEquals(0, listMock.size());
}


Another useful feature of the test spy is the ability to stub return calls. When this is done, the object behaves as normal until the stubbed method is called.

In this example we stub the get method to always throw a RuntimeException. The rest of the behaviour remains the same.

@Test(expected = RuntimeException.class)
public void testSpyReturnsStubbedValues() throws Exception {
 listSpy.add(new String("dobie"));
 assertEquals(1, listSpy.size());

 when(listSpy.get(anyInt())).thenThrow(new RuntimeException());
 listSpy.get(0);
}

In this example we again keep the core behaviour but change the size() method to return 1 initially and 5 for all subsequent calls.

@Test
public void testSpyReturnsStubbedValues2() throws Exception {
 int size = 5;
 when(listSpy.size()).thenReturn(1, size);

 int mockedListSize = listSpy.size();
 assertEquals(1, mockedListSize);

 mockedListSize = listSpy.size();
 assertEquals(5, mockedListSize);

 mockedListSize = listSpy.size();
 assertEquals(5, mockedListSize);
}

This is pretty Magic!

Fake Object

Fake objects are usually hand-crafted, lightweight objects used only for testing and not suitable for production. A good example would be an in-memory database or a fake service layer.

They tend to provide much more functionality than standard test doubles, and as such are probably not usually candidates for implementation using Mockito. That's not to say that they couldn't be constructed as such, just that it's probably not worth implementing them this way.
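A minimal hand-written fake might look like this sketch (the store and its methods are illustrative names of mine, not from this article): unlike a stub, it genuinely stores and returns data, yet it would be unfit for production.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative fake: a working in-memory "database" of customer names.
// It honours insert/find semantics for the duration of a test, but has no
// persistence, concurrency control, or any other production concern.
public class InMemoryCustomerStore {
    private final Map<Long, String> rows = new HashMap<>();
    private long nextId = 1;

    public long insert(String name) {
        long id = nextId++;
        rows.put(id, name); // really stored, but only in memory
        return id;
    }

    public String findById(long id) {
        return rows.get(id);
    }
}
```

This is why fakes are usually hand-crafted classes rather than Mockito constructions: the value lies in their real, if simplified, behaviour.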

Mocks And Stubs – Understanding Test Doubles With Mockito from our
JCG partner John Dobie at the
Agile Engineering Techniques blog.


HDFS for dummies

Whenever a newbie wants to start learning Hadoop, the number of elements in a Hadoop stack can be mind-boggling and at times difficult to comprehend. I am trying to demystify the whole stack and explain the basic pieces in my own way. Before we start talking about the Hadoop stack, let us take a step back and try to understand what led to the origins of Hadoop.

Problem – With the proliferation of the internet, the amount of data being stored keeps growing. Let's take the example of a search engine (like Google) that needs to index the large amount of data being generated. The search engine crawls and indexes the data, and the index data is stored on and retrieved from a single storage device. As the data generated grows, the search index data keeps increasing.

As the number of queries to access the data increases, the current file system I/O becomes inadequate for retrieving large amounts of data simultaneously, and the model of one large single storage device starts becoming a bottleneck. To overcome the problem, we move from a single-disk file system to a clustered file system. But as the amount of data keeps growing, the amount of data that can fit on one machine again becomes a bottleneck.

As data reaches terabytes, existing file-system-based solutions start faltering. Data access, multiple writers and large file sizes soon become a problem in scaling up the system.

Solution – To overcome these problems, a distributed file system was conceived. The solution tackles the problem as follows:

  • When dealing with large files, I/O becomes a big bottleneck. So, we divide the files into small blocks and store them on multiple machines. [Block Storage]
  • When we need to read the file, the client sends a request to multiple machines; each machine sends back a block of the file, and the blocks are then combined to piece together the whole file.
  • With the advent of block storage, data access becomes distributed, which leads to faster retrieval/writes.
  • As the data blocks are stored on multiple machines, the same block can be replicated across machines, removing the single point of failure. If one machine goes down, the client can request the block from another machine.
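The first two bullets above can be sketched in a few lines of plain Java (a conceptual toy, not HDFS code: real HDFS blocks are tens of megabytes, not a few characters): split the data into fixed-size blocks, then piece them back together on read.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of block storage: divide data into fixed-size blocks,
// then reassemble them in order to recover the original content.
public class BlockStore {

    public static List<String> split(String data, int blockSize) {
        List<String> blocks = new ArrayList<>();
        for (int i = 0; i < data.length(); i += blockSize) {
            // each block holds at most blockSize characters
            blocks.add(data.substring(i, Math.min(i + blockSize, data.length())));
        }
        return blocks;
    }

    public static String reassemble(List<String> blocks) {
        // concatenating the blocks in order recovers the whole "file"
        return String.join("", blocks);
    }
}
```

In a real distributed file system, each element of the returned list would live on a different machine, and each block would additionally be replicated; the read path is the same idea of fetching blocks and piecing them together.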

Now, any solution that implements file storage as blocks needs to have the following characteristics:

  • Manage the metadata information – since a file gets broken into multiple blocks, somebody needs to keep track of the number of blocks and the storage of these blocks on different machines [NameNode]
  • Manage the stored blocks of data and fulfill the read/write requests [DataNodes]

So, in the context of Hadoop: the
NameNode is the arbitrator and repository for all metadata. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. Together, these components form the distributed file system called
HDFS (Hadoop Distributed File System).
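The division of labour described above can be modelled as two maps (a toy sketch of my own, not Hadoop code): the "NameNode" side only tracks which blocks make up a file and where each block's replicas live; the block bytes themselves would sit on the DataNodes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of NameNode metadata: file -> ordered block ids, and
// block id -> the DataNodes holding a replica. No block data is kept here.
public class NameNodeMetadata {

    private final Map<String, List<Integer>> fileBlocks = new HashMap<>();
    private final Map<Integer, List<String>> blockLocations = new HashMap<>();

    public void addFile(String name, List<Integer> blockIds) {
        fileBlocks.put(name, blockIds);
    }

    public void placeBlock(int blockId, List<String> dataNodes) {
        blockLocations.put(blockId, dataNodes);
    }

    // A client asks: where do I fetch the n-th block of this file?
    public List<String> locate(String file, int blockIndex) {
        return blockLocations.get(fileBlocks.get(file).get(blockIndex));
    }
}
```

A read then becomes: ask the metadata for the locations of each block, fetch the blocks from the DataNodes, and piece the file together; replication simply means each block id maps to more than one node.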


HDFS has built-in redundancy and replication features that ensure any machine failure can be dealt with without any loss of data. HDFS rebalances itself whenever a new data node is added to the cluster or an existing data node fails.

In addition to the distributed file system called HDFS (Hadoop Distributed File System), there are two other core components:

  • Hadoop Common – a set of utilities that support the other Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
  • Hadoop MapReduce – a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
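To give a feel for the MapReduce programming model, here is the classic word-count illustration shrunk to a single JVM (a conceptual sketch, not Hadoop API code): the "map" step emits a count per word, and the "reduce" step sums the counts per key; the real framework does the same thing distributed across the cluster's nodes.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Single-JVM sketch of the MapReduce word-count idea: map each word to a
// count of one, then reduce by summing the counts per distinct word.
public class WordCount {

    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(word -> !word.isEmpty())
                // groupingBy plays the "shuffle" role; counting() is the reduce
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
    }
}
```

In Hadoop, the input text would be read block by block from HDFS, mappers would run on the nodes holding those blocks, and reducers would combine the partial counts, but the map and reduce functions carry the same logic as above.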

So, effectively, when you start working with Hadoop, HDFS and Hadoop MapReduce are the first two things you encounter. I will cover MapReduce in subsequent posts.

HDFS for dummies from our
JCG partner Munish K Gupta at the
Tech Spot blog.
