Big Data Dictionary

Storm

The Hadoop framework and its related large-scale data processing technologies (e.g. Pig and Hive) have been mainly designed for supporting batch processing tasks but they are not adequate for supporting real-time stream processing tasks. The ubiquity of mobile devices, location services, sensor pervasiveness and real-time network monitoring have created the crucial need for building scalable and parallel architectures to process vast amounts of streamed data.

The Storm system has been presented by Twitter as a distributed and fault-tolerant stream processing system that instantiates the fundamental principles of Actor theory. The key design principles of Storm are:

Horizontally scalable: Computations and data processing are performed in parallel using multiple threads, processes and machines.
Guaranteed message processing: The system guarantees that each message will be fully processed at least once. The system takes care of replaying messages from the source when a task fails.
Fault-tolerant: If there are faults during execution of the computation, the system will reassign tasks as necessary.
Programming language agnostic: Storm tasks and processing components can be defined in any language, making Storm accessible to nearly anyone. Clojure, Java, Ruby, Python are supported by default. Support for other languages can be added by implementing a simple Storm communication protocol.

The core abstraction in Storm is the stream. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. The basic primitives Storm provides for performing stream transformations are spouts and bolts. A spout is a source of streams. A bolt consumes any number of input streams, carries out some processing, and possibly emits new streams. Complex stream transformations, such as the computation of a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream. Links between nodes in a topology indicate how tuples should be passed around. Each node in a Storm topology executes in parallel. In any topology, we can specify how much parallelism is required for each node, and then Storm will spawn that number of threads across the cluster to perform the execution.

The above figure illustrates a sample Storm topology. In practice, The Storm system relies on the notion of stream grouping to specify how tuples are sent between processing components. In other words, it defines how that stream should be partitioned among the bolt's tasks. In particular, Storm supports dierent types of stream groupings such as:

Shuffle grouping where stream tuples are randomly distributed such that each bolt is guaranteed to get an equal number of tuples.
Fields grouping where the tuples are partitioned by the elds specied in the grouping.
All grouping where the stream tuples are replicated across all the bolts.
Global grouping where the entire stream goes to a single bolt.

In addition to the supported built-in stream grouping mechanisms, the Storm system allows its users to dene their own custom grouping mechanisms. In general, a Storm cluster is supercially similar to a Hadoop cluster. One key dierence is that a MapReduce job eventually nishes while a Storm job processes messages forever (or until the user kills it). In principle, there are two kinds of nodes on a Storm cluster:

The Master node runs a daemon called Nimbus (similar to Hadoop's Job- Tracker) which is responsible for distributing code around the cluster, assigning tasks to machines, and handling failures.
The worker nodes run a daemon called the Supervisor. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.

The above figure illustrates the architecture of Storm cluster where all the interactions between Nimbus and the Supervisors are done through a ZooKeeper cluster, an open source conguration and synchronization service for large distributed systems. Both the Nimbus daemon and Supervisor daemons are fail-fast and stateless, where all state is kept in ZooKeeper or on local disk. Communication between workers living on the same host or on dierent machines is based on ZeroMQ sockets over which serialized java objects (representing tuples) are being passed. Similar systems of Twitter Storm include Apache S4 and InfoSphere Streams.