The Hadoop framework and its related large-scale data processing technologies (e.g., Pig and Hive) were designed mainly to support batch processing tasks and are not adequate for real-time stream processing. The ubiquity of mobile devices, location services, pervasive sensors, and real-time network monitoring has created a crucial need for scalable, parallel architectures that can process vast amounts of streamed data.
The Storm system was presented by Twitter as a distributed and fault-tolerant stream processing system that instantiates the fundamental principles of Actor theory. Its key design principles are described in what follows.
The core abstraction in Storm is the stream. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. The basic primitives Storm provides for performing stream transformations are spouts and bolts. A spout is a source of streams. A bolt consumes any number of input streams, carries out some processing, and possibly emits new streams. Complex stream transformations, such as the computation of a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts.

A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream. Links between nodes in a topology indicate how tuples should be passed around. Each node in a Storm topology executes in parallel. In any topology, we can specify how much parallelism is required for each node, and then Storm will spawn that number of threads across the cluster to perform the execution.
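The spout/bolt pipeline described above can be sketched in plain Python. This is an illustrative stand-in, not the actual Storm API (which is Java-based); the `Spout` and `Bolt` classes and the word-count pipeline are hypothetical simplifications that show how tuples flow from a source through successive transformation steps.

```python
from collections import defaultdict

class Spout:
    """Illustrative stand-in for a Storm spout: a source of tuples."""
    def __init__(self, tuples):
        self.tuples = tuples
    def emit(self):
        yield from self.tuples

class Bolt:
    """Illustrative stand-in for a Storm bolt: consumes tuples, may emit new ones."""
    def __init__(self, fn):
        self.fn = fn
    def process(self, tup):
        return self.fn(tup)

# A toy two-step pipeline (analogous to a multi-bolt topology):
# one bolt splits sentences into words, a second bolt maintains counts.
split_bolt = Bolt(lambda sentence: sentence.split())

counts = defaultdict(int)
def count_word(word):
    counts[word] += 1
    return (word, counts[word])
count_bolt = Bolt(count_word)

spout = Spout(["storm storm", "hadoop storm"])
for sentence in spout.emit():          # spout emits raw tuples
    for word in split_bolt.process(sentence):   # first transformation step
        count_bolt.process(word)                # second transformation step
```

In real Storm, each bolt would run as many parallel tasks spread across the cluster rather than as a sequential function call.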
The above figure illustrates a sample Storm topology. In practice, the Storm system relies on the notion of stream grouping to specify how tuples are sent between processing components. In other words, it defines how a stream should be partitioned among the bolt's tasks. In particular, Storm supports different types of stream groupings such as:

- Shuffle grouping: tuples are randomly distributed across the bolt's tasks so that each task receives an approximately equal number of tuples.
- Fields grouping: the stream is partitioned by the values of a specified subset of its fields, so tuples with the same field values are always routed to the same task.
- All grouping: the stream is replicated across all of the bolt's tasks, so every task receives a copy of each tuple.
- Global grouping: the entire stream is routed to a single task of the bolt.
- Direct grouping: the producer of a tuple decides which task of the consuming bolt will receive it.
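The difference between the two most common groupings can be sketched as routing functions that map a tuple to a task index. This is a minimal illustration, not Storm's internal implementation; the function names `shuffle_grouping` and `fields_grouping` are hypothetical, and CRC32 stands in for whatever hash Storm actually uses.

```python
import itertools
import zlib

def shuffle_grouping(num_tasks):
    """Round-robin assignment: balances load, gives no locality guarantee."""
    rr = itertools.cycle(range(num_tasks))
    return lambda tup: next(rr)

def fields_grouping(field_index, num_tasks):
    """Hash a chosen field so equal field values always reach the same task."""
    def assign(tup):
        key = str(tup[field_index]).encode()
        return zlib.crc32(key) % num_tasks
    return assign

# Fields grouping routes every tuple with the same word to the same task,
# which is what makes per-key state (e.g. word counts) correct.
assign = fields_grouping(0, 4)
same_task = assign(("storm", 1)) == assign(("storm", 7))
```

Fields grouping is what makes stateful bolts such as counters work: all updates for one key land on one task, so no cross-task coordination is needed.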
In addition to the supported built-in stream grouping mechanisms, the Storm system allows its users to define their own custom grouping mechanisms. In general, a Storm cluster is superficially similar to a Hadoop cluster. One key difference is that a MapReduce job eventually finishes while a Storm job processes messages forever (or until the user kills it). In principle, there are two kinds of nodes on a Storm cluster: the master node, which runs a daemon called Nimbus that is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures; and the worker nodes, each of which runs a daemon called the Supervisor that listens for work assigned to its machine and starts and stops worker processes as necessary.
The above figure illustrates the architecture of a Storm cluster, where all the interactions between Nimbus and the Supervisors are done through a ZooKeeper cluster, an open source configuration and synchronization service for large distributed systems. Both the Nimbus daemon and the Supervisor daemons are fail-fast and stateless, with all state kept in ZooKeeper or on local disk. Communication between workers living on the same host or on different machines is based on ZeroMQ sockets, over which serialized Java objects (representing tuples) are passed. Systems similar to Twitter's Storm include Apache S4 and InfoSphere Streams.
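The fail-fast, stateless design can be sketched as follows: because a daemon keeps no state of its own, a crashed daemon can simply be restarted and recover everything it needs from the external store. The `ExternalStore` class below is a hypothetical stand-in for ZooKeeper, and `Supervisor` is a simplified model, not Storm's actual daemon code.

```python
class ExternalStore:
    """Stand-in for ZooKeeper: the single place where cluster state lives."""
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

class Supervisor:
    """Fail-fast, stateless daemon: it holds no state across restarts."""
    def __init__(self, store, node_id):
        self.store = store
        self.node_id = node_id
    def accept_assignment(self, topology):
        # State goes straight to the external store, never to instance fields.
        self.store.write(self.node_id, topology)
    def current_assignment(self):
        return self.store.read(self.node_id)

store = ExternalStore()
daemon = Supervisor(store, "node-1")
daemon.accept_assignment("wordcount-topology")

# Simulate a crash: the daemon object is discarded and a fresh one starts.
daemon = Supervisor(store, "node-1")
recovered = daemon.current_assignment()
```

Because no state lives inside the daemon process, "fail fast and restart" becomes a safe recovery strategy rather than a data-loss event.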