To put the data challenges into simple terms, large scientific research projects often require what would amount to centuries of computing time on a single machine. To complete such projects, and the ever larger ones expected in coming decades, new types of computing and novel uses of existing resources are critical.
Distributed Computing: The Whole Is Greater than the Sum of Its Parts
There are a number of new distributed computing systems, including Dryad, FlumeJava, Flink, and Spark, among others. The challenge of distributed computing is to take advantage of its efficiencies while minimizing the problems that come with resources spread across the world and very large amounts of information. Those who build distributed computing systems must be mathematically talented enough to create algorithms that handle the many disparate and unpredictable parts of the system. A common problem, for example, is correcting load imbalances while keeping distributed execution and fault tolerance robust.
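As a rough illustration of these concerns, the sketch below distributes tasks over a pool of workers and reschedules any task whose worker fails. This is a deliberately naive form of fault tolerance, not the scheduling logic of any of the systems named above; the task function and retry limits are invented for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(task_id):
    # Simulate an unreliable worker: roughly 30% of attempts fail.
    if random.random() < 0.3:
        raise RuntimeError(f"worker failed on task {task_id}")
    return task_id * task_id  # the task's "result"

def run_with_retries(tasks, max_retries=10, workers=4):
    """Distribute tasks over a pool; retry failed tasks on later passes."""
    results = {}
    pending = list(tasks)
    for _ in range(max_retries):
        if not pending:
            break
        failed = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(run_task, t): t for t in pending}
            for fut in as_completed(futures):
                task = futures[fut]
                try:
                    results[task] = fut.result()
                except RuntimeError:
                    failed.append(task)  # reschedule on the next pass
        pending = failed
    return results

results = run_with_retries(range(10))
```

Production schedulers must also decide *where* to rerun a failed task (data locality, straggler mitigation), which this sketch ignores entirely.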
According to NetLib.org, standalone workstations across the world are capable of providing tens of millions of operations per second each, and their power increases yearly. At the same time, networking technologies such as Ethernet, Fiber Distributed Data Interface (FDDI), High-Performance Parallel Interface (HiPPI), Synchronous Optical Network (SONET), and Asynchronous Transfer Mode (ATM) have advanced to the point that high-speed, scalable distributed computing networks can be built to keep pace with growing amounts of data for decades.
One example of distributed computing in use is the Search for Extraterrestrial Intelligence (SETI) project, which searches space for transmissions from other civilizations. SETI uses distributed cloud computing so that researchers in many locations around the world can share multiple computing systems. SETI runs the Apache Spark distributed computing framework through IBM's hosted service at a cost of $17 million a year, and case studies show the project has benefited greatly.
A good resource for information and research on distributed computing is the International Journal of Distributed Computing, whose forum covers original and timely contributions to the field. In a paper from Hewlett-Packard (HP), "Peer-to-Peer Computing," the topics of anonymity, cost of ownership, decentralization, and self-organization in peer-to-peer networks are presented. According to HP:
The term “peer-to-peer” (P2P) refers to a class of systems and applications that employ distributed resources to perform a critical function in a decentralized manner. With the pervasive deployment of computers, P2P is increasingly receiving attention in research, product development, and investment circles. This interest ranges from enthusiasm, through hype, to disbelief in its potential. Some of the benefits of a P2P approach include: improving scalability by avoiding dependency on centralized points; eliminating the need for costly infrastructure by enabling direct communication among clients; and enabling resource aggregation.
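To make "direct communication among clients" concrete, the following sketch connects two peers with plain Python sockets, with no central server mediating the exchange. Both peers run in one process purely for demonstration; the port choice and message contents are invented for the example.

```python
import socket
import threading

# One "peer" listens while another connects directly to it -- no central
# server mediates the exchange, which is the core of the P2P model.
def peer_listen(sock, received):
    conn, _ = sock.accept()
    with conn:
        received.append(conn.recv(1024).decode())
        conn.sendall(b"ack")

listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # bind to any free local port
listener.listen(1)
port = listener.getsockname()[1]

received = []
t = threading.Thread(target=peer_listen, args=(listener, received))
t.start()

# The second peer dials the first directly, peer to peer.
with socket.socket() as dialer:
    dialer.connect(("127.0.0.1", port))
    dialer.sendall(b"hello from peer B")
    reply = dialer.recv(1024).decode()

t.join()
listener.close()
```

Real P2P systems add the hard parts this omits: peer discovery, NAT traversal, and the self-organization the HP paper describes.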
IBM Apache Spark for Distributed Computing
IBM hired Forrester Research to study the costs and benefits of IBM Apache Spark for SETI. According to experts, distributed and grid computing are becoming the choice for high-performance computing applications; in projects similar to SETI, the huge amount of software infrastructure already developed around the world can be harnessed. In "A Survey of Market-Based Approaches to Distributed Computing," the "market-based" advantages of the distributed computing paradigm are discussed.
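Spark's core programming model transforms a distributed dataset through map, shuffle, and reduce phases. The sketch below imitates that model for a word count in plain Python; it is not Spark's actual API, and the input lines are invented for the example.

```python
from collections import defaultdict
from functools import reduce

lines = [
    "distributed computing at scale",
    "computing with distributed resources",
]

# "Map" phase: emit (word, 1) pairs from each line, as a flatMap/map would.
pairs = [(word, 1) for line in lines for word in line.split()]

# "Shuffle" phase: group the pairs by key (the word).
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# "Reduce" phase: sum the counts per word, as a reduceByKey would.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in grouped.items()}
```

In a real Spark job, each phase runs in parallel across the cluster and the shuffle moves data between machines; here everything happens in one process to show the shape of the computation.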
The following video is an overview of “IBM Analytics for Apache Spark.”