The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which an application is divided into many small fragments of work, each of which can execute or re-execute on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are handled automatically by the framework, enabling applications to work with thousands of independent computers and petabytes of data. The Apache Hadoop platform is now commonly considered to consist of the Hadoop kernel, MapReduce and the Hadoop Distributed File System (HDFS), as well as a number of related projects including Apache Hive, Apache HBase, and others.
Hadoop is written in the Java programming language and is an Apache top-level project built and used by a global community of contributors. Hadoop and its related projects (Hive, HBase, ZooKeeper, and so on) draw contributors from across the ecosystem. Although Java code is most common, the "map" and "reduce" parts of a job can be implemented in any programming language through Hadoop Streaming.
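To make the MapReduce paradigm concrete, here is a minimal word-count sketch in Java against the org.apache.hadoop.mapreduce API. The class names and the input/output paths supplied on the command line are illustrative assumptions; the input is expected to be plain text already stored in HDFS.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: after the shuffle/sort, sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates each mapper's local output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The combiner runs the same summing logic on each mapper's output before the shuffle, reducing the data moved across the network. Submitting the job would look something like "hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output", where both paths are hypothetical.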
Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail, and multiple copies of all data are stored automatically (see the replication sketch after this list).
Flexibility. Unlike with traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data such as text, images and videos.
Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
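As a small illustration of the replication behind that fault tolerance, the sketch below uses Hadoop's Java FileSystem API to read and change a file's replication factor. The path and the factor of three are assumptions for the example; the cluster-wide default normally comes from the dfs.replication setting in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/data.csv");  // hypothetical HDFS path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Current replication: " + status.getReplication());

    // Ask HDFS to keep three copies of this file's blocks on different nodes,
    // so the data survives the loss of any single node.
    fs.setReplication(file, (short) 3);
  }
}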
MapReduce programming is not a good match for all problems. It is good for simple information requests and for problems that can be divided into independent units, but it is not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive: because the nodes do not intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple sets of intermediate files between MapReduce phases and is inefficient for advanced analytic computing (see the job-chaining sketch below).
There’s a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop: it is much easier to find programmers with SQL skills than MapReduce skills. In addition, Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.
Data security. Another challenge centers on fragmented data security, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.
Full-fledged data management and governance. Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance and metadata. Tools for data quality and standardization are especially lacking.
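To show the iterative pattern described above, where each pass writes its results to HDFS and the next pass reads them back, here is a hedged driver-loop sketch. The identity map/reduce stand-ins, the fixed five iterations and the /user/demo paths are all assumptions for illustration, not a prescribed design.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String input = "/user/demo/iter0";   // hypothetical starting data in HDFS

    // Each pass is a full map-shuffle/sort-reduce job whose output is written
    // back to HDFS; the next pass reads that output as its input. This
    // materialization between passes is what makes iterative algorithms
    // file-intensive under MapReduce.
    for (int i = 1; i <= 5; i++) {
      String output = "/user/demo/iter" + i;

      Job job = Job.getInstance(conf, "iteration " + i);
      job.setJarByClass(IterativeDriver.class);
      // Identity map/reduce stand-ins; a real algorithm would plug in its
      // own per-iteration logic here.
      job.setMapperClass(Mapper.class);
      job.setReducerClass(Reducer.class);
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(input));
      FileOutputFormat.setOutputPath(job, new Path(output));

      if (!job.waitForCompletion(true)) {
        System.exit(1);                  // stop if any pass fails
      }
      input = output;                    // chain: previous output becomes next input
    }
  }
}

The topic outline below walks through these components and the surrounding operational practices step by step.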
1. Why is Big Data important? What is Big Data? Characteristics of Big Data. Why should you care about Big Data? What are the possible options for analyzing Big Data?
2. Traditional Distributed Systems
3. Problems with traditional distributed systems
4. What is Hadoop? History of Hadoop. How does Hadoop solve the Big Data problem? Components of Hadoop
5. What is HDFS? How does HDFS work? Understanding the basic architecture.
6. What is MapReduce? How does MapReduce work?
7. How does Hadoop work as a system?
8. What is Pig? How does it work? Analyzing data using Pig.
9. What is Hive? How does it work? Analyzing data using Hive.
10. What is MapReduce? How does it work? An example.
11. What is Flume? How does it work? An example.
12. What is Sqoop? How does it work? An example.
13. What is Oozie? How does it work? An example.
15. Setting up a Virtual Machine
16. Installing the Hadoop ecosystem on a single node.
17. Understanding the configuration for single-node and multi-node installations.
18. Hands-on exercise.
19. Running your first MapReduce program
20. Hands-on using Pig, Hive, MapReduce and Sqoop.
21. Understanding how partitioners and combiners function in MapReduce.
22. Planning your Hadoop cluster: hardware and software considerations.
23. Scheduling in Hadoop
24. Monitoring your Hadoop Cluster
25. Monitoring tools available
26. Monitoring best practices
27. Administration best practices
28. Hadoop administration best practices
29. Tools of the trade