Apache Hadoop Oozie Tutorial

Introduction: Oozie is a workflow scheduler used to manage Hadoop jobs. It combines multiple jobs in a particular order to accomplish one larger task. It is an open-source framework and supports jobs for MapReduce, Hive, and HDFS, among others. An Oozie workflow is based on a Directed Acyclic Graph (DAG) and contains two kinds of nodes for managing jobs: action nodes and control-flow nodes. A key advantage of Oozie is that it integrates with the whole Hadoop stack, including MapReduce and HDFS jobs.

Oozie supports the following three types of jobs:
1. Workflow jobs – represent a sequence of actions to be executed.
2. Coordinator jobs – contain workflow jobs and are triggered by time.
3. Bundle jobs – contain workflow and coordinator jobs.

Types of Nodes in Apache Oozie:
Action Node – represents a unit of work in the workflow; the job programs are typically written in Java.
Control Flow Node – controls the path of execution through the workflow (for example, start, end, kill, decision, fork, and join nodes).
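The two node types above come together in a workflow definition file. As a minimal sketch (the workflow name, the `${jobTracker}`/`${nameNode}` parameters, and the mapper class are placeholders, not from the original post), a `workflow.xml` with one MapReduce action and the usual control-flow nodes might look like:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="example-wf">
  <start to="mr-action"/>             <!-- control-flow node: entry point -->
  <action name="mr-action">           <!-- action node: one unit of work -->
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.example.MyMapper</value> <!-- placeholder class name -->
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>                    <!-- control-flow edge on success -->
    <error to="fail"/>                <!-- control-flow edge on failure -->
  </action>
  <kill name="fail">                  <!-- control-flow node: abort -->
    <message>MapReduce action failed</message>
  </kill>
  <end name="end"/>                   <!-- control-flow node: exit point -->
</workflow-app>
```

Oozie walks this DAG from the `start` node, runs the action, and follows the `ok` or `error` edge depending on the outcome.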
Posts
Showing posts from June, 2017
Ten Amazing Big Data Myths

Big Data holds great promise for enterprises of all sizes. It can bring insights that help the business drive revenue and also understand gaps in services and products. Here are some myths about data:
1. Big data is new – Huge cross-references of every single word used in the Bible, called "concordances", were in use by scholar monks for centuries, well before the first databases.
2. Big data is made for big business – Enterprises of all sizes are now able to leverage big data analytics, thanks to recent improvements in cloud and data-management technology.
3. Bigger data is better – Quality of data wins over quantity of data. What to use is often more relevant than how much to use.
4. Our data is so messed up we can't possibly master big data – Advanced data-quality, master-data-management, and data-governance tools have made it easier to clean up the enterprise data mess.
5. Every problem is a big data problem – If you are matching a couple of fields
Top Five Big Data Tools

Big data tools store and analyze large amounts of structured, semi-structured, and unstructured data, helping to extract insight from it and saving a great deal of time. Most companies around the world use big data tools for accessing Hadoop data. Here we discuss the top five big data tools.
1. Cassandra: Cassandra is an open-source NoSQL database that handles data across multiple servers without a single point of failure. It serves data to online transaction applications and business-intelligence workloads. Cassandra was created at Facebook and is a highly scalable and fault-tolerant database. It is used by many big companies such as Facebook, eBay, Twitter, etc. One well-known Cassandra deployment holds over 300 TB of data across 400 machines.
2. MongoDB: MongoDB is an open-source, document-oriented big data tool created by 10gen. MongoDB is used to store data as flexible, JSON-like documents.
Apache HDFS Architecture and Components

HDFS stands for Hadoop Distributed File System; it is Hadoop's primary storage system and manages big data sets of high volume. HDFS stores data in a distributed manner across the cluster. HDFS allows files to be read and written, but files in HDFS cannot be updated in place. When we move a file into HDFS, that file is split into smaller blocks. HDFS is implemented with a master-slave architecture.

Main components of HDFS:
1. NameNode
2. Secondary NameNode
3. DataNode
4. Block

NameNode: The NameNode is the heart and master of Hadoop. It maintains the Hadoop namespace and stores the metadata of the data blocks; this metadata is persisted on local disk. The metadata itself is compact, so it takes relatively little disk space.

Secondary NameNode: The main role of the Secondary NameNode is to copy and merge the namespace image. It requires a large amount of memory to merge these files. If the NameNode fails, the namespace image stored by the Secondary NameNode can be used to restart the NameNode.

DataNode: DataNodes store the actual data blocks of HDFS.
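The block-splitting behavior described above is simple arithmetic, and can be sketched as follows (the 128 MB block size is a common HDFS default; the 300 MB file is a made-up example):

```python
import math

def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return the sizes of the blocks a file is split into when stored in HDFS.

    Every block except possibly the last is exactly block_size_bytes;
    the last block holds whatever remains.
    """
    if file_size_bytes == 0:
        return []
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    blocks = [block_size_bytes] * (num_blocks - 1)
    blocks.append(file_size_bytes - block_size_bytes * (num_blocks - 1))
    return blocks

# A 300 MB file splits into two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # → [128, 128, 44]
```

Note that the last block occupies only as much space as it needs; HDFS does not pad it out to the full block size.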
Types of Nodes in Hadoop

1. NameNode: The NameNode is the main node of HDFS, also called the master. It stores the metadata in RAM for quick access and tracks files across the Hadoop cluster. If the NameNode fails, the whole of HDFS becomes inaccessible, so the NameNode is very critical for HDFS. The NameNode tracks the health of the DataNodes through their heartbeats; the actual file data lives only on the DataNodes. The NameNode tracks all information about files, such as which file is saved where in the cluster, the access time of each file, and which user is accessing a file at the current time. There are two types of NameNode.
2. Secondary NameNode: The Secondary NameNode helps the primary NameNode by merging the namespaces. It stores a namespace checkpoint that is used to restart the NameNode after a failure. It requires a large amount of memory for this, so the Secondary NameNode runs on a different machine for memory-management reasons. The Secondary NameNode is the checkpointing node of the NameNode.
3. DataNode: DataNodes store the actual data of HDFS and are also called slaves. If a DataNode fails, the NameNode re-replicates its blocks to other DataNodes.
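The heartbeat-based health tracking described above can be modeled in a few lines of Python. This is a toy sketch, not the real Hadoop implementation: the class and method names are invented for illustration, and the 630-second timeout merely mirrors the commonly cited HDFS dead-node interval of about ten minutes.

```python
import time

class NameNodeHealthTracker:
    """Toy model of a NameNode marking DataNodes dead when heartbeats stop."""

    def __init__(self, timeout_seconds=630.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # DataNode id -> time of last heartbeat

    def heartbeat(self, datanode_id, now=None):
        """Record a heartbeat received from a DataNode."""
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def live_datanodes(self, now=None):
        """Return the DataNodes whose last heartbeat is within the timeout."""
        now = now if now is not None else time.time()
        return sorted(dn for dn, t in self.last_heartbeat.items()
                      if now - t <= self.timeout)

tracker = NameNodeHealthTracker(timeout_seconds=630.0)
tracker.heartbeat("dn1", now=0.0)
tracker.heartbeat("dn2", now=0.0)
tracker.heartbeat("dn1", now=600.0)        # dn1 keeps reporting; dn2 goes silent
print(tracker.live_datanodes(now=700.0))   # → ['dn1']
```

In real HDFS, once a DataNode drops out of the live set like `dn2` above, the NameNode schedules re-replication of its blocks from the surviving replicas.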