Hadoop Tutorial

Showing posts from September, 2017

September 17, 2017

Meaning of Spark SQL: Spark SQL is a programming module for working with structured data using the Data Frame and Data Set abstractions. Spark SQL also provides good query optimization. Data can be queried from inside Spark programs, or by external tools that connect to Spark SQL through JDBC and ODBC connectors. In this way Spark SQL acts as a SQL query engine. Features of Spark SQL: Integrated – Spark SQL mixes SQL queries with Spark programs, so this tight integration lets us run SQL queries alongside complex analytic programs. Unified Data Access – In Spark SQL we can load and query data from a variety of sources. Standard Connectivity – Spark SQL includes a server mode with standard JDBC and ODBC connectors. Scalability – Spark SQL uses the same engine for both interactive and long-running queries. Spark SQL Data Frames: A Data Frame is a distributed collection of data organized into named columns. Data Frames are equivalent to relational tables o...

September 12, 2017

Apache Hive Data Types: Hive is a data warehousing tool used to process data stored in Hadoop and HDFS. Hive is similar to SQL because it analyzes and processes data through a query language. In this article we discuss the basic data types for Hive query processing. Hive data types are classified into four groups, as follows: Column Types, Literals, Null Values, and Complex Types. Column Types: 1. Integral Types: Integral data types are used for integer data and are declared with the INT family of keywords. There are four integral types: TINYINT (1-byte signed integer, from -128 to 127), SMALLINT (2-byte signed integer, from -32,768 to 32,767), INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647), and BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807). 2. String Types: String values are written in single quotes or double quotes. The string types include CHAR and VARCHAR. CHAR – CHAR is the f...
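
The integral ranges listed above follow directly from the byte widths: an n-byte signed integer covers -2^(8n-1) to 2^(8n-1) - 1. The short Python sketch below reproduces those ranges. It is plain Python, not Hive code, and the names `signed_range`, `HIVE_INTEGRAL_BYTES`, and `smallest_integral_type` are hypothetical helpers made up for illustration.

```python
# Byte widths of the four Hive integral types, as listed in the post.
HIVE_INTEGRAL_BYTES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of this width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_integral_type(value):
    """Return the narrowest listed type whose range contains `value`.

    Illustrative helper only; Hive itself does not expose such a function.
    """
    # Dicts keep insertion order, so types are checked from narrowest to widest.
    for name, num_bytes in HIVE_INTEGRAL_BYTES.items():
        low, high = signed_range(num_bytes)
        if low <= value <= high:
            return name
    raise ValueError(f"{value} does not fit in any listed integral type")


if __name__ == "__main__":
    for name, num_bytes in HIVE_INTEGRAL_BYTES.items():
        print(name, signed_range(num_bytes))
    print(smallest_integral_type(40000))
```

Running the script prints the range for each type, which can be compared against the figures in the excerpt.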

September 05, 2017

Types of Joins and Counters in Apache MapReduce: What is MapReduce? MapReduce is a processing technique and programming model for distributed computing, based on Java. It contains two important tasks, Map and Reduce. The Map task takes a data set and converts it into another data set in which individual elements are broken down into key/value pairs. The Reduce task takes the output of the Map task and combines those pairs into a smaller set of tuples. MapReduce Joins: A MapReduce join combines two datasets, and implementing it takes a fair amount of code. The join strategy depends on the size of the datasets. If one dataset is much smaller than the other, the small dataset is distributed to all data nodes. After distribution, the small dataset is matched against the records of the large dataset, and the matching records are combined to form the output records. Types of Joins: Map-side Join – A map-side join is a join performed by the mappers. The join is performed before the data is consumed by the map function, so the input to every map must already be partitioned and in sorted order. ...
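
The small-dataset strategy described above (copy the small dataset to every node, then match it against the large one) can be sketched in plain Python. This is only an illustration of the idea, not the Hadoop Java API: the function name `map_side_join` and the sample `departments` and `employees` records are assumptions made up for the example, and each record is assumed to be a `(key, value)` pair.

```python
def map_side_join(small_records, large_records):
    """Inner-join two lists of (key, value) pairs on their keys.

    The small dataset is loaded into an in-memory lookup table, standing in
    for the copy that each mapper would hold; the large dataset is then
    streamed through once, as a mapper would read its input split.
    """
    # Build the lookup table from the small dataset.
    lookup = {}
    for key, value in small_records:
        lookup.setdefault(key, []).append(value)

    # Stream the large dataset and emit one output record per match.
    joined = []
    for key, large_value in large_records:
        for small_value in lookup.get(key, []):
            joined.append((key, large_value, small_value))
    return joined


if __name__ == "__main__":
    # Hypothetical sample data: department id -> name, and department id -> employee.
    departments = [(10, "Sales"), (20, "Engineering")]
    employees = [(10, "Alice"), (20, "Bob"), (30, "Carol"), (10, "Dave")]
    for record in map_side_join(departments, employees):
        print(record)
```

Because the lookup table lives in memory, this approach only makes sense when one side of the join is small; records in the large dataset with no matching key (department 30 here) are simply dropped.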