What is Hive?
--Hive is an open-source data warehouse tool for processing and querying data stored on HDFS
-->Whenever Hive needs table data, it reads it from HDFS
-->HDFS is the storage layer and MapReduce is the processing engine for Hive
Why Hive?
--To process data on HDFS and get a result back, we need MapReduce
--But MapReduce code is difficult and tedious to write
--Hive was created for this reason
--Hive queries are written in HQL (Hive Query Language). Hive converts them into MapReduce jobs, which read the data from HDFS and return the result in a structured format (tables).
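To see what Hive saves us from writing, here is a minimal Python sketch of the map, shuffle and reduce steps behind a simple count-per-group query. The table name employees, the column dept and the sample rows are made up for illustration. The sketch only mimics the idea in memory; it is not the code Hive actually generates.

```python
from collections import defaultdict

# Hypothetical HQL query that this sketch imitates (assumed table and column names).
HQL = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"

# Made-up rows standing in for lines of a file on HDFS: (name, dept)
ROWS = [
    ("asha", "sales"), ("ben", "hr"), ("chen", "sales"),
    ("dev", "it"), ("eli", "hr"), ("fay", "sales"),
]


def map_phase(rows):
    """Emit a (dept, 1) pair for every row."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key to get the row count per department."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Chain the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(HQL)
    print(count_by_dept(ROWS))
```

Three functions and some plumbing are needed for what HQL states in one line, and that gap is what Hive closes.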
What data is stored in Hive?
--Structured data, which lives in HDFS
--Table metadata (the schema), which is kept in an RDBMS (the metastore)
Why is metadata not stored in HDFS?
--Files in HDFS can't be edited in place once written, so metadata that needs frequent updates can't be kept there
--HDFS is built for large batch reads, not quick lookups, so it can't serve metadata with the low latency that is needed
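The split between data and metadata can be sketched in Python. Here an in-memory SQLite database stands in for the RDBMS that holds the schema, and a plain list of text lines stands in for a file on HDFS. The table columns_meta, the table name employees and all helper names are hypothetical; Hive's real metastore schema looks different.

```python
import sqlite3

# Stand-in for a write-once file on HDFS: raw comma-separated lines.
RAW_LINES = ["1,asha,sales", "2,ben,hr"]


def create_metastore():
    """Create a tiny, hypothetical metastore in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE columns_meta (tbl TEXT, pos INTEGER, col_name TEXT, col_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO columns_meta VALUES (?, ?, ?, ?)",
        [
            ("employees", 0, "id", "int"),
            ("employees", 1, "name", "string"),
            ("employees", 2, "dept", "string"),
        ],
    )
    return conn


def read_table(conn, tbl, lines):
    """Look up the schema in the metastore, then apply it to the raw lines."""
    schema = conn.execute(
        "SELECT col_name, col_type FROM columns_meta WHERE tbl = ? ORDER BY pos",
        (tbl,),
    ).fetchall()
    casts = {"int": int, "string": str}
    rows = []
    for line in lines:
        fields = line.split(",")
        rows.append(
            {name: casts[ctype](value) for (name, ctype), value in zip(schema, fields)}
        )
    return rows


def rename_column(conn, tbl, old, new):
    """Update metadata in place in the RDBMS; the data lines are never touched."""
    conn.execute(
        "UPDATE columns_meta SET col_name = ? WHERE tbl = ? AND col_name = ?",
        (new, tbl, old),
    )


if __name__ == "__main__":
    conn = create_metastore()
    print(read_table(conn, "employees", RAW_LINES))
    rename_column(conn, "employees", "dept", "department")
    print(read_table(conn, "employees", RAW_LINES))
```

Renaming the column is a single quick UPDATE in the relational store, and the raw lines stay exactly as they were written.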
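Why Hive when we have an RDBMS?
--Hive runs on a distributed cluster, and its queries are converted into MapReduce jobs that run in parallel across the nodes
--A traditional RDBMS typically runs on a single machine, so it can't spread one query across a cluster in the same way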
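The difference can be pictured with a small Python sketch: the data is cut into splits, each split is handled by its own worker, and the partial results are merged at the end. Threads on one machine stand in for cluster nodes, and the numbers and helper names are made up. It illustrates the idea only, not how Hive actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up amounts standing in for a large file on HDFS.
AMOUNTS = list(range(1, 101))


def make_splits(data, split_size):
    """Cut the data into fixed-size splits, like blocks of a distributed file."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]


def partial_sum(split):
    """Work done independently on one split (one stand-in 'node')."""
    return sum(split)


def distributed_sum(data, split_size=25, workers=4):
    """Process every split in its own worker, then merge the partial results."""
    splits = make_splits(data, split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, splits))
    return sum(partials)


if __name__ == "__main__":
    print(distributed_sum(AMOUNTS))
```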
Do we need to write MapReduce code to retrieve data from HDFS?
--No. We write queries in HQL, and Hive converts them internally into MapReduce tasks. Those tasks process the data and return the result set to Hive.




