Apache Hive Interview Questions:
Hive Interview Questions : 1)Explain what is Hive?
Hive is an ETL and Data warehousing tool developed on top of Hadoop Distributed File System (HDFS). Hive is an open-source-software that lets programmers analyze large data sets on Hadoop Cluster. It is a data warehouse framework for querying and analysis of data that is stored in HDFS.
2)Mention what are the different types of tables available in Hive?
There are two types of tables available in Hive.
- Managed table: In managed table, both the data and schema are under control of Hive
- External table: In the external table, only the schema is under the control of Hive.
3)Why do we need Hive?
Hive is a tool in Hadoop ecosystem which provides an interface to organize and query data in a database like fashion and write SQL like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
4)Can we run unix shell commands from hive? Give example?
Yes, using the ! mark just before the command. For example !ls at hive prompt will list all files and directory.
5)Why is Hive not suitable for OLTP systems?
Hive is not suitable for OLTP systems because it does not provide insert and update function at the row level.
6)What is Hive a metastore?
Hive Metastore is a central repository that stores metadata in external database. Hive contains two things: data and the metadata. The metadata contains the (column names, partitions information, bucketing information, SerDe etc.) i.e., the data about the actual table this is by default stored in the Derby database, we can also configure it to Oracle or MySQL database.
7)What is the limitation of Derby database for Hive metastore?
Derby database runs in the local mode and it creates a log file so that multiple users cannot access Hive simultaneously. With derby database, you cannot have multiple connections or multiple session’s instantiated at the same time.
8)What is the default location to save data in Hive?
The default location to save data in Hive:
9)What are the types of tables in Hive?
There are three types of tables:
- Managed Table
- External Table
- Partitioned Table
10)Is HQL language is case sensitive?
No, HQL LANGUAGE is not case sensitive.
11)Are multi-line comments supported in Hive?
No, Multi-line comments are not supported in Hive.
12)What is the maximum size of string data type supported by Hive?
The Maximum size of string data type supported by Hive is 2 GB.
13)How can you avoid mapreduce is Hive?
To avoid mapreduce is hive, you need to set below property to TRUE.
14)How can you rename a table in Hive?
Below command is used to rename a table in Hive:
Alter Table table_name RENAME TO new_table_name
15)How do you check if a particular partition exists?
You can check using below commands:
SHOW PARTITIONS table_name PARTITION(partitioned_column=’partition_value’)
16)What is a generic UDF in hive?
Generic UDF is created using a java program to server some specific need. If some functions are not covered under the existing functions, we need to use UDF. It can detect the type of input argument pro-grammatically and provide appropriate response.
17)Which classes are used by the Hive to Read and Write HDFS Files?
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.
18)What are the Binary Storage formats supported in Hive?
The Binary Storage formats supported in Hive are Sequence Files, Avro Data files, RCFiles, ORC files, Parquet files while Text File format is default file format.
19)How can client interact with Hive?
There are 3 ways possible by which a Client can interact with the Hive:
- Hive Thrift Client: Any programming language that supports thrift, can interact with HIVE.
- JDBC Driver: To connect to the HIVE Server the BeeLine CLI uses JDBC Driver.
- ODBC Driver: You can connect with different application using ODBC driver to the Hive Server.
20)How Hive organize the data?
There are 3 ways possible in which Hive organizes data.
21)What are the different ways to optimize Hive Performance?
There are different ways to optimize Hive performance:
- Hive Partitioning
- Bucketing in Hive
- Hive Indexing
- Using suitable file format
- Vectorization In Hive
- Cost-Based Optimization (CBO) in Hive
22)What do you mean by Schema on Read?
Schema on Read means Schema is validated while reading the data and not enforced while writing the data.
23)What is the use of explode in Hive?
Explode in Hive is used to convert complex data types into desired table formats. Explode UDTF basically emits all the elements in an array into multiple rows.
24)What is the use of HCatalog?
HCatalog can be used to share data structures with external systems. HCatalog provides access to hive metastore to users of other tools on Hadoop so that they can read and write data to hive’s data warehouse.
25)What are the default record and field delimiter used for hive text files?
The default record delimiter is \n
The field delimiter is 01,02,03
26)Explain the difference between partitioning and bucketing.
- Partitioning and Bucketing of tables is done to improve the query performance.
- Partitioning helps eliminate data when used in WHERE clause.
- Bucketing helps organize data inside the partition into multiple files so that same set of data will always be written in the same bucket. Bucketing helps in joining various columns.
- Partition is created for every unique value of the column while in bucketing, one can limit it to a specific number and the data can then be decomposed in those buckets.
- Partition is a directory while Bucketing is a file.
27)Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
SORT BY –
- Order the data within reducer (Local ordering)
- Data is ordered at each of ‘N’ reducers where the reducers can have overlapping range of data.
- All data to be passed from single reducer.
- Total ordering of the query
- Unnecessary long time to execute for longer datasets
DISTRUBUTE BY –
- Map Output is divided among reducers.
- Each of reducer gets non-overlapping ranges, but doesn’t sort the output of each reducer.
- Distribute by clause come before sort by clause.
- It is a combination of DISTRIBUTE BY and SORT BY
- Each of N reducers gets non overlapping ranges of data which is then sorted by those ranges at the respective reducers.
- Output is global ordering.
28)What is SerDe in Apache Hive?
SreDe is Serializer/Deserializer in hive, SerDe is an acronym. Basically Hive uses for Input Output operation. It also, interprets the results of serialization as individual fields for processing.
29)Can we load data into a view?
No, we can’t load data into a view.
30)What Is The Usefulness Of The Distributed By Clause In Hive?
Distributed By Clause controls how the map output is reduced among the reducers. It is useful in case of streaming data.