Site icon DesignLinux

How to Install and Configure Hive with High Availability – Part 7

Hive is a Data Warehouse model in Hadoop Eco-System. It can perform as an ETL tool on top of Hadoop. Enabling High Availability (HA) on Hive is not similar as we do in Master Services like Namenode and Resource Manager.

Automatic failover will not happen in Hive (Hiveserver2). If any Hiveserver2 (HS2) fails, running jobs on that failed HS2 will get fail. We need to resubmit the job so that the job can run on other HiveServer2. So, enabling HA on HS2 is nothing but, increasing the number of HS2 components in Cluster.

In this article, we will see the steps to install and enable the High Availability of Hive.

Requirements

Let’s get started…

Hive Installation and Configuration

1. Login to Cloudera Manager at the below URL and navigate to Cloudera Manager –> Add Service.

http://13.233.129.39:7180/cmf/home
Add Service in Cloudera Manager

2. Select the service ‘Hive‘.

Choose Hive Service

3. Assign the services on nodes.

Once selected the servers, click ‘Continue‘ to proceed.

Assign Service as Nodes

4. Hive Metastore needs an underlying Database for storing Metadata. Here we are using the default PostgreSQL database which is inbuilt with CDH.

Below mentioned database details will be entered automatically, ‘Test Connection’ will be skipped as the mentioned database will be created on the fly. In real-time, we need to create the Database in the external database and test the connection to proceed further. Once done, please click the ‘Continue’.

Setup Database

5. Configure the Hive Warehouse directory, /user/hive/warehouse is the default directory path for storing Hive tables. Click the ‘Continue’.

Choose Hive Warehouse Directory

6. Installation of Hive is started.

Hive Installation Progress

7. Once installation completed, you can get the ‘Finished’ status. Click ‘Continue’ to proceed further.

Hive Installation Finished

8. Hive installation and Configuration completed successfully. Click ‘Finish‘ to complete the installation procedure.

Finish Hive Installation

9. You can see the Hive service added in Cluster through Cloudera Manager Dashboard.

Hive Service Added

10. You can view the Hiveserver2 in Instances of Hive. We have added Hiveserver2 in master1.

Cloudera Manager –> Hive –> Instances –> Hiveserver2.

View Hiveserver2 Instances

Enabling High Availability on Hive

11. Next add Hive role by going to Cloudera Manager –> Hive –> Actions –> Add Role Instances.

Add Hive Role Instance

12. Select the servers where you want to place extra Hiveserver2. You can add more than two, there is no limit. Here we are adding an extra one Hiveserver2 in master2.

Choose Server for Hive
Choose Host Server

13. Once selected the server, click ‘Continue’.

Server Added

14. A Hiverserver2 will be added into the Hive Instances, you need to start it by going to Cloudera Manager –> Hive –> Instances –> (Select Hiveserver2 added newly) –> Action for Selected –> Start.

Choose Hive Server
Start Hive Server
Start the Hive Server

15. Once Hiveserver2 started on master2, you will get the status ‘Finished’. Click Close.

Status Finished

16. You can view, both the Hiveserver2s are running.

Verify Hive Servers Status

Verifying the Hive Availability

We can connect the Hiveserver2 through the beeline which is a thin client and command-line. It uses the JDBC driver to establish the connection.

17. Login to Server where Hive Gateway is running.

[[email protected] ~]$ beeline
Connect to Hiveserver2

18. Enter the JDBC connection string to connect the Hiveserver2. In this connection, the string we are mentioning the Hiverserver2 (master2) with its default port number 10000. This connection string will only connect to the Hiveserver2 which is running on master2.

beeline> !connect "jdbc:hive2://master1.tecmint.com:10000"
JDBC Connection String

19. Run a sample query.

0: jdbc:hive2://master1.tecmint.com:10000> show databases;
Run Sample Query

This is the default database that comes inbuilt.

20. Use the below command to terminate the Hive session.

0: jdbc:hive2://master1.tecmint.com:10000> !quit
Quit Hive Session

21. You can use the same way to connect Hiveserver2 running on master2.

beeline> !connect "jdbc:hive2://master2.tecmint.com:10000"
Connect to Hive server

23. We can connect the Hiveserver2 in Zookeeper Discovery mode. In this method, we don’t need to mention the Hiveserver2 in the connection string instead we are using Zookeeper to discover the available Hiveserver2.

Here we can use a third-party load balancer to balance the load among the available Hiverserver2. The below configuration is need to enable Zookeeper Discovery Mode by going to Cloudera Manager –> Hive –> Configuration.

Enable Zookeeper Discovery Mode

24. Next, search the property “HiveServer2 Advanced Configuration Snippet” and click the + symbol to add the below property.

Name : hive.server2.support.dynamic.service.discovery
Value : true
Description : <any description>
HiveServer2 Advanced Configuration Snippet

25. Once entered the property, click ‘Save Changes’.

Add Property

26. As we made changes to the Configuration, need to restart the affected services by clicking on the Orange Color Symbol to restart the services.

Restart Services

27. Click ‘Restart Stale‘ services.

Restart Stale Services

28. There are two options available. If the cluster is in live production, we need to prefer the rolling restart to minimize the outage. As we are newly installing, we can choose the second option ‘Re-deploy Client Configuration’, and click ‘Restart Now’.

Re-deploy Client Configuration

29. Once the restart completed successfully, you will be getting the status ‘Finished’. Click ‘Finish’ to complete the process.

Finish The Process

30. Now we will connect the Hiveserver2 using Zookeeper Discovery mode. In the JDBC connection, the string we need to use the Zookeeper servers with its port number 2081. Collect the Zookeeper servers by going to Cloudera Manager –> Zookeeper –> Instances –> (Note down the server names).

Zookeeper Servers

These are the three servers having Zookeeper, 2181 is the port number.

master1.tecmint.com:2181
master2.tecmint.com:2181
worker1.tecmint.com:2181

31. Now get into beeline.

[[email protected] ~]$ beeline
Connect to Beeline

32. Enter the JDBC connection string as mention below. We have to mention the Service Discovery Mode and Zookeeper Namespace. ‘hiveserver2’ is the default Namespace of Hiveserver2.

beeline>!connect "jdbc:hive2://master1.tecmint.com:2181,master2.tecmint.com:2181,worker1.tecmint.com:2181/;serviceDiscoveryMode=zookeeper;zookeeperNamespace=hiveserver2"
Enter JDBC Connection String

33. Now the session is connected to Hiveserver2 running on master1. Run a sample query to validate. Use the below command to create a database.

0: jdbc:hive2://master1.tecmint.com:2181,mast> create database tecmint;
Create Database

34. Use the below command to list the database.

0: jdbc:hive2://master1.tecmint.com:2181,mast> show databases;
List Database

35. Now we will validate the High Availability in Zookeeper Discovery Mode. Go to Cloudera Manager and stop the Hiveserver2 on master1 that we have tested above.

Cloudera Manager –> Hive –> Instances –> (select Hiveserver2 on master1) –> Action for selected –> Stop.

Choose Hive Server
Stop Hive Server

36. Click the ‘Stop’. Once stopped, you will be getting the status ‘Finished’. Verify the Hiveserver2 on master1 by navigating into Hive –> Instances.

Stop Hive server
Verify Hive Server

37. Get into the beeline and connect the Hiveserver2 using the same JDBC connection string with Zookeeper Discovery Mode as we did in the above steps.

[[email protected] ~]$ beeline

beeline>!connect "jdbc:hive2://master1.tecmint.com:2181,master2.tecmint.com:2181,worker1.tecmint.com:2181/;serviceDiscoveryMode=zookeeper;zookeeperNamespace=hiveserver2"
Connect the Hiveserver2

Now you will be connected to Hiveserver2 running on master2.

38. Validate with a sample query.

0: jdbc:hive2://master1.tecmint.com:2181,mast> show databases;
Validate Sample Query
Conclusion

In this article, we have gone through the detailed steps to have the Hive Data Warehouse model in our Cluster with High Availability. In a real-time production environment, more than three Hiveserver2 will be placed with Zookeeper Discovery Mode enabled.

Here, all the Hiveserver2’s are registering with Zookeeper under a common Namespace. Zookeeper Dynamically discovers the available Hiveserver2 and establishes the Hive session.

Exit mobile version