1 Before you start

Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop installation using the following command:

$ hadoop version
Hadoop 2.8.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb
Compiled by jdu on 2017-10-19T20:39Z
Compiled with protoc 2.5.0
From source with checksum dce55e5afe30c210816b39b631a53b1d
This command was run using /home/hadoop/hadoop/share/hadoop/common/hadoop-common-2.8.2.jar

1.1 Download Hive

After configuring Hadoop successfully on your Linux system, let's start the Hive setup. First, download the latest Hive binary distribution and extract the archive using the following commands.

$ cd /home/hadoop
$ wget http://archive.apache.org/dist/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
$ tar xzf apache-hive-2.3.2-bin.tar.gz
$ mv apache-hive-2.3.2-bin hive

1.2 Set Environment Variables

Configure your environment variables to use Hive. Edit /home/hadoop/.bash_profile and add the following lines:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_PREFIX=/home/hadoop/hadoop
export HIVE_HOME=/home/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH
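After saving the file, reload the profile (for example with `source /home/hadoop/.bash_profile`) so the variables take effect in the current shell. A minimal sanity check, assuming the paths above, is to export the variables and confirm Hive's bin directory landed on the PATH:

```shell
# Assumed paths from the steps above; adjust if your layout differs.
export HADOOP_HOME=/home/hadoop/hadoop
export HIVE_HOME=/home/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH

# The PATH should now contain Hive's bin directory.
echo "$PATH" | grep -q "$HIVE_HOME/bin" && echo "PATH ok"
```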

1.3 Configure HDFS for Hive

Before running Hive you need to create the warehouse directory in HDFS and make it group-writable (chmod g+w) before creating any table in Hive. Use the following commands.

$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /user/hive/warehouse
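The HDFS listings later in this guide also show a /tmp/hive directory: Hive uses it as its scratch space (the hive.exec.scratchdir property). Hive will usually create it itself, but if startup complains about permissions, creating it up front with group write access is a reasonable extra step:

```shell
$ hdfs dfs -mkdir -p /tmp/hive
$ hdfs dfs -chmod g+w /tmp/hive
```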

2 Hive metastore

Configuring the metastore means telling Hive where its metadata database is stored.

Every Hive installation needs a metastore service, where Hive stores its metadata. It is implemented using tables in a relational database. By default, Hive uses the built-in Derby database. Derby supports only a single process at a time, so with the embedded Derby metastore only one Hive CLI instance can run at once. This is fine when running Hive on a personal machine or for developer tasks, but to use it on a cluster, MySQL or a similar relational database is required.

When you run a Hive query against the default Derby database, you will find that your current working directory now contains a new sub-directory, metastore_db. The metastore is created automatically if it doesn't already exist.

The property of interest here is javax.jdo.option.ConnectionURL. The default value of this property is jdbc:derby:;databaseName=metastore_db;create=true.

This value specifies that you will be using embedded Derby as your Hive metastore, and that the metastore is located in the metastore_db directory.

2.1 Configure hive-site.xml

  • Copy the hive-default.xml.template file as hive-site.xml
    $ cp $HOME/hive/conf/hive-default.xml.template $HOME/hive/conf/hive-site.xml
  • Set the default properties for tmpdir and the user name at the top of $HOME/hive/conf/hive-site.xml
    <property>
        <name>system:java.io.tmpdir</name>
        <value>/tmp/${user.name}/java</value>
    </property>
    <property>
        <name>system:user.name</name>
        <value>${user.name}</value>
    </property>
    If you forget to set these values you will get an exception like the following when running Hive.
    java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D    
    
  • We can also configure the directory where Hive stores table data. The warehouse location is controlled by the hive.metastore.warehouse.dir property in hive-site.xml.
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/${user.name}/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
  • Notice this location points to HDFS, so it must exist before you create any database. Create the warehouse directory for Hive in HDFS:
    $ hadoop fs -mkdir warehouse    
    $ hdfs dfs -ls -R /
    drwx-wx-wx   - deister supergroup          0 2018-02-10 23:31 /tmp
    drwx-wx-wx   - deister supergroup          0 2018-02-10 23:31 /tmp/hive
    drwx------   - deister supergroup          0 2018-02-10 23:33 /tmp/hive/deister
    drwxr-xr-x   - deister supergroup          0 2018-02-10 23:28 /user
    drwxr-xr-x   - deister supergroup          0 2018-02-10 23:32 /user/deister
    drwxr-xr-x   - deister supergroup          0 2018-02-10 23:32 /user/deister/workspace

2.2 Derby metastore

If Derby is not installed on your system, download, install and configure it:

  • Download derby:
    $ wget http://archive.apache.org/dist/db/derby/db-derby-10.14.1.0/db-derby-10.14.1.0-bin.tar.gz
    $ tar xzf db-derby-10.14.1.0-bin.tar.gz
    $ mv db-derby-10.14.1.0-bin db-derby
  • Configure environment variables by editing $HOME/.bash_profile:
    export DERBY_INSTALL=/home/hadoop/db-derby
    export DERBY_HOME=/home/hadoop/db-derby
    export PATH=$DERBY_HOME/bin:$PATH
  • Create the Derby metastore using the following command.
    $ schematool -dbType derby -initSchema
    Metastore connection URL:	 jdbc:derby:;databaseName=metastore_db;create=true
    Metastore Connection Driver :	 org.apache.derby.jdbc.EmbeddedDriver
    Metastore connection User:	 APP
    Starting metastore schema initialization to 2.3.0
    Initialization script hive-schema-2.3.0.derby.sql
    Initialization script completed
    schemaTool completed
  • As you are setting up an embedded Derby metastore database, use the property below as the JDBC URL in your hive-site.xml

    /home/hadoop/hive/conf/hive-site.xml

    <property>
       <name>javax.jdo.option.ConnectionURL</name>
       <value>jdbc:derby:metastore_db;create=true</value>
       <description>JDBC connect string for a JDBC metastore</description>
    </property>
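To verify the schema was created, schematool can also report the metastore version it finds (a quick check, assuming the embedded Derby setup above; run it from the same directory in which metastore_db was created):

```shell
$ schematool -dbType derby -info
```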

2.3 MySQL metastore

  1. Create a metastore database in the MySQL server.
    $ mysql
    mysql> CREATE DATABASE metastore;
    mysql> USE metastore;
    mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'password';
    mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE ON metastore.* TO 'hiveuser'@'localhost';
  2. Add/Edit the following lines in your hive-site.xml

    /home/hadoop/hive/conf/hive-site.xml

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>password</value>
    </property>
    <property>
      <name>datanucleus.fixedDatastore</name>
      <value>false</value>
    </property>
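Two follow-up steps are usually needed with a MySQL metastore: Hive does not ship the MySQL JDBC driver, so the connector jar must be placed in Hive's lib directory (the jar name below is an example; use the version you downloaded), and the schema must be initialized with schematool:

```shell
$ cp mysql-connector-java-5.1.45-bin.jar /home/hadoop/hive/lib/
$ schematool -dbType mysql -initSchema
```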

3 Running Hive

You can start hive by running

$ hive
...
Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.3.1/libexec/lib/hive-common-2.3.1.jar!/hive-log4j2.properties Async: true
hive>

Or in debug mode

$ hive -hiveconf hive.root.logger=DEBUG,console

Once connected to Hive, you can run some commands to test that it's running OK.

hive> show databases;
OK
default
Time taken: 9.579 seconds, Fetched: 1 row(s)
Hive contains a default database named default.

3.1 Create a database

Now, create a test database, then list the contents of HDFS.

hive> create database test;
OK
Time taken: 0.399 seconds
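To check that the new database is usable, you can create a table inside it (the table name and columns here are just an example):

```shell
hive> use test;
hive> create table messages (id int, msg string);
hive> show tables;
```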

To make the Hive CLI (command line interface) show the current database, type:

set hive.cli.print.current.db=true;
hive (default)>

To make this setting persistent, edit hive-site.xml and set the property

<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>
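Alternatively, per-user CLI settings can go in $HOME/.hiverc, a file the Hive CLI executes at startup; this is shown here as an alternative to editing hive-site.xml:

```shell
# Append the setting to the per-user Hive CLI startup file.
echo 'set hive.cli.print.current.db=true;' >> "$HOME/.hiverc"
```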

You can see test.db under the warehouse directory

$ hdfs dfs -ls -R /
drwx-wx-wx   - deister supergroup          0 2018-02-10 23:31 /tmp
drwx-wx-wx   - deister supergroup          0 2018-02-10 23:31 /tmp/hive
drwx------   - deister supergroup          0 2018-02-10 23:35 /tmp/hive/deister
drwx------   - deister supergroup          0 2018-02-10 23:35 /tmp/hive/deister/3db7219f-d599-46be-b195-163f93374c8c
drwx------   - deister supergroup          0 2018-02-10 23:35 /tmp/hive/deister/3db7219f-d599-46be-b195-163f93374c8c/_tmp_space.db
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:28 /user
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:37 /user/deister
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:37 /user/deister/warehouse
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:37 /user/deister/warehouse/test.db
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:32 /user/deister/workspace

4 Troubleshooting

4.1 Can't start Hive because Hadoop is in safe mode

If Hadoop is in safe mode, Hive will not be able to start, throwing an exception like

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /tmp/hive/hadoop/

To bring Hadoop out of safe mode, type:

$ hdfs dfsadmin -safemode leave
Safe mode is OFF