While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. It offers extensions not found in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but provides only basic support for indexes. HiveQL historically lacked support for transactions and materialized views, and offered only limited subquery support.
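For instance, a multi-table insert populates several tables from a single scan of a source table, and CREATE TABLE AS SELECT builds a new table from a query result. A minimal sketch of both is shown below, assuming a source table docs and hypothetical target tables short_lines, long_lines, and line_lengths:

-- Multi-table insert: one scan of docs populates two targets
FROM docs
INSERT OVERWRITE TABLE short_lines SELECT line WHERE length(line) < 80
INSERT OVERWRITE TABLE long_lines SELECT line WHERE length(line) >= 80;

-- Create table as select (CTAS)
CREATE TABLE line_lengths AS
SELECT line, length(line) AS len FROM docs;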

Support for INSERT, UPDATE, and DELETE with full ACID functionality was made available with release 0.14.
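A minimal sketch of what this enables, assuming a hypothetical users table and a Hive installation configured for transactions (ACID tables must be bucketed, stored as ORC, and declared transactional):

-- Transactional table required for UPDATE and DELETE
CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO users VALUES (1, 'alice');
UPDATE users SET name = 'bob' WHERE id = 1;
DELETE FROM users WHERE id = 1;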

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
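The execution engine is selected with the hive.execution.engine property, and the EXPLAIN statement shows the stage graph the compiler produces for a query. A minimal sketch, assuming a hypothetical table words:

SET hive.execution.engine=tez;  -- or mr (MapReduce) or spark
EXPLAIN SELECT word, count(1) AS count FROM words GROUP BY word;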

1 Introduction

The following example shows the classic word count program written in Pig:

-- Load each line of the input file as a single chararray field
input_lines = LOAD '/tmp/word.txt' AS (line:chararray);
-- Split each line into words, one word per row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Keep only alphanumeric tokens
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Group the rows by word
word_groups = GROUP filtered_words BY word;
-- Count the occurrences of each word
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- Sort by descending count and store the result
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/results.txt';

The following example shows the classic word count written in HiveQL:

DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
 (SELECT explode(split(line, '\\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

A brief explanation of each of the HiveQL statements is as follows:

  1. Checks whether the table docs exists and drops it if it does. Creates a new table called docs with a single column of type STRING called line.
  2. Loads the specified file or directory (in this case "input_file") into the table. OVERWRITE specifies that the target table is to be overwritten; otherwise the data would be appended (see the sketch after this list).
  3. The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count. It draws its input from the inner query (SELECT explode(split(line, '\\s')) AS word FROM docs) temp, which splits the input lines into individual words, one per row of a temporary result set aliased as temp. GROUP BY word groups the results by word, so the count column holds the number of occurrences of each word in the word column. ORDER BY word sorts the words alphabetically.
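The difference between overwriting and appending in step 2 can be illustrated with a minimal sketch:

-- Appends to whatever docs already contains
LOAD DATA INPATH 'input_file' INTO TABLE docs;

-- Replaces the existing contents of docs
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;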

1.1 A simple example

Let's do some examples to get familiar with HiveQL.

1.1.1 Create a database

hive (default)> create database test;
OK
Time taken: 0.246 seconds

1.1.2 Show databases

hive (default)> show databases;
OK
default
test
Time taken: 0.219 seconds, Fetched: 2 row(s)

1.1.3 Select a database

hive (default)> use test;
OK
Time taken: 0.015 seconds

1.1.4 Create a table

hive (test)>
CREATE TABLE IF NOT EXISTS test_table
(
    col1 int COMMENT 'Integer Column',
    col2 string COMMENT 'String Column'
)
COMMENT 'This is test table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
OK
Time taken: 1.145 seconds
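Because the table stores comma-delimited text, a file whose lines match that format can be loaded directly. A minimal sketch, assuming a hypothetical local file /tmp/test_data.csv containing lines such as 1,aaa:

hive (test)> LOAD DATA LOCAL INPATH '/tmp/test_data.csv' INTO TABLE test_table;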

1.1.5 Insert data into table

hive (test)>
insert into test_table values(1,'aaa');
insert into test_table values(2,'bbb');
Query ID = deister_20180211130223_8aed832f-7f3d-47fc-9d5f-3b087b481f89
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there is no reduce operator
Starting Job = job_1518350430835_0003, Tracking URL = http://iMac-test.local:8088/proxy/application_1518350430835_0003/
Kill Command = /usr/local/Cellar/hadoop/2.8.2/libexec/bin/hadoop job  -kill job_1518350430835_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-02-11 13:02:38,711 Stage-1 map = 0%,  reduce = 0%
2018-02-11 13:02:47,416 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1518350430835_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/deister/warehouse/test_table/.hive-staging_hive_2018-02-11_13-02-23_586_3481836248754644322-1/-ext-10000
Loading data to table default.test_table
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   HDFS Read: 4269 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 26.303 seconds
...

1.1.6 Select data from table

hive (test)> select * from test_table;
OK
1	aaa
2	bbb
Time taken: 0.217 seconds, Fetched: 2 row(s)

As we can see, the insert operations generated MapReduce jobs, while the select did not. If we inspect the Hadoop application console, we can see the two jobs generated by the two insert operations.
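A simple SELECT * only reads the table's files from HDFS, so no job is needed; a query that must aggregate or shuffle data, however, is compiled into a job. A minimal sketch:

hive (test)> select count(*) from test_table;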

1.1.7 List tables

We might expect to see a single table named test_table, but there are two additional temporary tables, one for each INSERT statement executed.

hive (test)> show tables;
OK
test_table
values__tmp__table__1
values__tmp__table__2
Time taken: 0.646 seconds, Fetched: 3 row(s)

Temporary tables like these are created when Hive needs to manage intermediate data during an operation. For example, when an INSERT ... VALUES statement runs against a TEXTFILE table, Hive automatically creates a temporary table to stage the values.

These tables are session-scoped and, as noted, go away when the session ends.
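As an aside, batching the rows into a single INSERT ... VALUES statement (supported since Hive 0.14) produces one temporary table and one job instead of one per statement. A minimal sketch:

hive (test)> insert into test_table values (3,'ccc'), (4,'ddd');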

1.1.8 Drop a database

To drop the database, along with any tables it still contains (hence the CASCADE keyword), simply type:

DROP DATABASE test CASCADE;