While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and CREATE TABLE AS SELECT, but only basic support for indexes. HiveQL lacked support for transactions and materialized views, and offered only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Word count program example in Pig:
input_lines = LOAD '/tmp/word.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/results.txt';
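Assuming the script above is saved in a file (the name wordcount.pig below is a hypothetical choice), it can be run with the Pig client; the -x flag selects the execution mode, here local mode for testing without a cluster:

pig -x local wordcount.pig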
The following example shows the classic word count example written in HiveQL:
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
 (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
A brief explanation of each of the HiveQL statements is as follows:
- Checks if table docs exists and drops it if it does.
- Creates a new table called docs with a single column of type STRING called line.
- Loads the specified file or directory (in this case 'input_file') into the table. OVERWRITE specifies that the target table into which the data is being loaded is to be re-written; otherwise the data would be appended.
- The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count. This query draws its input from the inner query (SELECT explode(split(line, '\s')) AS word FROM docs) temp, which splits the input words into different rows of a temporary table aliased as temp. The GROUP BY word groups the results based on their keys. This results in the count column holding the number of occurrences for each word of the word column. The ORDER BY word sorts the words alphabetically.
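Once word_counts exists, it can be queried like any other table. As a sketch, the most frequent words could be retrieved as follows (the LIMIT of 10 is an arbitrary choice):

SELECT word, count FROM word_counts ORDER BY count DESC LIMIT 10;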
1.1 A simple example
Let's do some examples to get familiar with HiveQL.
1.1.1 Create a database
hive (default)> create database test;
1.1.2 Show databases
hive (default)> show databases;
1.1.3 Select a database
hive (default)> use test;
1.1.4 Create a table
hive (test)> CREATE TABLE IF NOT EXISTS test_table (
               col1 int COMMENT 'Integer Column',
               col2 string COMMENT 'String Column'
             )
             COMMENT 'This is test table'
             ROW FORMAT DELIMITED
             FIELDS TERMINATED BY ','
             STORED AS TEXTFILE;
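To verify the definition, Hive can print the table's schema; adding FORMATTED also shows storage details such as the row format and file format chosen above:

hive (test)> describe formatted test_table;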
1.1.5 Insert data into table
hive (test)> insert into test_table values(1,'aaa');
hive (test)> insert into test_table values(2,'bbb');
1.1.6 Select data from table
hive (test)> select * from test_table;
We can see that each insert operation generated a MapReduce job but the select did not. If we inspect the Hadoop application console, we can see the two jobs generated by the two insert operations.
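Because each single-row INSERT launches its own job, bulk data is usually loaded from a file instead. Since test_table is a comma-delimited TEXTFILE, a local CSV file (the path below is a hypothetical example) can be loaded in one statement:

hive (test)> LOAD DATA LOCAL INPATH '/tmp/test_data.csv' INTO TABLE test_table;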
1.1.7 List tables
We might expect to see only a table named test_table, but we will also see two additional temporary tables, one for each row inserted.
hive (test)> show tables;
Temp tables like these are created when Hive needs to manage intermediate data during an operation. For example, for a table of type TEXTFILE, a temporary table gets created automatically when you do an INSERT. These temporary tables are session-scoped and, as noted, go away when the session ends.
1.1.8 Drop database
To drop the database, simply type:
DROP DATABASE test CASCADE;