Handling Big Data with HBase Part 2: First Steps
Posted on December 11, 2013 by Scott Leberknight
This is the second in a series of blogs that introduce Apache HBase. In the first blog, we introduced HBase at a high level. In this part, we'll see how to interact with HBase via its command line shell.
Let's take a look at what working with HBase is like at the command line. HBase comes with a JRuby-based shell that lets you define and manage tables, execute CRUD operations on data, scan tables, and perform maintenance, among other things. When you're in the shell, just type help to get an overall help page. You can also get help on specific commands or groups of commands, using syntax like help '<group>' and help '<command>'. For example, help 'create' provides help on creating new tables.
While HBase is deployed in production on clusters of servers, you can download it and be up and running with a standalone installation in minutes. The first thing to do is fire up the HBase shell. The following listing shows a shell session in which we create a blog table, list the available tables in HBase, add a blog entry, retrieve that entry, and scan the blog table.
$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.96.0-hadoop2, r1531434, Fri Oct 11 15:28:08 PDT 2013
hbase(main):001:0> create 'blog', 'info', 'content'
0 row(s) in 6.0670 seconds
=> Hbase::Table - blog
hbase(main):002:0> list
TABLE
blog
fakenames
my-table
3 row(s) in 0.0300 seconds
=> ["blog", "fakenames", "my-table"]
hbase(main):003:0> put 'blog', '20130320162535', 'info:title', 'Why use HBase?'
0 row(s) in 0.0650 seconds
hbase(main):004:0> put 'blog', '20130320162535', 'info:author', 'Jane Doe'
0 row(s) in 0.0230 seconds
hbase(main):005:0> put 'blog', '20130320162535', 'info:category', 'Persistence'
0 row(s) in 0.0230 seconds
hbase(main):006:0> put 'blog', '20130320162535', 'content:', 'HBase is a column-oriented...'
0 row(s) in 0.0220 seconds
hbase(main):007:0> get 'blog', '20130320162535'
COLUMN CELL
content: timestamp=1386556660599, value=HBase is a column-oriented...
info:author timestamp=1386556649116, value=Jane Doe
info:category timestamp=1386556655032, value=Persistence
info:title timestamp=1386556643256, value=Why use HBase?
4 row(s) in 0.0380 seconds
hbase(main):008:0> scan 'blog', { STARTROW => '20130300', STOPROW => '20130400' }
ROW COLUMN+CELL
20130320162535 column=content:, timestamp=1386556660599, value=HBase is a column-oriented...
20130320162535 column=info:author, timestamp=1386556649116, value=Jane Doe
20130320162535 column=info:category, timestamp=1386556655032, value=Persistence
20130320162535 column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0390 seconds
In the above listing we first create the blog table with column families info and content. After listing the tables and seeing our new blog table, we put some data in the table. The put commands specify the table, the unique row key, the column key composed of the column family and a qualifier, and the value. For example, info is the column family while title and author are qualifiers, and so info:title specifies the column title in the info family with value "Why use HBase?". The info:title is also referred to as a column key. Next we use the get command to retrieve a single row and finally the scan command to perform a scan over rows in the blog table for a specific range of row keys. As you might have guessed, by specifying start row 20130300 (inclusive) and end row 20130400 (exclusive) we retrieve all rows whose row key falls within that range; in this blog example this equates to all blog entries in March 2013 since the row keys are the time when an entry was published.
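To make the range semantics concrete, here is a minimal sketch in plain Python (no HBase involved) of how a scan with an inclusive start row and an exclusive stop row picks rows out of a key-sorted table. The scan and split_column_key helpers and the February and April rows are made up for illustration; only the March entry comes from the listing above. HBase compares row keys as bytes, which for these all-digit keys gives the same order as Python's string comparison.

```python
from bisect import bisect_left

# Toy in-memory stand-in for the 'blog' table: row key -> {column key -> value}.
# This models the semantics only; it is not HBase client code.
blog = {
    "20130215101500": {"info:title": "Hypothetical February post"},
    "20130320162535": {"info:title": "Why use HBase?", "info:author": "Jane Doe"},
    "20130401090000": {"info:title": "Hypothetical April post"},
}

def split_column_key(column_key):
    """Split 'family:qualifier' into (family, qualifier); the qualifier may be empty."""
    family, _, qualifier = column_key.partition(":")
    return family, qualifier

def scan(table, start_row, stop_row):
    """Return (row_key, columns) pairs with start_row <= row_key < stop_row."""
    keys = sorted(table)                # rows are kept in row key order
    lo = bisect_left(keys, start_row)   # first key >= start_row (inclusive)
    hi = bisect_left(keys, stop_row)    # first key >= stop_row (exclusive bound)
    return [(key, table[key]) for key in keys[lo:hi]]

march = scan(blog, "20130300", "20130400")
print([key for key, _ in march])        # ['20130320162535']
print(split_column_key("info:title"))   # ('info', 'title')
```

Because the keys sort as text, a prefix such as 20130300 works as a lower bound even though no row has exactly that key, which is why timestamp-style row keys lend themselves to date range scans.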
An important characteristic of HBase is that you define column families up front, but you can then add any number of columns within a family, each identified by its column qualifier. HBase stores the columns of a family together on disk, and a column that does not exist for a row takes up no space, unlike in a typical RDBMS, where a null value still has to be represented in the row. Rows are defined by the columns they contain; if a row has no columns then, logically, it does not exist. Continuing the above example, in the following listing we delete a specific column from a row.
hbase(main):009:0> delete 'blog', '20130320162535', 'info:category'
0 row(s) in 0.0490 seconds
hbase(main):010:0> get 'blog', '20130320162535'
COLUMN CELL
content: timestamp=1386556660599, value=HBase is a column-oriented...
info:author timestamp=1386556649116, value=Jane Doe
info:title timestamp=1386556643256, value=Why use HBase?
3 row(s) in 0.0260 seconds
As shown just above, you can delete a specific column from a row, as we did with the info:category column. You can also delete all columns within a row, and thereby delete the row, using the deleteall shell command. To update column values, you simply use the put command again. HBase can retain multiple versions of a column value: the default was three versions before HBase 0.96 and is one from 0.96 onward, and you can set it per column family with the VERSIONS attribute. If the family keeps more than one version and you put a new value into info:title, HBase will retain both the old and new version.
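The following is a rough sketch, again in plain Python rather than HBase client code, of how a bounded version history and column deletes behave. ToyTable and its max_versions argument are invented for this illustration (max_versions stands in for the column family's VERSIONS setting), the counter replaces real timestamps, and the "(revised)" title is a made-up value. Real HBase marks deletes with tombstones and cleans up later; the sketch simply removes the data.

```python
import itertools

class ToyTable:
    """Toy model of versioned cells; max_versions is a parameter of this sketch."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}                     # row key -> {column key -> [(ts, value), ...]}
        self._clock = itertools.count(1)   # stand-in for millisecond timestamps

    def put(self, row, column, value):
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.insert(0, (next(self._clock), value))  # newest first
        del versions[self.max_versions:]                # drop versions beyond the limit

    def get(self, row, column, versions=1):
        cells = self.rows.get(row, {}).get(column, [])
        return [value for _, value in cells[:versions]]

    def delete(self, row, column):
        columns = self.rows.get(row, {})
        columns.pop(column, None)          # this sketch removes every version
        if not columns:                    # a row with no columns no longer exists
            self.rows.pop(row, None)

    def deleteall(self, row):
        self.rows.pop(row, None)

table = ToyTable(max_versions=3)
table.put("20130320162535", "info:title", "Why use HBase?")
table.put("20130320162535", "info:title", "Why use HBase? (revised)")
print(table.get("20130320162535", "info:title"))              # newest value only
print(table.get("20130320162535", "info:title", versions=3))  # both versions
table.delete("20130320162535", "info:title")
print("20130320162535" in table.rows)                         # False: row is gone
```

A plain get returns only the newest value, so older versions stay out of the way unless you ask for them explicitly.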
The commands issued in the above examples show how to create, read, update, and delete data in HBase. Data retrieval comes in only two flavors: retrieving a single row using get and retrieving multiple rows via scan. When retrieving data in HBase you should take care to retrieve only the information you actually require. Since HBase stores and retrieves each column family separately, if you only need data from one column family, you can ask for just that family. In the next listing we retrieve only the blog titles for a row key range that equates to March through April 2013.
hbase(main):011:0> scan 'blog', { STARTROW => '20130300', STOPROW => '20130500', COLUMNS => 'info:title' }
ROW COLUMN+CELL
20130320162535 column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0290 seconds
So by setting row key ranges, restricting the columns you need, and restricting the number of versions to retrieve, you can optimize data access patterns in HBase. Of course, in the above examples all of this is done from the shell, but you can do the same things, and much more, using the HBase APIs.
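As a small illustration of restricting columns, this plain Python sketch keeps only the cells that match a list of wanted columns, where each entry is either a full column key such as info:title or a bare family name such as info. The select_columns helper is hypothetical and filters a dictionary after the fact; in HBase the restriction is part of the request, so column families you did not ask for need not be read at all.

```python
def select_columns(columns, wanted):
    """Keep only cells whose column key matches an entry in `wanted`.

    `wanted` is a list whose entries are either a full column key
    ('info:title') or a bare family name ('info'), which matches every
    qualifier in that family.
    """
    selected = {}
    for column_key, value in columns.items():
        family = column_key.partition(":")[0]
        if column_key in wanted or family in wanted:
            selected[column_key] = value
    return selected

# The row from the listings above, after info:category was deleted.
row = {
    "content:": "HBase is a column-oriented...",
    "info:author": "Jane Doe",
    "info:title": "Why use HBase?",
}

print(select_columns(row, ["info:title"]))  # {'info:title': 'Why use HBase?'}
print(select_columns(row, ["info"]))        # both info columns, no content
```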
Conclusion to Part 2
In this second part of the HBase introductory series, we saw how to use the shell to create tables, insert data, retrieve data by row key, and run a basic scan over a row key range. We also saw how to delete a specific column from a table row.
In the next blog, we'll get an overview of HBase's high level architecture.