Tajo : A Distributed Data Warehouse System on Large Clusters



  • [2014-04-01] Tajo became an Apache Top-Level Project.
  • [2013-04-08] Tajo was demonstrated at ICDE 2013.
  • [2013-03-07] Tajo entered the Apache Incubator.

What is Tajo?

Tajo is a relational, distributed data warehouse system for Hadoop. It is designed for low-latency, scalable ad-hoc queries, online aggregation, and ETL on large data sets, and it leverages advanced database techniques to that end. It supports the SQL standard. Tajo uses HDFS as its primary storage layer and has its own query engine, which gives it direct control over distributed execution and data flow. As a result, Tajo can choose among a variety of query evaluation strategies and has more opportunities for optimization.


  • Scalability – Tajo uses the Hadoop Distributed File System (HDFS) as its primary storage layer. It combines the advantages of MapReduce and shared-nothing parallel databases to achieve scalability.
  • Low latency – We have two goals for low-latency queries. The first is to let users get estimates of an aggregate query in an online fashion, as soon as the query is submitted. This is useful when a user wants a quick picture rather than exact results. The second is efficient query processing, which we achieve with a variety of query evaluation strategies, query optimization, a high-throughput engine, and efficient I/O.
  • In-situ processing – HDFS commonly serves as the centralized data store for data-intensive computing, and collected log data and data streams are usually stored in it. Tajo provides a scalable, low-latency means to process this data in place, without ETL or additional data loading.
  • Fault tolerance – Processing big data also requires long-running queries. Tajo supports fault tolerance so that a failure during a query does not force a complete query restart.
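
The first low-latency goal, returning estimates of an aggregate while the query is still running, can be illustrated with a small sketch. This is not Tajo's implementation: the function name `online_average`, its parameters, and the synthetic data are all hypothetical, and the code assumes the rows arrive in random order so that the rows seen so far behave like a random sample. It keeps a running mean and variance (Welford's algorithm) and reports an approximate 95% error bound that narrows as more rows are scanned.

```python
import math
import random


def online_average(rows, report_every=1000, z=1.96):
    """Yield (rows_seen, running_mean, half_width) while scanning rows.

    Hypothetical illustration of online aggregation, not Tajo code.
    half_width is an approximate confidence bound that assumes the rows
    seen so far are a random sample; it ignores the finite-population
    correction, so it is conservative near the end of the scan.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for value in rows:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        if count > 1 and count % report_every == 0:
            variance = m2 / (count - 1)
            half_width = z * math.sqrt(variance / count)
            yield count, mean, half_width


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic column values, shuffled to mimic a random scan order.
    data = [rng.uniform(0, 100) for _ in range(10_000)]
    rng.shuffle(data)
    for seen, estimate, err in online_average(data, report_every=2000):
        print(f"{seen:>6} rows: avg ~ {estimate:.2f} +/- {err:.2f}")
```

A user who only needs a quick picture can stop after the first few estimates, while a user who needs the exact answer lets the scan finish, at which point the running mean equals the true average.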



  • Hyunsik Choi, Jihoon Son, Haemi Yang, Hyoseok Ryu, and Yon Dohn Chung, "Tajo: A Distributed Data Warehouse System on Large Clusters" (demo), 29th IEEE International Conference on Data Engineering (ICDE), Brisbane, Australia, April 8-12, 2013. (PDF) (Poster)

See Also