Project structure

In Data Brewery your data environment is described by a set of XML files, some HOCOM (Human-Optimized Config Object Notation) files (think JSON on steroid) and SQL files.

One file project

In its simplest form, you can have the following file dw.xml that define a datastore and a module.

<datawarehouse>
  <datastore name="dw" type="h2" host="file"/>

  <module name="db" datastore="dw">
    <table name="dummy">
      <source type="query">select current_date</source>
    </table>
  </module>
</datawarehouse>

To help you get started, you can initiate a working seed (the example which will be detailled in the next pages) by calling the following command in a new directory :

ipa init simple

There is also a seed for this tutorial, so you can just call :

ipa init earthquake

Which will create the dw.xml file used for the rest of the tutorial and the final Excel report template.

You can continue to read this page to know more about complex project structuration or go directly to the first tutorial step accessing data with datastores.

Complex projects

While it work fine, when the projects gets bigger it make sense to break the file in many part.

Separated configuration

The first step is to introduce configurations files. By default, ipa read the dw.conf file in the current path. It allows to override existing or missing parameters.

For instance, having the following dw.xml and dw.conf files, the password for the mydb datastore will be “the_real_password” instead of the one declared in the XML.

<!-- dw.xml -->
<datawarehouse>
  <datastore name="mydb" type="postgresql" host="mydb.myproject.com" user="admin" password="mypassword"/>
</datawarehouse>
# dw.conf
dw.datastore.mydb {
  password = "the_real_password"
}

Environment configuration

Now, you may want to manage multiple environments. For instance, you might have a development environment where your data warehouse is stored on your local computer and using a production environment with the dedicated data warehouse. You can easily manage that scenario by using different configurations files. The dw.conf file will reference low sensitivity parameters while the prod.conf will reference the default dw.conf parameters and overload some of them with the productions parameters.

# dw.conf (dev)
dw.datastore.dw {
  host = "localhost"
  database = "datawarehouse"
  password = "the_real_password"
}
# conf/prod.conf (prod)

include "../dw.conf"

dw.datastore.dw {
  host = "dw.myproject.com"
  database = "datawarehouse"
  password = "the_real_password"
  sshUser = "myid"
  sshPrivateKeyLocation = "keys/etl"
}

In order run a process with ipa, you can set the main configuration file with the following command line to execute the nightly process in production mode.

ipa run-process -c conf/prod.conf nightly

Externalize SQL queries

SQL queries are core to Data Brewery projects. While it is possible to keep them in the project file, a common approach is to externalize the big ones (> 5 lines) to their own SQL file. Being XML, the project file doesn’t allow use of some common characters like ‘<’. One way is to protect the SQL query with CDATA block, but putting them in a distinct SQL file is easier and enable syntax highlightning.

To externalize a SQL query, you just need to link the file with the contentPath attribute. The path is relative to the file where it is located.

<module name="business" datastore="dw">
  <table name="d_date">
    <source type="query" contentPath="business/d_date.sql"/>
  </table>
</module>
-- business/d_date.sql
select to_char(datum,'yyyymmdd')::int as date_key,
  datum as "date",
  extract(epoch from datum)::bigint as epoch,
  to_char(datum,'Day') as day_name,
  extract(isodow from datum)::int as day_of_week,
  extract(day from datum)::int as day_of_month
  -- ...
from (select '2000-01-01'::date + day as datum
  from generate_series (0,29219) as day) t
order by 1 desc;

Project file breakdown

Finally, it is also possible to split the project file (the XML) in many part.

For that, you just need to insert an include element to link to a child file which will replace the include element.

For instance, the following project file :

<datawarehouse>
  <datastore name="dw" type="h2">
    <!-- The content -->
  </datastore>
</datawarehouse>

is equivalent to the combination of the next two ones.

<datawarehouse>
  <include path="store/dw.xml"/>
</datawarehouse>
<!-- store/dw.xml -->
<datastore name="dw" type="h2">
  <!-- The content -->
</datastore>

It is a good practice to keep project file below 100 lines of code.

Inside the path attribute, you can use basic pattern matching with the ‘*’ character. For instance, “dir//.xml” will include every file that is inside a directory in the “dir” directory and ends with “.xml”.

Full project structuration

By providing all the previously mentionned project structuration mecanism, Data Brewery encourage a lot a flexibility.

To wrap up, let’s present the idiomatic layout a complex project should have.

  • conf : configuration environment files (test.conf, prod.conf)

  • key : location to store SSH keys and others authentification mecanisms

  • module : location for modules

    • mod1.xml : Module mod1 definition

    • mod1 : directory for mod1 SQL files

      • d_date.sql : SQL file for the date dimension

      • f_kpi.sql : SQL file for the

  • store : location for datastores

  • dw.conf : Default configuration file (dev environment)

  • dw.xml : The main project file.

You can initialize such a structured project with the following command :

ipa init complex

Let’s now see the first step in the process : accessing data with datastores.