This article covers how to build a simple ActiveWarehouse data warehouse using the Subversion logs from the Rails project. This article assumes that you have developed a Rails application before and that you understand the basics of how Rails works. If you need to learn more about Rails in general then I suggest visiting the Ruby on Rails website and get working knowledge of Rails before trying to dive into building a data warehouse on it.

Step 1 – Get the Raw Data

The first step is to get the raw data out of the Subversion repository.

svn log http://dev.rubyonrails.org/svn/rails/trunk --xml -v --revision 1:"HEAD" > ~/rails-log.xml

This will export the Subversion log starting at revision 1 and going all the way to the HEAD revision, in XML format, with the file changes included. The data will be saved in the file rails-log.xml in the current user’s home directory.

Step 2 – Design the Warehouse

Now that we have the data we can design the data warehouse. There are various books available which cover the ins and outs of developing data warehouses, however ActiveWarehouse primarily uses techniques described in The Data Warehouse Toolkit.

A data warehouse consists primarily of facts and dimensions. Facts represent discreet facts which can be pulled from the raw data whereas dimensions are ways of slicing and dicing the facts.

First the dimensions:

  • Date
  • File
    • Path
    • Filename
    • Extension
    • File Type
    • Framework (ActiveRecord, ActionPack, etc)
  • Change Type
    • Type Code
    • Type Description
  • Author
    • Name

The grain of the fact table will be one entry per file comitted.

The fact table will initially be a factless fact table:

  • Date ID (FK)
  • File ID (FK)
  • Change Type ID (FK)
  • Author ID (FK)
  • Revision (DD)
  • File Changed (will always be 1)

Note: FK is a foreign key (to the dimension) and DD is a degenerative dimension (which is a fancy name for a dimension which really has no values – it is reserved for transactional IDs).

In the future we will augment this fact table to include the number of lines which were added and deleted.

Step 3 – Construct The Warehouse

First create a fresh Rails app:

rails rails_warehouse

Next install the ActiveWarehouse plugin. Change into the rails_warehouse directory which was just created and execute:

script/plugin install svn://rubyforge.org/var/svn/activewarehouse/activewarehouse/trunk

You must have the Subversion commnd line client installed for this to work.

Create a database and/or modify your config/database.yml file. Then run the warehouse migration. This will create any tables required by ActiveWarehouse:

rake warehouse:migrate

Now let’s build the warehouse objects based on the design we created in Step 2.

script/generate dimension date
script/generate dimension file
script/generate dimension change_type
script/generate dimension author

The DateDimension, FileDimension, ChangeTypeDimension and AuthorDimension classes will be created in the app/models directory along with migration and unit test stubs. Note that dimension tables are in singular form. This fits with common practices and makes the most sense. As always though you could change it by setting the table name yourself in the migrations and in the model.

Let’s go ahead and modify the migrations now.

Open db/migrate/001_create_date_dimension.rb. The migration stub will look like this:

class CreateDateDimension < ActiveRecord::Migration
  def self.up
    fields = {
      # Add dimension attributes here as name => type
      # Example: :store_name => :string
    }
    create_table :date_dimension do |t|
      fields.each do |name,type|
        t.column name, type
      end
    end
    fields.each do |name,type|
      add_index :date_dimension, name unless type == :text      
    end
  end

  def self.down
    drop_table :date_dimension
  end
end

All we have to do is add the fields to the fields map in the self.up for now:

    fields = {
      :date => :string,
      :day_of_week => :string,
      :day_number_in_calendar_month => :integer,
      :day_number_in_calendar_year => :integer,
      :calendar_week => :string,
      :calendar_week_ending_date => :string,
      :calendar_week_number_in_year => :integer,
      :calendar_month_name => :string,
      :calendar_month_number_in_year => :integer,
      :calendar_year_month => :string,
      :calendar_quarter => :string,
      :calendar_year_quarter => :string,
      :calendar_year => :string,
      :major_event => :string,
      :sql_date_stamp => :date
    }

This is a fairly simple date dimension. It only deals with calendar year but that should be sufficient for our needs at the moment. If additional dimension attributes are needed they can easily be added later without needing to change the fact table, which is where most of the data will reside.

The field map of our FileDimension migration looks like this:

    fields = {
      :path => :string,
      :directory => :string,
      :file_name => :string,
      :file_base => :string,
      :file_type => :string,
      :extension => :string,
      :framework => :string
    }

The ChangeTypeDimension field map:

    fields = {
      :change_type_code => :string,
      :change_type_description => :string
    }

And finally the AuthorDimension field map:

    fields = {
      :name => :string
    }

In the future we may want to work a bit more on the author dimension and differentiate between an author and a committer, but for now this will do.

Next let’s generate the fact model:

script/generate fact file_revision

Just like the dimension generator, the fact generator will create the model in app/models in the file file_revision_fact.rb. It will also create migration and unit test stubs. Let’s work on the fact migraion now. The fact migration currently looks like any other ActiveRecord migration. Here is what it will look like after our changes:

class CreateFileRevisionFacts < ActiveRecord::Migration
  def self.up
    create_table :file_revision_facts do |t|
      t.column :date_id, :integer, :null => false
      t.column :file_id, :integer, :null => false
      t.column :change_type_id, :integer, :null => false
      t.column :author_id, :integer, :null => false
      t.column :revision, :string, :null => false
      t.column :file_changed, :integer, :default => 1
    end
    add_index :file_revision_facts, :date_id
    add_index :file_revision_facts, :file_id
    add_index :file_revision_facts, :change_type_id
    add_index :file_revision_facts, :author_id
  end

  def self.down
    drop_table :file_revision_facts
  end
end

Note that unlike dimensions, fact tables are pluralized. This makes sense because the table holds a bunch of facts.

Now that the basics are set up, let’s run our migration:

rake db:migrate

Step 4 – Import the Raw Data

Next it is necessary to import the data into the data warehouse. Integration of data is often the most difficult and time-consuming task when you are developing a data warehouse. Data can arrive in a wide variety of formats, and the quality of incoming data often leaves much to be desired. There is a separate component in the ActiveWarehouse project called ActiveWarehouse ETL. While you do not necessarily need to use the ActiveWarehouse ETL, it may simplify certain tasks required during the ETL process. For this article we will be using ActiveWarehouse ETL. You can install the current version of ActiveWarehouse ETL using gem: gem install activewarehouse-etl. More information on ActiveWarehouse ETL can be found at http://activewarehouse.rubyforge.org/etl.

Once you’ve installed the ActiveWarehouse ETL tool you should first create a db/etl directory. You could put your ETL scripts anywhere, but I am selecting the db directory because it makes the most sense to me. Next you need to create ETL control files. I am going to process the Rails Subversion log in multiple passes, once for each dimension.

First the author dimension:

# Control file for the Rails Subversion log (in XML format)

source :in, {
  :file => 'rails-log.xml',
  :parser => :xml
}, 
{
  :collection => 'log/logentry',
  :fields => [
    {
      :name => :name,
      :xpath => 'author'
    }
  ]
}

destination :out, {
  :file => 'author_dimension.txt',
  :unique => [:name],
  :separator => "\t" 
}, 
{
  :order => [:name],
  #:virtual => {
  #  :id => :surogate_key
  #}
}

post_process :bulk_import, {
  :file => 'author_dimension.txt',
  :truncate => true,
  :columns => [:name],
  :target => {
    :adapter => 'mysql',
    :database => 'rails_warehouse_development',
    :table => 'author_dimension',
    #:username => 'root'
    #:host => 'localhost',
    #:password => 'password'
  }
}

That’s a pretty complex file there, let’s look at what it is doing. It has three directives: the source, the destination and a post processing directive.

ETL Source Directive

The source directive takes 3 arguments: the name of the source (any unique name, in this case :in), the source configuration and the source definition. The source configuration and source definitions are hashes with various options defined.

In the configuration the :file option is a relative path (from the control file) to the source file. In this case it is a file called rails-log.xml which is in the db/etl directory. The :parser option defines the parser to use. ActiveWarehouse ETL currently supports three parsers: fixed_width, delimited and xml. Each has various options, however that is for another article. For now we are using the xml parser since the raw data is in XML format.

The source definition hash varies depending on what type of parser is used. In this case there are two options. The first is the :collection option which defines the XPath to the element which represents each entity in a collection to be loaded. The second option is the :fields option which is mapped to a list. Each item in the list can either be a symbol or a hash. If it is a symbol then it will create a field with the symbol name, using the name as the expected element name and using the :string type. If it is a hash (as it is in this case) then you specify those field options explicitly. The field options are :name, :xpath (which can be used for a custom xpath to the value) and :type which will define the type (using ActiveRecord types).

ETL Destination Directive

The destination directive defines where the processed data will be sent. Like the source the destination takes three arguments: the destination unique name, the hash of configuration options and the mapping hash. The ActiveWarehouse ETL currently supports writing to files or directly to a database. In this case we will write to a file called author_dimension.txt. The fields will be ordered according to the :order array in the mapping hash. Additionally it will force the name field to be unique in the file.

There is much more you can do with the destination directive, but for now the information above should suffice.

ETL Post Processing Directive

The post processing directive lets you do additional processing after the data has been processed and written to its destination. In this case the post processing will use the MySQL bulk loading functionality to load the data. Currently only MySQL bulk loading is supported, however in the future other RDBMS will be supported.

Executing the Load

Once the control file is created you can load it using the etl command:

etl author_dimension.ctl

This should populate the dimension. On my machine it takes about 75 seconds, which is a little over 1 second per 1000 lines. A large amount of the time is spent parsing the XML document, on my machine anywhere from 25 to 35 seconds.

Each dimension may require different handling The load files can be found in the downloadable Rails Warehouse application. The file dimension is handled like the author dimension, by parsing the XML document and extracting the required data. The change type dimension uses a simple flat file load. The date dimension is populated using a Rake task which is included with ActiveWarehouse (in this case rake warehouse:build_date_dimension START_DATE='11/01/2004').

The Fact Table

Now that the dimensions are populated the fact table can be populated. Unfortunately getting the facts out of the XML file is a bit more complicated than with the dimensions. The built in parser cannot handle the more complex requirements of connecting the fact to the appropriate dimensions. There is an easy solution though: create a custom parser which can be plugged into the ETL system. The following code is placed in a file called file_revision_parser.rb which is located in the same directory as the file_revision_facts.ctl control file.

$:.unshift(File.dirname(__FILE__))
require File.dirname(__FILE__) + '/../../config/environment'

module ETL
  module Parser
    class FileRevisionParser < ETL::Parser::Parser
      include Enumerable
      def initialize(source)
        super
      end

      def each
        Dir.glob(file).each do |file|
          doc = nil
          t = Benchmark.realtime { doc = REXML::Document.new(File.new(file)) }
          Engine.logger.debug "XML parsed in #{t}s" 

          doc.elements.each('log/logentry') do |log_entry_element|
            revision = log_entry_element.attribute('revision').value
            date_id = lookup_date_id(log_entry_element.text('date'))
            author_id = lookup_author_id(log_entry_element.text('author'))
            log_entry_element.each_element('paths/path') do |path_element|
              row = {:file_changed => 1}
              row[:revision] = revision
              row[:author_id] = author_id
              row[:date_id] = date_id
              row[:file_id] = lookup_file_id(path_element.text)
              row[:change_type_id] = lookup_change_type_id(path_element.attribute('action').value)
              yield row
            end
          end
        end
      end

      private
      def lookup_date_id(date)
        @dates ||= {}
        d = Time.parse(date)
        ds = d.strftime("%m/%d/%Y")
        # date_dimension = DateDimension.find_by_date(ds)
        unless @dates[ds]
          date_dimension = DateDimension.find_by_sql_date_stamp(d)
          raise "Date not found for date string #{date}" unless date_dimension
          @dates[ds] = date_dimension
        end
        @dates[ds].id
      end

      def lookup_author_id(name)
        @authors ||= AuthorDimension.find(:all)
        @authors.each do |author|
          return author.id if author.name == name
        end
        raise "Author not found for name #{name}" 
      end

      def lookup_change_type_id(action)
        @change_types ||= ChangeTypeDimension.find(:all)
        @change_types.each do |change_type|
          return change_type.id if change_type.change_type_code == action
        end
        raise "Change type not found for action #{action}" 
      end

      def lookup_file_id(path)
        @files ||= FileDimension.find(:all)
        @files.each do |file_type|
          return file_type.id if file_type.path == path
        end
        raise "File not found for path #{path}" 
      end

    end
  end
end

And then the control file:

# Control file for creating the file revision facts from the Rails Subversion log (in XML format)

require 'file_revision_parser'

source :in, {
  :file => 'rails-log.xml',
  :parser => :file_revision
}, 
{}

destination :out, {
  :file => 'file_revision_facts.txt',
  :separator => "\t" 
}, 
{
  :order => [:date_id,:file_id,:change_type_id,:author_id,:revision,:file_changed]
}

 post_process :bulk_import, {
   :file => 'file_revision_facts.txt',
   :truncate => true,
   :columns => [:date_id,:file_id,:change_type_id,:author_id,:revision,:file_changed],
   :target => {
     :adapter => 'mysql',
     :database => 'rails_warehouse_development',
     :table => 'file_revision_facts'
   }
 }

The first line of the control file requires the file_revision_parser source code. The source directive then defines the parser using :parser => :file_revision which will be converted at execution time to the class name ETL::Parser::FileRevisionParser. The remainder of the control file has directives similar to the previous control files. To populate the fact data execute the command etl file_revision_facts.ctl.

Step 5 – Fact and Dimension Model Directives

The next step is to modify the fact and dimension model classes to add some declarations which will be used by the ActiveWarehouse plugin when rendering reports. Dimensions can include one or more directives defining hierarchies which can be drilled down. For example, the DateDimension we will declare a hierarchy drilling down through the calendar year:

class DateDimension < ActiveWarehouse::Dimension
  define_hierarchy :cy, [:calendar_year,:calendar_quarter,:calendar_month_name,:calendar_week,:day_of_week]
end

This define_hierarchy declaration will allow the user to drill down from calendar year, to calendar quarter, all the way down to day of the week. You must define at least one hierarchy for each dimension if it will be used in a report, even if that hierarchy only has one level.

Facts require different types of directives. The aggregate directive indicates that a field should be aggregated. This means it will be summed in each level of the hierarchy. In our fact there is only one aggregate field, the :file_changed field. You may also want to add ActiveRecord belongs_to directives to your fact so that you will have those relationships defined and available should you choose to list the facts directly. The final FileRevisionFact model looks like this:


class FileRevisionFact < ActiveWarehouse::Fact
  aggregate :file_changed

  belongs_to :date, :foreign_key => 'date_id', :class_name => 'DateDimension'
  belongs_to :file, :foreign_key => 'file_id', :class_name => 'FileDimension'
  belongs_to :change_type, :foreign_key => 'change_type_id', :class_name => 'ChangeTypeDimension'
  belongs_to :author, :foreign_key => 'author_id', :class_name => 'AuthorDimension'
end

Facts can also define calculated fields which will be covered in a future article.

Step 6 – Cubes

Now that you have your fact and dimensions it is time to define a cube which will provide the basis for slicing and dicing the data. A cube reports on a fact and pivots on two or more dimensions. We’ll start with a cube that allows us to see reports by date and author. Generating a cube is accomplished using the following command: script/generate cube RevisionsByAuthor. This will create a RevisionsByAuthorCube model and unit test stub. Once the model is created you must tell it what it will be reporting on and pivoting on:

class RevisionsByAuthorCube < ActiveWarehouse::Cube
  reports_on :file_revision
  pivots_on :date, :author
end

Step 7 – Controller and View

Now that the cube has been created it is time to build the controller and view.

script/generate controller RevisionReports index

Cubes are rendered in reports and reports are configured in the controller. Change the RevisionReportsController to look like this:

class RevisionReportsController < ApplicationController

  def index
    @report = ActiveWarehouse::Report::TableReport.new(
      :title => "Revisions by Author",
      :cube_name => :revisions_by_author, 
      :column_dimension_name => :date, 
      :row_dimension_name => :author
    )
  end
end

You are creating a table report for the specified cube and dimensions. The @report instance variable will then be passed to the render_report method which is part of the ReportHelper class (that is automatically included):

<h1>Revisions By Author</h1>
<%= render_report @report %>

There are numerous features which can be specified both in the report as well as options for the render_report method, but for this article I will keep it simple.

Step 8 – Rebuild the Warehouse

Now that all of the code is in place it is time to “rebuild” the warehouse. This is the step which populates all of the aggregate data required by the warehouse. From the command line enter rake warehouse:rebuild. ActiveWarehouse will create the necessary tables and automatically populate them for all of the required aggregate variations based on the directives provided in the cubes you define.

Step 9 – See it in Action

Launch your server if it is not already running and point your browser to http://localhost:3000/revision_reports and you should be seeing something like this:

Congratulations, you’ve built your first data warehouse with ActiveWarehouse!

This is just the start but hopefully it has given you an idea of what can be accomplished with ActiveWarehouse. In future articles I will cover specific features in depth and will also cover new developments in the ActiveWarehouse library. Feel free to drop me a line at anthonyeden@gmail.com if you have any comments or other feedback.

Update 1: The rails_warehouse code is available from the ActiveWarehouse subversion repository: svn checkout svn://rubyforge.org/var/svn/activewarehouse/rails_warehouse/trunk. You can view it online too.

Sorry, comments are closed for this article.