Staging All Tables from an RDS Database

Staging all Tables from an RDS Database  

A common request is to load all data from an RDS database. This article looks at loading all data from a MySQL RDS database but the concept is the same for all RDS databases. This example takes data from an RDS database called classicmodels which has a series of tables in it containing data on classic car orders.  

The RDS Query component will only load one table at a time however a table iterator can be used to load all tables from a given database.  

The first step in this is to obtain the name of all tables in the RDS database. This can either be a manually maintained list, or can be obtained by querying the information schema. This query can be put in an RDS Query component to write the results back to a temporary table:  

 

 

The result of this is a table containing the name of all tables in the classicmodels database. This table can be linked to a Table Iterator component in Matillion which can be used to loop through those tables. In the table Table Iterator, the column table_name needs to be mapped to a variable:

 

The Table Iterator is linked to another RDS Query component with a simple SELECT * FROM query to pull all data from the table being iterated through. The table this is written to should contain the name of the variable to avoid having one table continually overwritten:  

This will create a series of new tables with the contents of the classic models database which can then be used in a transformation job.  

Finally, we recommend using a Delete Tables component to tidy up the classic_models_table used by the Table Iterator:

 

This article describes the steps to build a complete replication of your RDS database.  

<insert job>  

Incremental Loads

The next step after taking a one off copy is to do incremental load on all tables. This requires checking the last modified date on the table and feeding this into a variable to be used in the RDS Query Component.  

This can be done using a simple Python script within a Python script component:  

print table
cursor = context.cursor()
cursor.execute('select max(lastmodified) from ' + table)
lastmodified = cursor.fetchone()[0]  

print lastmodified  
context.updateVariable('LastModified', lastmodified)  

This is putting the last modified date from the table name from the variable ${table} into a variable created called ${LastModified}

 

This can then be sent to the RDS Query component to only pull through the recently modified data:  

 

This nested Orchestration job can then be called from within the iterator:

 

An example of this job is available at the bottom of the page.

Then if required the staged tables can be merged into persistent tables using our slowly changing dimensions example here.