Manage CDC
    • Dark
      Light

    Manage CDC

    • Dark
      Light

    Article Summary

    Overview

    Change Data Capture (CDC) is a Snowflake and Amazon Redshift for AWS only function, which uses AWS DMS (Database Migration Service) and Amazon S3 to check for updates to the source database and update the relevant tables within Matillion ETL.

    CDC in Matillion ETL works by DMS, checking your source database for changes, and recording those changes on an S3 bucket. This S3 bucket then triggers an event that causes Matillion ETL to pull data from the bucket and update tables accordingly. Thus, any changes made to the source database will be reflected in their respective data warehouse tables. All changes are incremental, with a default of five minutes between updates.

    Note
    • This feature is currently only available to AWS-hosted Matillion ETL instances.
    • CDC allows users to synchronise their data with their own database, ensuring the former is continually and automatically kept up to date with the latter.
    • CDC will incur some additional costs, namely for a DMS (Database Migration Service) Replication Instance and a Lambda Function, which are required for the CDC process.

    Manage Change Data Capture

    To access the Manage CDC Panel in Matillion ETL, browse to Project Menu → Manage Change Data Capture.

    This should open a CDC tab in the main panel of the client. Any existing CDC Tasks are listed here, along with their Identifier, Status, and Source. These details will not be updated automatically, and must be refreshed using the cyclical refresh button. All CDC Tasks and configurations are Project-specific.

    Additional Information for CDC

    For a CDC Task to be fully functional, it requires:

    Inside AWS Console:

    1. The source database to be properly configured. Source tables must have primary keys.
    2. A DMS Replication instance to be set up by the user. This is used to 'listen' to the source database for changes and then write these changes to S3.
    3. A FIFO Queue to be set up by the user that is not being used for any other purpose other than CDC Tasks on this specific Matillion ETL Project.
    4. An S3 bucket.
    5. Instance Credentials, Lambda Role ARN, and S3 Role ARN, giving correct permissions to be set up by the user. (See sections: Manage CDC Credentials, Lambda Role ARN, S3 Role ARN). A full list of all permissions required can be found on the IAM Roles & Permissions article.

    Inside the Matillion ETL client:

    1. CDC Configuration to be correctly Managed in the Matillion ETL client. (See section: New CDC Configurations).
    2. A new CDC Task to be correctly set up in the Matillion ETL (See section: Adding a new CDC Task).

    CDC Tasks work differently depending on whether a Task is in the initial run or not, illustrated by the below diagrams.

    Initial Run

    On the CDC task's initial run, the entire table is loaded into S3 which Matillion ETL then gathers the table structure for and uses it to construct a table in the data warehouse.

    Subsequent Runs (Incremental Load)

    After its initial run, CDC continues to write any changes to the source database into S3 which a lambda function writes back out with line numbers. Matillion ETL periodically checks the S3 Out bucket for new files and merges them into the target table(s).


    New CDC Configurations

    Before creating a new CDC Task, you must enter a CDC Configuration that tells the Matillion ETL Project which credentials to use to listen for incoming data. The plus button to add new Tasks will be unavailable until this step is completed. By clicking Manage will open a menu that allows the user to edit the CDC Configuration.

    Please provide the details of the parameters required.

    • Credentials: The AWS Credentials used for various AWS resources including; Lambda Functions, listing Replication Instances, stopping/starting DMS Tasks and S3 Buckets and deleting resources when removing a task. Clicking Manage will allow you to create or edit user-defined credentials. More information on these credentials can be found in the Manage CDC Credentials section below.
    • Region: The AWS Region that contains the FIFO queue you wish to use.
    • Create Queue: Check this box to create a Queue, for example if the Queue is not stored within the CDC Queue config. If this box is checked, it will supersede the "Listen Queue" field.
    • Listen Queue: The FIFO Queue to be used to listen for incoming changes.
    Note

    This FIFO Queue MUST only be used for CDC on this specific Project and nothing else.

    • **Lambda Role ARN: ** The ARN of an existing IAM role to use as the AWS Lambda Execution Role. Ensure it has read/write access to the S3 buckets you wish to use for CDC. This Role must have been set up by the user in advance. For more information, please see the Lambda Role ARN section below.

    New CDC Tasks Configuration

    After completing the CDC Configuration, clicking the plus button will let you set up a new CDC Task. CDC Tasks will cause a number of job queues equal to the number of tables being updated. Care should be taken when keeping track of large numbers of tables.
    Please follow the steps below to create a new CDC task:

    1. Task Configuration Setup

    After clicking the plus button, Change Data Capture Setup window opens. Provide the Task Configuration details as required and click Next.

    • Task Name: An arbitrary name given to your new CDC Task. This name will be used for the newly created Job in the current Project as well as the name of the DMS Task.
    • Replication Instance: Select the name of your Replication Instance. This must be associated with the credentials used in your CDC Configuration.

    2. Source Endpoint Configuration

    Please provide the details to setup the source endpoint and click Next.

    • **Endpoint Name: **An arbitrary name given to your new DMS Endpoint that points to your Source Database.
    • **Database Type: **The type of database to be used as a data source. Currently supported database types are MySQL, Oracle, SQL Server (Microsoft Driver), and PostgreSQL.
    • **Host Address: **The address of your source database.
    • **Port: **The port number for connecting to your source database. Not always required.
    • **Database: **The name of the database to connect to.
    • **SSL Mode: **The process of verification for your SSL connection. verify-ca will verify that the server is trustworthy by checking the certificate chain up to a trusted certificate authority (CA). **verify-full ** will also verify that the server host name matches its certificate. The SSL connection will fail if the server certificate cannot be verified. verify-full is recommended in most cases.
    • Username: The username to log into your source database. Not always required.
    • Password: The password to log into your source database. Not always required.

    Scrolling down, you can find the following Connection Option properties which can be edited to add or remove connection options:

    • JDBC Connection Options: User-defined JDBC connection options for the Database Type of choice that Matillion ETL uses in its connection to the source database during the initial CDC run.

      • Microsoft SQL Server - see this article for instructions on managing drivers.
    • **DMS Connection Options: ** User-defined ODBC connection options for the Database Type of choice that the Replication Instance uses in its connection to the source database.

    These options can be tested using the Test button beneath.
    Clicking Next will take a minute to confirm your configuration before moving on to the next part of the setup.

    3. Creating Source Endpoint

    On this window, Matillion ETL is attempting to validate the process and create a DMS Endpoint. The output status will be Successfull if the connection validated and status will be Failed if it cannot, or if the source database cannot be reached. Once validation process ends, click Next.

    4. Target Endpoint Configuration

    Now we're setting up your S3 to accept table changes and your target database to be updated with them. It might be helpful to think of this as the logical counterpart of the config we just completed; first we connect CDC to the source, now we connect it to the target.

    • Endpoint Name: An arbitrary name that will be the name of the DMS endpoint.
    • S3 Role ARN: IAM Role ARN to attach to the DMS Task Endpoints that gives access to S3 buckets. This Role must be set up in advance by the user - see the S3 Role ARN Section below for details.
    • Bucket Name: The name of the desired S3 bucket. These are autodetected by the credentials you used in the Manage menu.
    • **Batch Interval (s): ** Timing (in seconds) between checks to see if there are differences between your target and source datawarehouse and MYSQL, respectively). We use a default of 300 seconds and suggest anywhere between 60 and 3600. Note that larger CDC Tasks require more time to complete and there is little point having a Max Batch Interval lower than the time it takes for the full CDC Task to complete.
    • Enable Logging: Enable logging of information of migrations during this CDC Task.
    • Table Selection: User-defined rules on which tables to include in the CDC Task (which source tables we want to keep updated in your data warehouse).
      • One source table results in one Matillion ETL job updating one table. Including hundreds or thousands of source tables can potentially bottleneck your Matillion ETL instance.
      • Includes nothing by default. A rule to include tables must be created before a rule to exclude any.
      • A % symbol denotes any number of wildcard characters. Using a lone **% ** will match all tables, such as the Include example below.

    - **Schema name is: **The schema name in your source database that holds the desired tables.
    - Table name is like: A string to match source table names by. Using a lone % denotes any and all tables in the source.
    - **Action: **Whether this rule is to Include or Exclude matched tables. All tables are excluded by default. You must provide an Include action before providing an Exclude action to filter it if desired.

    5. Creating Target Endpoint

    6. Job Configuration

    Next, provide the details for the Job Configuration.

    • **Environment: ** The Environment to run the job in.
    • **Staging Schema: ** The schema that contains the table(s) you wish to use for the staged data. The staged data is the raw change data from the source database.
    • **Staging Table Prefix: **An arbitrary prefix for target table names.
    • **Target Schema: **The schema that is to contain the target table(s). The target data is an identical copy of the source data.
    • Target Table Prefix: An arbitrary prefix for target table names.
    • Job Folder: The folder in which to put the CDC job in Matillion ETL. Clicking Manage will open an explorer where you can select a folder as shown below.

    Clicking Finish will create a DMS Target Endpoint with the name specified in the Target Name field. If that is successful completing the cdc setup on the following page will create the DMS Task and a Run Orchestration job with the same name as the DMS Task is created on the client in the current Version.

    Matillion ETL will prompt for the user to select whether they want to run the CDC Task immediately or not.

    Regardless of your choice, you should see your new CDC Job in the folder you choose. This Job can be edited as like any other Job and it is often desirable to place a workflow after the CDC component such that a workflow is instigated after each update. The job is parameterised using Job Variables according to the CDC Task configuration you completed in the steps above. These Job Variables can be changed to alter the CDC Job's behaviour but this is absolutely not recommended.


    Running and Managing CDC Tasks

    From the Manage CDC Panel (Project Menu → Manage Change Data Capture), CDC Tasks are listed by Identifier with their Status, Source and Completion percentage displayed. Beside this information, there are buttons to Stop or Start the CDC Task and an X button to Delete the CDC Task.


    Starting and Stopping Jobs

    When started, the initial run of a CDC Task will update all source tables fully and bring them into the data warehouse. After this, the CDC Task will continue to run and incrementally check for updates to the source tables until the Task is stopped via the Manage CDC panel.

    When stopping a CDC Task, users can choose to either pause the task or fully stop it. The former can be resumed later while the latter will instigate a full update upon restarting, as though the task had never run before. Thus, it is more expensive (in both computational resources and monetary cost) to fully stop and resume a job.

    CDC tasks can be paused and resumed or started again from scratch (fully reloading all tables).

    Note

    You must click refresh button to see the updated progress for CDC Tasks.


    Monitoring CDC Tasks with AWS CloudWatch

    Users can click the encircled i to open the AWS CloudWatch for visibility of their given CDC Task.

    Note

    Upon clicking i to open AWS CloudWatch, users may need to log into the AWS Console.


    Deleting CDC Tasks

    Tasks can be deleted if they are stopped using the X at their right side. Clicking this button will bring up a new dialog that asks the user to confirm the deletion and lists AWS resources that will be automatically destroyed as part of this CDC task being destroyed.


    Resolving Errors on Completed CDC Tasks

    When a completed CDC Task ends with an error, users can click the wrench icon (highlighted in the below screenshot) to open the Resolve CDC Task Issues window.

    This window will provide the name and description of the error.

    Clicking OK will prompt CDC to attempt to resolve the error.

    Certain examples where CDC can resolve errors upon task completion include:

    • Deleting a task such as DMS, in the AWS console.
    • Deleting an event notification in the AWS console.
    • Deleting an Orchestration Job in Matillion ETL.

    CDC State Change Refresh

    CDC "listens" to internal state broadcasts and refreshes accordingly. For example:

    • When the state of a task is changed, for example renaming, starting, stopping, or deleting a task.
    • When Endpoints are validated.
    • Lambda deletion.
    • When CDC Config is saved.
    • Any other instance where a command produces a state change within CDC.

    The CDC grid refreshes on any state change and this refresh can be observed when refresh button in the toolbar changes to a loading spinner.


    Manage CDC Credentials

    The credentials requested in the Manage CDC dialog can either be "Instance Credentials" or "User-Defined Credentials". These are used for AWS resource creation and modification, and for listening to the SQS queue. Since these credentials need only be used for the CDC Configuration, we recommend:

    1. Making a new IAM User in the AWS Console.
    2. Create a new policy (attached to the user) with the actions given below.
    3. Taking the Access Key and Secret Key for this user.
    4. Creating User-Defined Credentials in Matillion ETL using the Access Key and Secret Key.
    5. Using the new User-Defined Credentials solely for your Manage CDC Configuration.
    Note
    • The below policy makes all resources available to the role but security-conscious users may opt to add their own resource ARNs.
    • The sqs:listqueues and iam:listroles permissions are optional as they are used solely for populating selection boxes in the UI which can be filled in with the correct values regardless.
    {
    
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Action": [
    "sqs:CreateQueue",
    "sqs:DeleteMessage",
    "sqs:GetQueueAttributes",
    "sqs:GetQueueUrl",
    "sqs:ChangeMessageVisibility",
    "sqs:ReceiveMessage",
    "s3:PutBucketNotification",
    "s3:GetBucketNotification",
    "dms:CreateEndpoint",
    "dms:CreateReplicationTask",
    "dms:DeleteEndpoint",
    "dms:DeleteReplicationTask",
    "dms:DescribeConnections",
    "dms:DescribeEndpoints",
    "dms:DescribeReplicationInstances",
    "dms:DescribeReplicationTasks",
    "dms:ModifyEndpoint",
    "dms:StartReplicationTask",
    "dms:StopReplicationTask",
    "dms:TestConnection",
    "ec2:DescribeRegions",
    "iam:ListRoles",
    "iam:PassRole",
    "lambda:AddPermission",
    "lambda:CreateFunction",
    "lambda:DeleteFunction",
    "lambda:GetFunction",
    "lambda:GetPolicy",
    "lambda:RemovePermission",
    "lambda:TagResource",
    "lambda:UpdateFunctionCode",
    "lambda:UpdateFunctionConfiguration",
    "s3:ListAllMyBuckets",
    "sqs:ListQueues"
    ],
    "Resource": "*"
    }
    ]
    }
    

    Lambda Role ARN

    The Lambda Role ARN is used as part of the Manage dialog. Your Lambda Role will need access to the "S3 bucket" and "SQS Queue" used in the CDC process. Create a new Lambda Role and attach the policy below. When creating a new Role, ensure that the type of trusted entity is set to AWS Service and Lambda.

    Selecting the new Lambda Role in the AWS Console (IAM section) will expose the Role ARN that can then be used in Matillion ETL. The below policy makes all resources available to the role but security-conscious users may opt to add their own resource ARNs.

    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "sqs:GetQueueUrl",
    "sqs:SendMessage"
    ],
    "Resource": "*"
    }
    ]
    }
    
    Note

    The following additional IAM permissions are required to allow the Lambda function to write to CloudWatch logs. Users can either explicitly define the below IAM permissions, or add the existing "AWSLambdaBasicExecutionRole" AWS Policy:

    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents"
    ],
    "Resource": "*"
    }
    ]
    }
    

    S3 Role ARN

    The S3 Role ARN is used in the creation of a new CDC Task. This is a Role that has DMS as a Trusted Entity (ergo, the Role has DMS as a Trust Relationship) and gives your Replication Instance access to the S3 bucket it needs to write changes to. Create a new DMS Role and attach the policy below.

    Note

    The S3 Role ARN is named so due to the permissions it gives. However, it MUST be attached to the DMS service. When creating a new Role, ensure that the type of trusted entity is set to AWS Service and **DMS **is selected as the service.

    Selecting the new Lambda Role in the AWS Console (IAM section) will expose the Role ARN that can then be used in Matillion ETL.

    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Action": [
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:ListBucket"
    ],
    "Resource": "*"
    }
    ]
    }
    
    Note

    The below policy makes all resources available to the role but security-conscious users may opt to add their own resource ARNs.


    CDC Log Table

    The shared job writes a log table to the defined staging schema:

    <staging schema>.mat_cdc_log

    This will be created by the Shared Job the first time a CDC Task executes in that staging schema, it contains the following columns:

    • cdc_task_name: The CDC Task Name as entered when the CDC Task was created
    • **cdc_target_table: **The name of the table being updated in the
    • cdc_dms_timestamp: The timestamp of the file produced by DMS (initial load files will be listed as 1900-01-01 00:00:00 as they have no timestamp)
    • cdc_mat_jobstart: The timestamp of when Matillion began processing the file
    • **cdc_s3_bucket:**The bucket containing the DMS file
    • cdc_s3_filename: The path to the DMS file within the bucket
    • cdc_table_status: The status code of the load of the file. Any non-zero value against table will cause subsequent loads to fail. However, a value of 99 can also indicate a load is still in progress and will be updated to 0 when the load completes successfully. Possible codes are:
      • **0: **Load successful
      • **1: **A previous load for this table failed or the file is dated before the last loaded file - the file has not been applied to the target Table
      • **99: **The is still in progress or has failed load - no further update files will be applied for this file

    One row will be written for each DMS file Matillion ETL attempts to load - the table is designed to act as a complete history of each table.

    Useful Queries

    Find any loads in progress, or loads that did not complete successfully:

    SELECT * 
    FROM mat_cdc_log
    WHERE cdc_table_status = 99
    ORDER BY cdc_mat_jobstart DESC;
    

    Find any tables that have failed loads:

    SELECT DISTINCT cdc_task_name, cdc_target_table 
    FROM staging.mat_cdc_log
    WHERE cdc_table_status > 0
    ORDER BY cdc_mat_jobstart
    

    Find the complete load history for a given table:

    SELECT * 
    FROM staging.mat_cdc_log
    WHERE cdc_target_table = '<table name>'
    ORDER BY cdc_mat_jobstart DESC
    

    API

    CDC Configurations are available for importing and exporting via the Matillion v1 API. Below are examples using cURL with resource names (such as <Groupname> and <Projectname>) to be filled out by the user. CDC Configurations are Project-specific and thus can be found under the Project endpoint in the API.

    CDC Configurations can be exported to a JSON file as in the below example:

    curl -X GET -o mtln_project_schedules.json -u api-user:password https://<InstanceAddress>/rest/v1/group/name/<GroupName>/project/name/<ProjectName>/cdcconfig/export

    And imported, using the same format JSON file as below:

    curl -X POST -u api-user:password https://<InstanceAddress>/rest/v1/group/name/<GroupName>/project/name/<ProjectName>/cdcconfig/import --data-binary @mtln_project_schedules.json


    CDC Queue Status

    On the right-hand-side of the Manage CDC tab, users can review the status of their CDC FIFO Queue. The CDC Queue Status panel provides the FIFO Queue name, the number of Messages available, and the number of Messages in flight. Users also have the option to delete their Queue in this panel.


    CDC Lambda Status

    The CDC Lambda Status panel clarifies the Lambda Function name, the date and time of its last modification, the version of the Lambda Function, and links to both the AWS CloudWatch logs and the Lambda console. Users also have the option to delete their Lambda Function in this panel.
    If your Lambda Function version is out of date, a green Update button will appear in the status panel with a message to update.
    When the refresh button (circular arrow) is clicked, these panels will update their information.


    Video