Connecting To Any REST API

Overview

Matillion can extract and load data from most JSON and XML-based REST APIs, and also from some SOAP APIs (those which behave RESTfully).

Good positive indicators are:

  • You can get data from the API using a cURL or Postman command

  • The returned data contains an array (or repeating element) at the granularity you require

  • The documents are regular - i.e. the structure does not differ record by record

  • If the API uses paging, it provides a mechanism that the client can use to navigate from page to page (for example providing a URL which points to the next page)

 

Note that Matillion’s API Query component can not load data from:

  • Data other than valid JSON or XML

  • Binary or unstructured data, such as PDF or Word documents, or images

  • HTML

  • APIs where the data contains embedded CSVs

  • APIs which return XML embedded inside JSON, or vice versa

  • APIs in which the returned content is encoded or encrypted

 

Matillion’s API integration is known as an “API Profile” and is managed through the Project / Manage API Profiles menu. Matillion can contain multiple API Profiles, and each API Profile can contain one or more “files”. A “file” is either a sample data file (JSON or XML) or an RSD file (identified by a .rsd suffix). RSD files contain all the logic needed for Matillion to interact with an API. They are expressed in a declarative XML language.

Integrating with a new API is a four-step process:

  1. Find the API documentation

  2. Load the first page

  3. Establish Paging

  4. Productionize

 

Find the API documentation

In this step, you discover how the API works, the data formats and the endpoint URLs.

Your Matillion instance must be able to access those API endpoint(s) over the network. You can test the connectivity using the Network Connectivity Test shared job which you can download from the Data Stagers Support page. If Matillion does not have network connectivity to the API endpoint(s) you must get this resolved before continuing.

The API documentation should help you answer the following questions:

  • How do you authenticate?

  • How do you supply parameters? (is it in the URL or as a POST body)

  • Does the API provide data at the necessary granularity (for example as a JSON array)?

  • Does the API use paging, and if so how does it provide the information needed to page through the result set?

 

Load the first page

Having checked the API documentation, the next goal is to get Matillion to read the first page of data. This is achieved in several steps:

  • Use cURL (or Postman) to get a response back from the API endpoint you are trying to interface with, and save that data as a JSON or XML file.

  • Use the file to help build the first RSD of your API Profile.

To save the data returned by a cURL command into a file, use the -o option. Consult the API documentation to find the full cURL command required.

curl -o page1-raw.json ...

It’s a good idea to run the resulting file through a prettifier. This verifies that it’s valid, and also makes it much easier to understand.

For JSON, you could use jq:

jq "." < page1-raw.json > page1.json

For XML, you could use xmlstarlet.

xmlstarlet fo < page1-raw.xml > page1.xml

Verify that the file contains an array or repeating element at the desired granularity. Take a note of the path to this repeating group.

Create your new API Profile from Matillion.

Add a new File:

Name the file “page1” without a .json or .xml suffix.

Upload the contents of the data file. You might need to copy-and-paste for this step, in which case be careful to include the entire file.

You’ll need to find the full path of the newly created file, which you can do with a Bash Script component containing this command:

find / -name "page1.json" -print 2>/dev/null


Check the task panel and copy the full path from there.
<

Now you can use the file to generate an RSD. In Matillion’s API Profile editor, press the Generate button.

In the popup dialog, supply values for

  • Table Name - use a valid database table name such as t1
  • Description - a short description such as t1 test
  • XPath to Repeat Element - in the form /path/subpath
  • Data format - set to JSON or XML
  • URI - set to the full path of the newly created file as retrieved earlier
  • ​Properties - leave blank
 

Press OK and Matillion will create a new RSD file, named after your chosen Table Name with a .rsd suffix.

Press the Test button, and your Table Name should appear under Tables. Click once on the table name, and Matillion will run the newly created RSD logic over the static file.

View the sample data at the bottom of the dialog and make sure it is retrieving the data at the expected granularity.

You may wish to adjust the fields returned, which you can do by editing the <attr> elements near the top of the RSD (lines 6, 7 and 8 in the example above).

You should also remove the three rsb:script sections associated with INSERTS, UPDATES and DELETES, as these cannot be used by Matillion.

Once you have your API Profile configured and successfully tested against your static file, you can switch to using the real API. There are two steps:

  • Replace the "uri" value with the actual API endpoint

  • Add the necessary authentication parameters

 

Switching to the real API

URLs and authentication parameters will need to change in a real Production environment, but in order to get the connectivity working we’ll initially use hardcoded values. Once the functionality has been verified, the hardcoded values will be removed as discussed later.

Replace the "uri" value with the actual API endpoint

This involves updating the “uri” attribute to the one you used in the original cURL or Postman command:

Before:

After:

Add the necessary authentication parameters

Many different authentication methods are in common use, and they are discussed in this support document. Often the requirement is to add a custom HTTP header, which you can do by adding two lines near the top of the RSD. It’s usually good to put these just before the place where the uri attribute is set.

Once you have saved the changes to the RSD file, press the Test button again. A new record for t1 should appear again at bottom left. Left-click the t1 listing and Matillion will attempt to contact the real API over the network. Data should appear at bottom right.

If no data appears and it takes a long time (e.g. exactly 60 or 120 seconds), or if you receive timeout errors, review your network connectivity using the Network Connectivity Test shared job as described earlier.

The displayed data might be different from that displayed previously, since it’s coming from the real API this time. Nevertheless it should be at the same granularity.

You can use the API Profile in an API Query component now, and it should successfully return data. However paging has not yet been enabled, so it will only return records from the first page.

 

Establish Paging

Reaching this step of the process you should already have a working API Profile which will retrieve one page of data.

Most APIs don’t return all the data in a single response. This would be very inefficient and error-prone. Instead, most APIs use paging, which can help return a large number of records in smaller batches.

Several additional RSD features need to be added to enable paging.

First, add a new input named 'Rows@Next' to the list of columns inside the existing rsb:info block.

<input name="Rows@Next" desc="Identifier for the next page of results" />

You must also add the following to the rsb:script section. A good position is just after the authentication settings.

<rsb:set attr="EnablePaging" value="TRUE" />

Now you need to make the parameters of the API call differ for each call of the paging operation. This is how the API server distinguishes between requests for different pages.

Many APIs provide the client with the next URL to be called in order to fetch the next page. These are straightforward: you need to do two things

  1. Set Rows@Next using information in the returned data - usually a full or partial URL pointing to the next page of results

  2. Replace the hardcoded "uri" attribute with one that uses Rows@Next if it has been set

Set Rows@Next using information in the returned data

You’ll need to identify the location of the next URL to call. If it’s a top-level element, for example named nextPage, then you can instruct the RSD to pick up the value by adding two more lines like below. A good position for these lines is just after the EnablePaging directive.

<rsb:set attr="elementmappath#" value="/nextPage" />

<rsb:set attr="elementmapname#" value="Rows@Next" />

Replace the hardcoded "uri" attribute

This can no longer be hardcoded, and instead needs to use the Rows@Next variable if it has been set.

In place of the simple <rsb:set attr="uri" … add a conditional like this:

<rsb:check attr="_input.Rows@Next">
 <rsb:set attr="uri" value="something using [_input.Rows@Next]" />
 <rsb:else>
   <rsb:set attr="uri" value="URI of the first page" />
 </rsb:else>
</rsb:check>

In the above example, the URI of the first page is the one already used in the previous section. The construction of the URI for subsequent pages depends upon the implementation details of the API.

 
Mechanics of paging

With the Rows@Next mechanism in place, the driver will automatically loop until either

  • No records were returned

  • Rows@Next is missing

  • Rows@Next did not change since the last iteration

Paging can be tricky to get right because RSD is a declarative language rather than an explicit loop, and so can be difficult to debug. It’s usually impossible to tell just from the Test screen whether or not paging has been invoked, because the Test screen only shows the first few records.

The best way to be certain that paging is working is to use the API Profile in an API Query component, and check that it successfully returns all the data that matches the query.

By default new API Query components have the Limit property set to 100. Remove this value to allow the component to retrieve all the data.

 

Productionize

By this step of the process you should have a fully working RSD, but most likely containing some hardcoded authentication parameters.

The RSD is fully functional as-is, but it’s bad practice to leave hardcoded authentication parameters in place, for two reasons:

  • It’s insecure: anyone logged into Matillion can read the credentials in plain text, so they need to be properly protected

  • Authentication parameters usually expire at some point, so the existing API Profile has a limited shelf life

 

To remedy these things, the RSD needs to be parameterized, and the authentication parameters need to be obtained at runtime and passed into the RSD.

For further reading on this subject:
 

  • This support article examples how to connect to the Salesforce REST API.
  • This support article describes how to use parameters with API profiles
  • ​This support article describes how to obtain an API token at runtime and pass it into an RSD script
  • This support article describes how to implement password management functionality, to protect your private API credentials