Azure Blob Load Generator

Overview

The Azure Blob Load Wizard is a tool that helps users load delimited data from public objects held in Azure Blob Storage.

Unlike most components, this load wizard does not appear as a standalone component when added to the Matillion ETL job canvas. Instead, it takes the form of a tool that lets users load and view files on the fly, altering load component properties and observing their effects without the need for a separate Transformation job. The wizard can also guess the schema of a table, relieving the end user of much of the work.

This component requires working Azure credentials with "read" access to the Azure Blob Storage location containing the source data file(s).
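
For readers who want to check this prerequisite outside Matillion, the short Python sketch below uses the azure-storage-blob SDK to read a small slice of a blob. It is only an illustrative check, not part of the wizard itself; the storage account, container, blob name, and SAS token are hypothetical placeholders, and for a genuinely public blob the credential can be omitted.

```python
# Minimal sketch: confirm "read" access to the source blob before using the wizard.
# Account URL, container, blob name, and SAS token are hypothetical placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://examplestorageaccount.blob.core.windows.net",
    container_name="example-container",
    blob_name="us_airports.csv",
    credential="<sas-token-with-read-permission>",  # omit for a public blob
)

# Downloading the first few kilobytes is enough to prove read permission.
sample_bytes = blob.download_blob(offset=0, length=4096).readall()
print(sample_bytes[:200])
```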

Layout

When a user drags the Azure Blob Load Wizard onto the canvas, a three-page setup wizard appears (as shown in the screenshots below).

A file must first be selected from an Azure storage account. This must be a delimited file (such as a .csv file) that the user has permission to access. If the file is compressed, the correct compression method must be selected; the supported compression methods are gzip and bzip2.
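
Continuing the sketch above, the snippet below illustrates (purely as an analogy, not the wizard's implementation) how the chosen compression method determines how sampled bytes are decoded; gzip and bzip2 are both covered by the Python standard library. `sample_bytes` carries on from the previous sketch.

```python
# Illustrative only: decode sampled bytes according to the selected compression method.
import bz2
import gzip

def decompress(raw_bytes: bytes, compression: str) -> bytes:
    if compression == "gzip":
        return gzip.decompress(raw_bytes)
    if compression == "bzip2":
        return bz2.decompress(raw_bytes)
    return raw_bytes  # Compression "None": the file is plain delimited text

text = decompress(sample_bytes, "None").decode("utf-8", errors="replace")
```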

If the selected file is viable, users can select the number of rows they wish to sample and click the Get Sample button. Here, the tool will load the file's data and attempt to guess its schema. Raw data for the file is displayed in the large panel below.
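
As a rough analogy for what Get Sample does, the standard library's csv.Sniffer can guess a delimiter, quote character, and header row from sampled text. This is only an illustration of the idea, not Matillion's actual implementation; `text` carries on from the sketches above, and the row limit of 1000 is an arbitrary example.

```python
# Illustrative guess at the file's dialect from a sample of its rows.
import csv
import io

sample = "\n".join(text.splitlines()[:1000])   # honour the chosen row limit
dialect = csv.Sniffer().sniff(sample)          # guesses delimiter, quoting, etc.
has_header = csv.Sniffer().has_header(sample)  # suggests an Ignore Header Rows value

rows = list(csv.reader(io.StringIO(sample), dialect))
print(dialect.delimiter, dialect.quotechar, has_header, len(rows))
```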

 

On page 2, column data is displayed in the lower-right panel and is available for editing. Schema configuration properties for the table are displayed in the lower-left panel; the available properties are detailed in the table in the "Properties" section below.

Selecting the Guess Schema button instructs the Load Wizard to attempt to guess as many of these properties as possible. However, all of the properties remain available for manual editing by the user, as does the column data in the lower-right panel.
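
The sketch below shows one simple way a column-type guess of this kind could work: try progressively stricter conversions over a column's sampled values and fall back to text. It is an illustrative assumption, not the Load Wizard's actual algorithm, and the generic SQL type names are placeholders.

```python
# Illustrative column-type inference over sampled values (not Matillion's algorithm).
from datetime import datetime

def guess_type(values, date_format="%Y-%m-%d"):
    for cast, type_name in (
        (int, "INTEGER"),
        (float, "FLOAT"),
        (lambda v: datetime.strptime(v, date_format), "DATE"),
    ):
        try:
            for value in values:
                cast(value)   # every sampled value must convert cleanly
            return type_name
        except (ValueError, TypeError):
            continue
    return "VARCHAR"          # fallback: treat the column as text

print(guess_type(["2021-01-01", "2021-02-14"]))  # DATE
print(guess_type(["1", "2", "3"]))               # INTEGER
```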

 

When the user is satisfied with the settings, the resulting table output can be viewed by selecting the Test button at the bottom of page 3 of the setup wizard. Running a test will create the requested table and show a sample for the user to inspect. If the user is unhappy with the output, they can return to the 'Configuration' panel on page 2 to make alterations until they are satisfied with the result.

 

Properties

For more information on the settings in this component, see the Snowflake COPY syntax; a rough mapping of these properties to COPY options is sketched after this table.

Property | Setting | Description
Table Name | Text | The descriptive name for the table.
Field Delimiter | Text | The character that separates fields in the input file. The default is a comma (,). A [TAB] character can be specified as "\t".
Field Optionally Enclosed By | Text | The character used to enclose strings. The value can be NONE, a single quote character ('), or a double quote character ("). When a field contains this character, escape it using the same character.
Ignore Header Rows | Text | The number of rows at the top of the file to ignore. Defaults to 0.
Date Format | Text | Defaults to 'auto'. This can be used to manually specify a date format.
Time Format | Text | Defaults to 'auto'. This can be used to manually specify a time format.
Time Stamp Format | Text | Defaults to 'auto'. Defines the format of timestamp values in the data files to be loaded.
Escape Unenclosed | Text | The single character used as the escape character for unenclosed field values only. The default is a backslash (\).
Escape | Text | The single character used as the escape character for any field values.
Null As | Text | Replaces the specified string with NULL in the output table. Use this if your data has a particular representation of missing data.
Trim Blanks | Checkbox | If enabled, Matillion ETL removes leading and trailing white space from the input data.
Error on Column Count | Checkbox | If enabled, a parsing error is generated when the number of delimited columns (fields) in the input data file does not match the number of columns in the corresponding table.
Empty As Null | Checkbox | If enabled, empty columns in the input file are loaded as NULL.
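
Because these settings mirror Snowflake's CSV file format options, it can help to see them side by side. The mapping below is an informal guide based on the Snowflake COPY documentation referenced above; it is an assumption about how the wizard's properties line up with COPY options, not a specification of what the wizard generates.

```python
# Informal guide (an assumption, not the wizard's exact output): each wizard
# property roughly corresponds to a Snowflake CSV file format option used by COPY.
PROPERTY_TO_COPY_OPTION = {
    "Field Delimiter": "FIELD_DELIMITER",
    "Field Optionally Enclosed By": "FIELD_OPTIONALLY_ENCLOSED_BY",
    "Ignore Header Rows": "SKIP_HEADER",
    "Date Format": "DATE_FORMAT",
    "Time Format": "TIME_FORMAT",
    "Time Stamp Format": "TIMESTAMP_FORMAT",
    "Escape Unenclosed": "ESCAPE_UNENCLOSED_FIELD",
    "Escape": "ESCAPE",
    "Null As": "NULL_IF",
    "Trim Blanks": "TRIM_SPACE",
    "Error on Column Count": "ERROR_ON_COLUMN_COUNT_MISMATCH",
    "Empty As Null": "EMPTY_FIELD_AS_NULL",
}
```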
 

Example

In the following example, the Azure Blob Load Wizard is used to load the contents of a small file into a table. First, we provide the wizard with the path to the file; since the file is not compressed, we leave the Compression setting as "None". After taking a sample of the data, the large panel displays raw data from the file (up to the number of lines specified by the Row Limit). We can see that this dataset is a list of US states, cities, their airports, and other relevant data, with fields delimited by a comma (,).

 

On page 2 of the setup wizard, we find that the wizard has guessed our table's configuration correctly: the lower-left panel shows that the fields are delimited by a comma, and the lower-right panel displays our column data.

 

And finally, once we click the Test button on the last page of the wizard, our table is generated for us to review. Since everything looks to be in order, we can click the Finish button.

 

From this point, the user returns to the job interface, where two linked components have been added to the job canvas: Create/Replace Table and Azure Blob Storage Load. Each component is parameterised by the Azure Blob Load Wizard, and the pair can be run as a job by linking them to a Start component.
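
As a rough mental model, that generated pair behaves like a table definition followed by a Snowflake COPY from the blob. The sketch below is an assumption for illustration only; the table name, column names, file format options, and storage URL are hypothetical placeholders, not the components' literal output.

```python
# Hypothetical sketch of what the two generated components roughly amount to.
# Table name, column names, and the storage URL are illustrative placeholders.
CREATE_TABLE_SQL = """
CREATE OR REPLACE TABLE us_airports (
    state   VARCHAR,
    city    VARCHAR,
    airport VARCHAR
)
"""

COPY_SQL = """
COPY INTO us_airports
FROM 'azure://examplestorageaccount.blob.core.windows.net/example-container/us_airports.csv'
FILE_FORMAT = (
    TYPE = CSV
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
"""
```

In Matillion ETL itself, the equivalent statements are generated and executed by the two components when the job is run from a Start component, so nothing needs to be written by hand.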