S3 Load

This article is specific to the following platforms: Snowflake, Redshift, and Delta Lake.

S3 Load Component

The S3 Load component lets users load data into an existing table from objects stored in Amazon Simple Storage Service (Amazon S3).

The S3 Load component requires working AWS Credentials, with Read access to the bucket containing the source data file(s). This is achieved by attaching an IAM role to the instance when launching Matillion ETL. However, it can also be managed manually by editing a Matillion ETL environment.

Furthermore, Matillion ETL requires a policy that contains the action s3:ListBucket, such as the policy provided in Manage Credentials.
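As a rough illustration of the permissions described above, the following Python sketch builds a minimal IAM policy document that allows listing a bucket and reading its objects. It is not the policy provided in Manage Credentials. The bucket name `example-source-bucket` and the function name `build_read_policy` are hypothetical, and the sketch assumes the standard IAM actions `s3:ListBucket` and `s3:GetObject` are sufficient for a read-only load.

```python
import json


def build_read_policy(bucket: str) -> dict:
    """Return a minimal IAM policy document granting list and read access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "example-source-bucket" is a placeholder, not a real bucket.
policy = build_read_policy("example-source-bucket")
print(json.dumps(policy, indent=2))
```

Note that the list permission targets the bucket ARN while the read permission targets the objects under it. Mixing the two resource forms is a common cause of access errors.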

To access an S3 bucket from a different AWS account, the following is required:

Set up cross-account access via AWS roles.
The user must type in the bucket they want to access or use a variable to load/unload to those structures.

Matillion ETL for Redshift: When you select a target table that uses the Redshift SUPER data type, the S3 Load component will only offer the following data file types: JSON, ORC, and PARQUET. However, if you include columns of data types other than SUPER in the Load Columns property, all data file types will be available for selection.

To help with S3 Load, the S3 Load Generator component can be used. This is a wizard that lets you load and view files on the fly, altering load component properties and observing their effects without the need for a separate Transformation job. For a complete description of the wizard, read the S3 Load Generator documentation.

Properties

Snowflake Properties
Property	Setting	Description
Name	String	A human-readable name for the component.
Stage	Select	Select a staging area for the data. Staging areas can be created through Snowflake using the CREATE STAGE command. Internal stages can be set up this way to store staged data within Snowflake. Selecting [Custom] will avail the user of properties to specify a custom staging area on S3. Users can add a fully qualified stage by typing the stage name. This should follow the format `databaseName.schemaName.stageName`
Authentication	Select	Select the authentication method. Users can choose either: Credentials: Uses AWS security credentials. Read Manage Credentials to learn more. Storage Integration: Uses a Snowflake storage integration. A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of permitted or blocked storage locations (Amazon S3, Google Cloud Storage, or Microsoft Azure). More information can be found at CREATE STORAGE INTEGRATION.
Credentials	Select	Select your AWS credentials. The special value, [Environment Default], uses the set of credentials specified in your Matillion ETL environment—this is the default value. Click Manage to edit or create new credentials in Manage Credentials.
Storage Integration	Select	Select the storage integration. Storage integrations are required to permit Snowflake to read data from and write to a cloud storage location. Integrations must be set up in advance of selecting them in Matillion ETL. To learn more about setting up a storage integration, read Storage Integration Setup Guide. Note: Storage integrations can be configured to support Amazon S3, Google Cloud Storage, or Microsoft Azure cloud storage regardless of the cloud provider that hosts your Snowflake account.
S3 Object Prefix	Filepath \| Select	Specify the URL of the S3 bucket to load files from. The URL follows the format `s3://bucket/path`. Note: The "path" parameter in the URL is the subfolder and should be included. When a user enters a forward slash character / after a folder name, a validation of the file path is triggered. This works in the same manner as the Go button.
Pattern	String	A string that will partially match all file paths and names that are to be included in the load. Defaults to .*, indicating all files within the S3 Object Prefix. This property is a pattern on the complete path of the file, and is not just relative to the directory configured in the S3 Object Prefix property. For more information, see File patterns with Snowflake. Note: The subfolder containing the object to load must be included here.
Encryption	Select	Select how the files are encrypted inside the S3 Bucket. This property is available when using an Existing Amazon S3 Location for Staging. Client Side Encryption: data is encrypted with client-side encryption. None: no encryption. SSE Encryption: Encrypt the data according to a key stored on KMS. Read AWS Key Management Service (AWS KMS) to learn more. S3 Encryption: encrypt the data according to a key stored on an S3 bucket.
Warehouse	Select	Choose a Snowflake warehouse that will run the load.
Database	Select	Select a database. A database is a logical grouping of schemas. Each database belongs to a single Snowflake account.
Schema	Select	Select the schema. A schema is a logical grouping of database "objects" (tables, views, etc.). Each schema belongs to a single database. The special value, [Environment Default], will use the schema defined in the environment. For more information on using multiple schemas, see this article.
Target Table	Select	Select an existing table to load data into. The tables available for selection depend on the chosen schema.
Load Columns	Column select buttons	Select which columns to include in the load.
Format	Select	Select a pre-made file format that will automatically set many of the S3 Load component properties. These formats can be created through the Create File Format component. Users can add a fully qualified format by typing the format name. This should read as `databaseName.schemaName.formatName`
File Type	Select	Select the type of data to load. Available data types are: AVRO, CSV, JSON, ORC, PARQUET, and XML. Some file types may require additional formatting—this is explained in the Snowflake documentation. Component properties will change to reflect the selected file type.
Compression	Select	Select the compression method if you wish to compress your data. If you do not wish to compress at all, select NONE. The default setting is AUTO.
Record Delimiter	String	(CSV only) Input a delimiter for records. This can be one or more single-byte or multibyte characters that separate records in an input file. Notes: Accepted values include: leaving the field empty; a newline character \n or its hex equivalent 0x0a; a carriage return \r or its hex equivalent 0x0d. Also accepts a value of NONE. If you set Skip Header to a value such as 1, then you should use a record delimiter that includes a line feed or carriage return, such as \n or \r. Otherwise, your entire file will be interpreted as the header row, and no data will be loaded. The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes. Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field. The default (if the field is left blank) is a newline character.
Field Delimiter	String	(CSV only) Input a delimiter for fields. This can be one or more single-byte or multibyte characters that separate fields in an input file. Notes: Accepted characters include common escape sequences, octal values (prefixed by \\), or hex values (prefixed by 0x). Also accepts a value of NONE. This delimiter is limited to a maximum of 20 characters. While multi-character delimiters are supported, the field delimiter cannot be a substring of the record delimiter, and vice versa. For example, if the field delimiter is "aa", the record delimiter cannot be "aabb". The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes. Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field. The default setting is a comma: ,
Skip Header	Integer	(CSV only) Specify the number of rows to skip. The default is 0. If Skip Header is used, the value of the record delimiter will not be used to determine where the header line is. Instead, the specified number of CRLF will be skipped. For example, if the value of Skip Header = 1, then Matillion ETL will skip to the first CRLF that it finds. If you have set the Field Delimiter property to be a single character without a CRLF, then Matillion ETL skips to the end of the file (treating the entire file as a header).
Skip Blank Lines	Boolean	(CSV only) When "True", ignores blank lines that only contain a line feed in a data file and does not try to load them. Default setting is "False".
Date Format	String	(CSV only) Define the format of date values in the data files to be loaded. If a value is not specified or is AUTO, the value for the DATE_INPUT_FORMAT session parameter is used. The default setting is AUTO.
Time Format	String	(CSV only) Define the format of time values in the data files to be loaded. If a value is not specified or is AUTO, the value for the TIME_INPUT_FORMAT session parameter is used. The default setting is AUTO.
Timestamp Format	String	(CSV only) Define the format of timestamp values in the data files to be loaded. If a value is not specified or is AUTO, the value for the TIMESTAMP_INPUT_FORMAT session parameter is used.
Escape	String	(CSV only) Specify a single character to be used as the escape character for field values that are enclosed. Default is NONE.
Escape Unenclosed Field	String	(CSV only) Specify a single character to be used as the escape character for unenclosed field values only. Default is backslash (\\). If you have set a value in the property Field Optionally Enclosed, all fields will become enclosed, rendering the Escape Unenclosed Field property redundant, in which case it will be ignored.
Trim Space	Boolean	(CSV only) When "True", removes whitespace from fields. Default setting is "False".
Field Optionally Enclosed	String	(CSV only) Specify a character used to enclose strings. The value can be NONE, single quote character ('), or double quote character ("). To use the single quote character, use the octal or hex representation (0x27) or the double single-quoted escape (''). Default is NONE. Note: When a field contains one of these characters, escape the field using the same character. For example, to escape a string like this: 1 "2" 3, use double quotation to escape, like this: 1 ""2"" 3.
Null If	String	Specify one or more strings (one string per row of the table) to convert to NULL values. When one of these strings is encountered in the file, it is replaced with an SQL NULL value for that field in the loaded table. Click + to add a string.
Error On Column Count Mismatch	Boolean	(CSV only) When "True", generates an error if the number of delimited columns in an input file does not match the number of columns in the corresponding table. When "False", an error is not generated and the load continues. If the file is successfully loaded in this case: Where the input file contains records with more fields than columns in the table, the matching fields are loaded in order of occurrence in the file, and the remaining fields are not loaded. Where the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values. In Matillion ETL, the default setting is False.
Empty Field As Null	Boolean	(CSV only) When "True", inserts NULL values for empty fields in an input file. This is the default setting.
Replace Invalid Characters	Boolean	When "True", Snowflake replaces invalid UTF-8 characters with the Unicode replacement character. When "False", the load operation produces an error when invalid UTF-8 character encoding is detected. The default setting is "False".
Encoding Type	Select	(CSV only) Select the string that specifies the character set of the source data when loading data into a table. Please refer to the Snowflake documentation for more information.
Enable Octal	Boolean	(JSON only) When "True", enables the parsing of octal values. Default setting is "False".
Allow Duplicates	Boolean	(JSON only) When "True", allows duplicate object field names. Default setting is "False".
Strip Outer Array	Boolean	(JSON only) When "True", instructs the JSON parser to remove outer brackets. Default setting is "False".
Strip Null Values	Boolean	(JSON only) When "True", instructs the JSON parser to remove any object fields or array elements containing NULL values. Default setting is "False".
Ignore UTF8 Errors	Boolean	(JSON, XML only) When "True", replaces any invalid UTF-8 sequences with the Unicode replacement character. When "False", UTF-8 errors will produce an error in the job run. Default setting is "False".
Preserve Space	Boolean	(XML only) When "True", the XML parser preserves leading and trailing spaces in element content. Default setting is "False".
Strip Outer Element	Boolean	(XML only) When "True", the XML parser strips out any outer XML elements, exposing second-level elements as separate documents. Default setting is "False".
Disable Snowflake Data	Boolean	(XML only) When "True", the XML parser will not recognise Snowflake semi-structured data tags. Default setting is "False".
Disable Auto Convert	Boolean	(XML only) When "True", the XML parser will disable automatic conversion of numeric and Boolean values from text to native representations. Default setting is "False".
On Error	Select	Decide how to proceed upon an error. 1. Abort Statement: Aborts the load if any error is encountered. 2. Continue: Continue loading the file. 3. Skip File: Skip file if any errors are encountered in the file. 4. Skip File When n Errors: Skip file when the number of errors in the file is equal to or greater than the specified number in the next property, n. 5. Skip File When n% Errors: Skip file when the percentage of errors in the file exceeds the specified percentage of n. Default setting is Abort Statement.
n	Integer	Specify the number of errors or the percentage of errors required for Matillion ETL to skip the file. Note: This parameter only accepts integer characters. % is not accepted. Specify percentages as a number only.
Size Limit (B)	Integer	Specify the maximum size, in bytes, of data to be loaded for a given COPY statement. If the maximum is exceeded, the COPY operation discontinues loading files. For more information, please refer to the Snowflake documentation.
Purge Files	Boolean	When "True", purges data files after the data is successfully loaded. Default setting is "False".
Match By Column Name	Select	Specify whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data. Case Insensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-insensitive. Case Sensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-sensitive. None: The COPY operation loads the semi-structured data into a variant column or, if a query is included in the COPY statement, transforms the data.
Truncate Columns	Boolean	When "True", strings are automatically truncated to the target column length. When "False", the COPY statement produces an error if a loaded string exceeds the target column length. Default setting is "False".
Force Load	Boolean	When "True", loads all files, regardless of whether they have been loaded previously and haven't changed since they were loaded. Default setting is "False". Note: When set to "True", this option reloads files and can lead to duplicated data in a table.
Metadata Fields	Column Selector	Snowflake metadata columns available to include in the load. Snowflake automatically generates metadata for files in internal stages (i.e. Snowflake) and external stages (Amazon S3, Google Cloud Storage, or Microsoft Azure). This metadata is "stored" in virtual columns. These metadata columns are added to the staged data, but are only added to the table when included in a query of the table. For more information, read Querying Metadata for Staged Files. This property is only available when an external stage is selected. To manage stages, click the Environments panel in the bottom-left, then right-click a Matillion ETL environment, and click Manage Stages. To learn more, read Manage Stages.
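
To make the relationship between these properties and Snowflake's COPY INTO statement more concrete, here is a minimal Python sketch that assembles a CSV load statement from a handful of property values. It is an illustration only, not the SQL that Matillion ETL actually generates. The function name `build_copy_into`, the table name, and the bucket are hypothetical, and authentication (credentials or a storage integration) is deliberately left out.

```python
def quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_copy_into(table, s3_url, pattern=".*", field_delimiter=",",
                    skip_header=0, on_error="ABORT_STATEMENT",
                    purge=False, force=False):
    """Assemble a Snowflake COPY INTO statement for a CSV load (sketch only)."""
    file_format = (
        f"TYPE = CSV FIELD_DELIMITER = {quote(field_delimiter)} "
        f"SKIP_HEADER = {skip_header}"
    )
    return "\n".join([
        f"COPY INTO {table}",                        # Target Table
        f"FROM {quote(s3_url)}",                     # S3 Object Prefix
        f"PATTERN = {quote(pattern)}",               # Pattern
        f"FILE_FORMAT = ({file_format})",            # File Type and CSV options
        f"ON_ERROR = {on_error}",                    # On Error
        f"PURGE = {'TRUE' if purge else 'FALSE'}",   # Purge Files
        f"FORCE = {'TRUE' if force else 'FALSE'}",   # Force Load
    ])


# Hypothetical table and bucket names, for illustration only.
sql = build_copy_into(
    "mydb.public.orders",
    "s3://testbucket/testDirectory/",
    pattern=".*.csv.gz",
    skip_header=1,
)
print(sql)
```

Leaving Force Load at its default of "False" corresponds to `FORCE = FALSE` here, which is what prevents previously loaded, unchanged files from being loaded again.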

Redshift Properties
Property	Setting	Description
Name	String	A human-readable name for the component.
Schema	Select	Select the table schema. The special value, [Environment Default], will use the schema defined in the environment. For more information on using multiple schemas, see this article.
Target Table Name	Select	Select the target table from the dropdown menu. The selected target table is where data from the S3 object will be loaded.
Load Columns	Column Selector	Select which of the target table's columns should be loaded.
S3 URL Location	Filepath \| Select	Specify the URL of the S3 bucket to load files from. The URL follows the format `s3://bucket-name/location` The "location" parameter in the URL is optional. When a user enters a forward slash character / after a folder name, a validation of the file path is triggered. This works in the same manner as the Go button.
S3 Object Prefix	String	All files that begin with this prefix will be included in the load into the target table.
IAM Role ARN	String	Supply the value of a role Amazon Resource Name (ARN) that is already attached to your Redshift cluster, and that has the necessary permissions to access S3. This setting is optional, since without this style of setup, the credentials of the environment (instance credentials or manually entered access keys) will be used. See the Redshift documentation for more information about using a role ARN with Redshift.
Data File Type	Select	The type of expected data to load. Some may require additional formatting, explained in the Amazon Documentation. Available options are: Avro, CSV, Delimited, Fixed Width, JSON, ORC, PARQUET. Component properties will change to reflect the choice made here and give options based on the specific data file type.
Avro Layout	String	(Avro only) Defaults to "auto", which will work for the majority of Avro files if the fields match the table field names. Optionally, users can specify the URL to a JSONPaths file to map the data elements in the Avro source data to the columns in the target table.
Fixed Width Spec	String	(Fixed Width only) Loads the data from a file where each column width is a fixed length, rather than columns being separated by a delimiter. For more information, see the AWS documentation.
JSON Layout	String	(JSON only) Defaults to "auto", which will work for the majority of JSON files if the fields match the table field names. More information about JSON file types can be found in the AWS documentation.
Delimiter	String	(CSV only) Specify a delimiting character to separate columns. The default character is a comma: , A [TAB] character can be specified using "\t".
CSV Quoter	String	(CSV only) Specify the character to be used as the quote character when using the CSV file type.
Explicit IDs	Select	Select whether to load data from the S3 objects into an IDENTITY column. See the Redshift documentation for more information.
S3 Bucket Region	Select	Select the Amazon S3 region that hosts the selected S3 bucket. This setting is optional and can be left as "None" if the bucket is in the same region as your Redshift cluster.
Compression Method	Select	Specify whether the input is compressed in any of the following formats: BZip2, gzip, LZop, Zstd, or not at all (None).
Encoding	Select	Select the encoding format the data is in. The default for this setting is UTF-8.
Replace Invalid Characters	Single Character	Specify a single character that will replace any invalid unicode characters in the data. The default is: ?.
Remove Quotes	Select	(Delimited, Fixed Width only) When "Yes", removes surrounding quotation marks from strings in the incoming data. All characters within the quotation marks, including delimiters, are retained. If a string has a beginning single or double quotation mark but no corresponding ending mark, the COPY command fails to load that row and returns an error. The default setting is "No". For more information, see the AWS documentation.
Maximum Errors	Integer	The maximum number of individual parsing errors permitted before the load will fail. Values up to this will be substituted as NULL values. In Matillion ETL, the default is set at 0. The Amazon default is 1000.
Date Format	String	Defaults to 'auto'. Users can specify their preferred date format manually if they wish. For more information on date formats, see the AWS documentation.
Time Format	String	Defaults to 'auto'. Users can specify their preferred time format manually if they wish. For more information on time formats, see the AWS documentation.
Ignore Header Rows	Integer	Specify the number of rows at the top of the file to ignore. The default is 0 (no rows ignored).
Accept Any Date	Select	Yes: invalid dates such as '45-65-2020' are not considered an error, but will be loaded as NULL values. This is the default setting. No: invalid dates will return an error.
Ignore Blank Lines	Select	(CSV only) When "Yes", ignores blank lines that only contain a line feed in a data file and does not try to load them. This is the default setting.
Truncate Columns	Select	Yes: any instance of data in the input file that is too long to fit into the specified target column width will be truncated to fit, instead of causing an error. No: any data in the input file that is too long to fit into the specified target column width will cause an error. This is the default setting.
Fill Record	Select	When "Yes", allows data files to be loaded when contiguous columns are missing at the end of some records. The missing columns are filled with either zero-length strings or NULLs, as appropriate for the data types of the columns in question. This is the default setting.
Trim Blanks	Select	Yes: Removes the trailing white space characters from a VARCHAR string. This property only applies to columns with a VARCHAR data type. No: does not remove trailing white space characters from the input data.
NULL As	String	Loads fields that match the specified NULL string. The default is \N, entered with an additional \ at the start to escape it (that is, \\N). Case-sensitive. For more information, please see the AWS documentation.
Empty As Null	Select	When "Yes", empty columns in the input file will become NULL values. This is the default setting.
Blanks As Null	Select	When "Yes", loads blank columns, which consist of only whitespace characters, as NULL. This is the default setting. This option applies only to CHAR and VARCHAR columns. Blank fields for other data types, such as INT, are always loaded with NULL. For example, a string that contains three space characters in succession (and no other characters) is loaded as a NULL. The default behavior, without this option, is to load the space characters as is.
Comp Update	Select	When "On", compression encodings are automatically applied during a COPY command. This is the default setting. This is usually a good idea to optimise the compression used when storing the data.
Stat Update	Select	When "On", governs automatic computation and refreshing of optimiser statistics at the end of a successful COPY command. This is the default setting.
Escape	Select	(Delimited only) When "Yes", the backslash character \ in input data is treated as an escape character. The character that immediately follows the backslash character is loaded into the table as part of the current column value, even if it is a character that normally serves a special purpose. For example, you can use this parameter to escape the delimiter character, a quotation mark, an embedded newline character, or the escape character itself when any of these characters is a legitimate part of a column value. The default setting is "No".
Round Decimals	Select	When "Yes", any decimals are rounded to fit into the column in any instance where the number of decimal places in the input is larger than defined for the target column. This is the default setting.
Manifest	Select	When "Yes", the given object prefix is that of a manifest file. The default setting is "No". For more information, see the AWS documentation.
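
Several of the Redshift properties above correspond to parameters of the Redshift COPY command. The Python sketch below shows one plausible mapping for a CSV load. It is a hedged illustration, not the statement Matillion ETL emits. The function name `build_redshift_copy`, the table name, the bucket, and the role ARN are all hypothetical placeholders.

```python
def build_redshift_copy(table, s3_url, iam_role_arn, delimiter=",",
                        quote_char='"', ignore_header=0, max_errors=0,
                        compression=None, region=None,
                        truncate_columns=False):
    """Assemble a Redshift COPY statement for a CSV load (sketch only)."""
    parts = [
        f"COPY {table}",                  # Target Table Name
        f"FROM '{s3_url}'",               # S3 URL Location plus S3 Object Prefix
        f"IAM_ROLE '{iam_role_arn}'",     # IAM Role ARN
        "FORMAT AS CSV",                  # Data File Type
        f"DELIMITER '{delimiter}'",       # Delimiter
        f"QUOTE '{quote_char}'",          # CSV Quoter
        f"IGNOREHEADER {ignore_header}",  # Ignore Header Rows
        f"MAXERROR {max_errors}",         # Maximum Errors
    ]
    if compression:
        parts.append(compression.upper())      # Compression Method, e.g. GZIP
    if region:
        parts.append(f"REGION '{region}'")     # S3 Bucket Region
    if truncate_columns:
        parts.append("TRUNCATECOLUMNS")        # Truncate Columns
    return "\n".join(parts) + ";"


# All identifiers below are placeholders, not real resources.
statement = build_redshift_copy(
    "public.orders",
    "s3://example-bucket/orders/2024",
    "arn:aws:iam::123456789012:role/example-redshift-role",
    ignore_header=1,
    compression="gzip",
    truncate_columns=True,
)
print(statement)
```

Optional parameters are appended only when set, which mirrors how properties such as S3 Bucket Region can be left as "None" when the bucket is in the same region as the cluster.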

Delta Lake Properties
Property	Setting	Description
Name	String	A human-readable name for the component.
Location	S3 Bucket	An S3 bucket from which to load data. A template URL is provided: s3://<bucket>/<path>. Users can also click through the nested folder system. When a user enters a forward slash character / after a folder name, a validation of the file path is triggered. This works in the same manner as the Go button.
Pattern	Regular Expression	A string that will partially match all file paths and names that are to be included in the load. Defaults to .*, indicating all files within the Location.
Catalog	Select	Select a Databricks Unity Catalog. The special value, [Environment Default], will use the catalog specified in the Matillion ETL environment setup. Selecting a catalog will determine which databases are available in the next parameter.
Database	Select	Select the Delta Lake database. The special value, [Environment Default], will use the database specified in the Matillion ETL environment setup.
Target Table	Select	Select an existing table to load data into. The chosen Delta Lake database determines the available tables.
Load Columns	Column Selector	Select any columns to include in the data load. Move columns from the left list to the right list to include them.
Recursive File Lookup	Boolean	When enabled, disables partition inference. To control which files are loaded, use the "pattern" property instead.
File Type	Select	Select the file type. Available file types are AVRO, CSV, JSON, and PARQUET.
Skip Header	Boolean	(CSV only) When True, uses the first line as names of columns. Default is False.
Field Delimiter	Delimiting Character	(CSV only) Specify a delimiter to separate columns. The default is a comma: , A TAB character can be specified as "\t".
Date Format	String	(CSV & JSON only) Manually set a date format. If none is set, the default is `yyyy-MM-dd`.
Timestamp Format	String	(CSV & JSON only) Manually set a timestamp format. If none is set, the default is `yyyy-MM-dd'T'HH:mm:ss.[SSS][XXX]`.
Encoding Type	String	(CSV & JSON only) Decodes the CSV files via the given encoding type. If none is set, the default is `UTF-8`.
Ignore Leading White Space	Boolean	(CSV only) When True, skips any leading whitespaces. Default is False.
Ignore Trailing White Space	Boolean	(CSV only) When True, skips any trailing whitespaces. Default is False.
Infer Schema	Boolean	(CSV only) When True, infers the input schema automatically from the data. Default is False.
Multi Line	Boolean	When True, parses records that may span multiple lines. Default is False.
Null Value	String	(CSV only) Sets the string representation of a null value. The default value is an empty string.
Empty Value	String	(CSV only) Sets the string representation of an empty value. The default value is an empty string.
Primitive as String	Boolean	(JSON only) When True, primitive data types are inferred as strings. Default is False.
Prefers Decimal	Boolean	(JSON only) When True, infers all floating-point values as a decimal type. If the values don't fit in decimal, then they're inferred as doubles. Default is False.
Allow Comments	Boolean	(JSON only) When True, allows Java/C++-style comments in JSON records. Default is False.
Allow Unquoted Field Names	Boolean	(JSON only) When True, allows unquoted JSON field names. Default is False.
Allow Single Quotes	Boolean	(JSON only) When True, allows single quotes in addition to double quotes. Default is True.
Allow Numeric Leading Zeros	Boolean	(JSON only) When True, allows leading zeros in numbers, e.g. `00019`. Default is False.
Allow Backslash Escaping Any Character	Boolean	(JSON only) When True, allows any character to be quoted using the backslash quoting mechanism \. Default is False.
Allow Unquoted Control Chars	Boolean	(JSON only) When True, allows JSON strings to include unquoted control characters (ASCII characters where their value is less than 32, including Tab and line feed characters). Default is False.
Drop Field If All Null	Boolean	(JSON only) When True, ignores column of all null values or empty arrays/structs during the schema inference. Default is False.
Merge Schema	Boolean	(AVRO, PARQUET only) When True, merges schemata from all Parquet part-files. Default is False.
Path Glob Filter	String	An optional glob pattern, used to only include files with paths matching the pattern.
Force Load	Boolean	When True, idempotency is disabled and files are loaded regardless of whether they've been loaded before. Default is False.
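
Many of the Delta Lake properties above resemble Apache Spark reader options. The following Python sketch assumes a one-to-one correspondence between property labels and Spark option names. That correspondence is an assumption made for illustration, not a documented Matillion ETL mapping, and the names `SPARK_OPTION_NAMES` and `to_reader_options` are hypothetical. The sketch converts a set of property values into the string-valued options dictionary that a Spark reader would accept.

```python
# Assumed mapping from component property labels to Spark option names.
SPARK_OPTION_NAMES = {
    "Recursive File Lookup": "recursiveFileLookup",
    "Skip Header": "header",
    "Field Delimiter": "sep",
    "Date Format": "dateFormat",
    "Timestamp Format": "timestampFormat",
    "Encoding Type": "encoding",
    "Ignore Leading White Space": "ignoreLeadingWhiteSpace",
    "Ignore Trailing White Space": "ignoreTrailingWhiteSpace",
    "Infer Schema": "inferSchema",
    "Multi Line": "multiLine",
    "Null Value": "nullValue",
    "Empty Value": "emptyValue",
    "Merge Schema": "mergeSchema",
    "Path Glob Filter": "pathGlobFilter",
}


def to_reader_options(properties: dict) -> dict:
    """Translate property values into string-valued Spark reader options."""
    options = {}
    for label, value in properties.items():
        name = SPARK_OPTION_NAMES[label]
        # Render booleans as lowercase "true"/"false"; everything else as text.
        options[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return options


options = to_reader_options({
    "Skip Header": True,
    "Field Delimiter": "\t",
    "Null Value": "NULL",
    "Date Format": "yyyy-MM-dd",
})
print(options)
```

A dictionary like this could then be passed to a Spark reader, for example with `spark.read.options(**options)`, in an environment where a Spark session is available.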

Matillion ETL for Snowflake: If using an external stage, consider the following:

All columns to be included in the load must be included in the Load Columns property, in the correct order and associated with the correct data type. The Create Table component can be used to specify the metadata of the columns.
The Metadata Fields property will insert metadata columns at the end of the existing table and overwrite the same number of columns unless additional columns are added via the Create Table component.
The table's data and structure are only accessed at runtime, meaning that additional columns must be added (and data types set) before the job is run. To ensure these columns are set up beforehand, users can load their data with Matillion ETL's S3 Load Generator.

File patterns with Snowflake

In Snowflake, the Pattern parameter in the COPY INTO syntax is a pattern on the complete path of the file and is not just relative to the directory configured in the S3 Object Prefix parameter.

The table below provides an example of S3 Object Prefix and Pattern behaviours, including success and failure states.

S3 Object Prefix	Pattern	Outcome	Comments
s3://testbucket/	testDirectory/alphabet_0_0_0.csv.gz	Success	This is the format that the S3 Load Generator will generate.
s3://testbucket/testDirectory/	testDirectory/alphabet_0_0_0.csv.gz	Success	Loads the file successfully because the pattern is matching the full path.
s3://testbucket/testDirectory/	.*.csv.gz	Success	Would load all files ending in .csv.gz in the testDirectory directory.
s3://testbucket/testDirectory/	alphabet_0_0_0.csv.gz	Failure	Does not load the file because the pattern does not match.
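
The outcomes in the table can be reproduced with a small Python model. This is a sketch that assumes the pattern must match the whole object path after the bucket name, which is consistent with the four rows above. It is not Snowflake's actual matching code, and the function name `would_load` is hypothetical.

```python
import re
from urllib.parse import urlparse


def would_load(s3_object_prefix: str, pattern: str, object_url: str) -> bool:
    """Model whether an object is selected by a prefix and a full-path pattern."""
    prefix = urlparse(s3_object_prefix)
    obj = urlparse(object_url)
    if prefix.netloc != obj.netloc:
        return False  # Different bucket.
    prefix_path = prefix.path.lstrip("/")
    key = obj.path.lstrip("/")
    if not key.startswith(prefix_path):
        return False  # Object is outside the configured prefix.
    # The pattern is tested against the complete path, not the file name alone.
    return re.fullmatch(pattern, key) is not None


target = "s3://testbucket/testDirectory/alphabet_0_0_0.csv.gz"
rows = [
    ("s3://testbucket/", "testDirectory/alphabet_0_0_0.csv.gz"),
    ("s3://testbucket/testDirectory/", "testDirectory/alphabet_0_0_0.csv.gz"),
    ("s3://testbucket/testDirectory/", ".*.csv.gz"),
    ("s3://testbucket/testDirectory/", "alphabet_0_0_0.csv.gz"),
]
outcomes = [would_load(prefix, pattern, target) for prefix, pattern in rows]
for (prefix, pattern), loaded in zip(rows, outcomes):
    print(prefix, pattern, "Success" if loaded else "Failure")
```

The last row fails in this model for the same reason given in the table: the bare file name does not match the complete path, which begins with testDirectory/.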

What's Next

S3 Manifest Builder