Executing Python Scripts Outside of Matillion

Overview

Matillion has the functionality to run Python scripts as part of a Matillion job run. This is used for setting variable values, running quick queries on the target database using database cursors, or connecting to other AWS services. However, there are some circumstances where the amount of processing required by the Python script means it can be more sensible to run the Python script elsewhere to minimise the impact on the rest of the Matillion server. This can be because the Python script requires a large amount of processing power, which may affect the resources on the EC2 instance, which is used to host the Matillion application or because of corporate requirements around security or execution isolation.

Examples of where this is the case include processing large files such as running XML to JSON conversion, or zipping/unzipping large files, or running API Calls that require or return a lot of data. In this blog we will look at a simple Python script that is used to convert an XML file to a JSON file so it can be loaded into the target database. This script, while simple, can require a lot of memory and processing power to convert large files.

Below are some options for running Python scripts outside of the Matillion instance using the Bash Script component.


SSH

It’s possible to run the script remotely on another EC2 instance using an SSH command to connect to the instance. If required, the AWS CLI can even be used to start and stop the instance on demand. The SSH command is used to connect to the instance using a key stored on the Matillion Server. The script is run using the “python” command. In this example, the Python script is passed through to the remote server using a “heredoc”. Below is an example script:

 
ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON  

import xmltodict, json, boto3

s3 = boto3.resource('s3')
s3.meta.client.download_file('mtln-laura-test', 'example.xml', '/tmp/example.xml’)

with open ("/tmp/example.xml", "r") as myfile:
    xml_data=myfile.read()

o = xmltodict.parse(xml_data)
#print json.dumps(o)
open('/tmp/example.json', 'w').write(json.dumps(o))
s3.meta.client.upload_file('/tmp/example.json, 'mtln-laura-test', 'example.json')
print 'file converted successfully'

END_OF_PYTHON

The result of the Python script will be printed in the Bash window:

 


Advantages:

  • Control over the memory allocated on the remote server to the Python script
  • All libraries can be installed and maintained on the remote server
  • Version of Python can be changed as required


Disadvantages:

  • Cost and maintenance implication of a new server
  • Cannot reuse Matillion connection into Redshift to run commands on Redshift (Database Cursor) although a new connection could be used
  • Variables are not available in the global namespace, nor can they be updated i.e. it’s not possible to reuse the Matillion “context” object provided. However, referencing variables using the syntax ${this_syntax} will resolve to the variable value before the Bash script executes


Lambda

AWS Lambda allows you to run code without worrying about servers. You simply pay for the compute time you use. Python 2.7, Python 3.6 and Python 3.7 are amongst the languages supported. To run a Python script in AWS Lambda, a Lambda function must first be created and then run when required. Functions can be created and run through the AWS web console. A simple example function is shown below:


 

The AWS CLI supports creating a Python function in Lambda and then running it. These CLI commands can be run from a Bash Script component within Matillion. The example script below assumes the helloworld.py Python script exists on the Matillion server in the /tmp directory and walks through the steps required to use this to create a Lambda function from it and run it accordingly.

Please note the below will redefine the function over and over. If the code does not change, it can simply be run using the run statement:

 
#Lambda requires function to be run is called index.py so first rename python function
cp /tmp/xmltojson.py index.py

#zip function up
zip function.zip index.py

#delete function if already exists
aws lambda delete-function --function-name helloworld --region eu-west-1

#create function in Lambda
aws lambda create-function --function-name helloworld --zip-file fileb://~/function.zip --handler index.handler --runtime python2.7 --role arn:aws:iam::557249109363:role/lambda-cli-role --region eu-west-1

#run function

aws lambda invoke --function-name helloworld --log-type Tail /tmp/outputfile.txt --region eu-west-1

#delete again
aws lambda delete-function --function-name helloworld --region eu-west-1
 

Note the AWS Credentials for the Environment used to run this job requires the IAM policy AWSLambdaBasicExecutionRole.


Advantages:

  • No need to worry about servers
  • Just pay for compute resource used - there’s no chance of accidentally leaving a server on


Disadvantages:

  • Script can’t run for more than 15 mins
  • Requires access IAM policy AWSLambdaBasicExecutionRole
  • Installing dependencies must be done by having all dependencies locally and packaging them with the lambda function or using the console
  • Cannot reuse Matillion connection into Redshift to run commands on Redshift (Database Cursor)
  • Output from function is not displayed in the Task History, instead it has to be read from a separate file
  • Requires AWS Lambda which has a cost implication


Glue

AWS Glue can be used to create and run Python Shell jobs. These are Python scripts which are run as a shell script, rather than the original Glue offering of only running PySpark. Any script can be run, providing it is compatible with 2.7.

As with the Lambda function, first an AWS Glue ETL job must be created, as a Python Shell job, and then executed. The AWS Glue job is created by linking to a Python script in S3, a IAM role is granted to run the Python script under and any connections available connections, such as to Redshift are selected:

 

Again, the Glue Job can be created either via the console or the AWS CLI. Below are the steps to create and run the job via the AWS CLI from a Bash script component within Matillion. If the code doesn’t change, we recommend creating this job once and simply running the job from Matillion:

 
#script must exist on S3 for glue
aws s3 cp /tmp/xmltojson.py s3://mtln-laura-ireland/xmltojson.py --region eu-west-1
aws glue create-job --name python-job-cli --role jc-glue-role --command '{"Name" :  "LauraPythonTest", "ScriptLocation" : "s3://mtln-laura-ireland/xmltojson.py"}' --region eu-west-1

#output results to a variable to use in Cloudwatch command
JOBOUTPUT=$(aws glue start-job-run --job-name LauraPythonTest --region eu-west-1)

#get job id from json
JOBID=$(echo $JOBOUTPUT| cut -d '"' -f 4)
echo $JOBID

#get logs
sleep 30

aws logs get-log-events --log-group-name /aws-glue/python-jobs/output --log-stream-name $JOBID --region eu-west-1 > /tmp/output.txt
 

In AWS Glue, the output of a job run is written to Cloudwatch. If these are required in Matillion, the output of the command has to be written to a file on the Matillion server.

Note to run this script the Matillion environment requires the below policies:

  • AmazonGlueServiceRole
  • IAMReadOnlyAccess

 

Things to consider when using Glue:
  • Can specify a script timeout
  • In the glue script, parameters can be defined and referenced in the Python. When the script is run, values for the parameters can be provided. This can be done by using Matillion variables in the Bash script. This means Glue scripts can be re-usable scripts and avoid having to push a new script with each execution.
  • Must be Python 2.7
  • Limited libraries - but can upload your own
  • Python script needs to be on S3
  • Cannot reuse Matillion connection into Redshift to run commands on Redshift (Database Cursor)
  • Requires AWS Glue which has a cost implication


Example with Variables

It is often the case that there is a requirement that part of the Python script is made dynamic. This is achieved using variables in Matillion. For example in the SSH example discussed earlier, the S3 Bucket and file names were hard coded but it is unlikely that these filenames will always be consistent.

It is also unrealistic to update the Python script to simply update the file or the bucket names. Instead job variables in the Matillion job can set up and these can be referenced in the heredoc, which pushes the variable value to the remote server:

 
ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON

import xmltodict, json, boto3

s3 = boto3.resource('s3')
s3.meta.client.download_file('$s3Bucket', '$xmlFile', '/tmp/$xmlFile')

with open ("/tmp/$xmlFile", "r") as myfile:
   xml_data=myfile.read()

o = xmltodict.parse(xml_data)

#print json.dumps(o)
open('/tmp/$xmlFile.json', 'w').write(json.dumps(o))
s3.meta.client.upload_file('/tmp/$xmlFilejson', '$s3Bucket', '$xmlFile.json')
print 's3://$s3Bucket/$xmlFile.json'

END_OF_PYTHON

The script has also been edited to return the bucket and filename created.

The output from the script run can also be pushed to an output file, which can then be read by another Matillion component and passed into another variable value, if required.


Generating Python File

Some of the above options require that the Python to run is available as a file on the Matillion server. However the user may wish to write the file from within Matillion. Here we discuss options for doing this.


Python Script Component

The natural place to write the Python code is using the Python script component. Here you can take advantage of the colour coding of the code written and also use this approach to convert Python scripts already written into files to be run externally. The approach uses a Bash script component to export the job with the Python script using the Matillion API, parse out the relevant Python and write the results to a file in /tmp. It requires a number of inputs to make the API call and identify the correct script.

 

When the script has run it will either say it has successfully created the file or throw an error saying the API call was unsuccessful and why or throw an error saying the name of the Python script component was incorrect. Please note all variables are case sensitive.

A copy of the job is available for download here.


Bash Component

A different approach is to write the Python code directly in to a Bash Script component and use this to generate a file. This can be done with a heredoc. An example is in the screenshot:

The advantage of this approach is that it prevents the need for the API call and is also faster to create the file. It also means the users aren’t tempted to “Run” the code from within Matillion.


Conclusion

We recommend before developing Python scripts within Matillion that you carefully consider what the script will do and consider running Python scripts which require considerable resources remotely. For a large xml file, the Python script above can take up to 10 mins to run. Having this run on the Matillion server would affect the CPU and memory available to run the Matillion jobs and could also have an adverse effect for Matillion users.

Using the remote options discussed ensures the script can be initiated from Matillion but also run in isolation. The result of the script can easily be fed back to Matillion. There are, however, some cases where it is best to run the script directly within Matillion, for example any Jython code.