Executing Python Scripts Outside of Matillion

    Overview

    Matillion can run Python scripts as part of a job run. This is typically used for setting variable values, running quick queries on the target database using database cursors, or connecting to other AWS services. However, in some circumstances the amount of processing a Python script requires makes it more sensible to run the script elsewhere, minimising the impact on the rest of the Matillion server. This can be because the script needs a large amount of processing power, which would affect the resources of the EC2 instance hosting the Matillion application, or because of corporate requirements around security or execution isolation.

    Examples of where this is the case include processing large files, such as converting XML to JSON, zipping/unzipping large files, or running API calls that send or return a lot of data. In this blog we will look at a simple Python script that converts an XML file to a JSON file so it can be loaded into the target database. The script, while simple, can require a lot of memory and processing power to convert large files.
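    The article's script uses the third-party xmltodict library for the conversion itself. As a dependency-free illustration of the same XML-to-JSON idea, here is a rough stdlib-only sketch (the sample document and helper function are illustrative, not from the article):

    ```python
    import json
    import xml.etree.ElementTree as ET

    def to_dict(elem):
        # Leaf elements become their text; others become nested dicts
        children = list(elem)
        if not children:
            return elem.text
        return {child.tag: to_dict(child) for child in children}

    xml_data = "<order><id>42</id><status>shipped</status></order>"
    root = ET.fromstring(xml_data)
    print(json.dumps({root.tag: to_dict(root)}))
    ```

    xmltodict handles attributes, repeated elements and namespaces, which this sketch deliberately ignores.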

    Below are some options for running Python scripts outside of the Matillion instance using the Bash Script component.

    SSH

    It’s possible to run the script remotely on another EC2 instance by connecting over SSH. If required, the AWS CLI can even be used to start and stop the instance on demand. The SSH command connects to the instance using a key stored on the Matillion server, and the script is run with the python command. In this example, the Python script is passed to the remote server using a heredoc. Below is an example script:

    ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON

    import xmltodict, json, boto3

    s3 = boto3.resource('s3')
    s3.meta.client.download_file('mtln-laura-test', 'example.xml', '/tmp/example.xml')

    with open('/tmp/example.xml', 'r') as myfile:
        xml_data = myfile.read()

    o = xmltodict.parse(xml_data)
    open('/tmp/example.json', 'w').write(json.dumps(o))
    s3.meta.client.upload_file('/tmp/example.json', 'mtln-laura-test', 'example.json')
    print 'file converted successfully'

    END_OF_PYTHON

    The result of the Python script will be printed in the Bash window:

    Run Bash Script Component


    Advantages:

    • Control over the memory allocated on the remote server to the Python script.
    • All libraries can be installed and maintained on the remote server.
    • Version of Python can be changed as required.

    Disadvantages:

    • Cost and maintenance implication of a new server.
    • Cannot reuse the Matillion connection to Redshift to run commands (database cursor), although a new connection could be created.
    • Variables are not available in the global namespace, nor can they be updated; i.e. it’s not possible to reuse the Matillion "context" object. However, variables referenced using the syntax ${this_syntax} will resolve to their values before the Bash script executes.


    Lambda

    AWS Lambda allows you to run code without worrying about servers; you simply pay for the compute time you use. Python 2.7, Python 3.6 and Python 3.7 are amongst the languages supported. To run a Python script in AWS Lambda, a Lambda function must first be created and then invoked when required. Functions can be created and run through the AWS web console. A simple example function is shown below:

    Lambda - AWS Web Console


    The AWS CLI supports creating a Python function in Lambda and then running it. These CLI commands can be run from a Bash Script component within Matillion. The example script below assumes the xmltojson.py Python script exists on the Matillion server in the /tmp directory and walks through the steps required to create a Lambda function from it and run it.

    Please note the below will redefine the function on every run. If the code does not change, the existing function can simply be invoked:

    #Lambda requires the file containing the handler to be called index.py, so first copy the Python script
    cp /tmp/xmltojson.py index.py
    

    #zip function up
    zip function.zip index.py

    #delete function if already exists
    aws lambda delete-function --function-name helloworld --region eu-west-1

    #create function in Lambda
    aws lambda create-function --function-name helloworld --zip-file fileb://~/function.zip --handler index.handler --runtime python2.7 --role arn:aws:iam::557249109363:role/lambda-cli-role --region eu-west-1

    #run function

    aws lambda invoke --function-name helloworld --log-type Tail /tmp/outputfile.txt --region eu-west-1

    #delete again
    aws lambda delete-function --function-name helloworld --region eu-west-1
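    The --handler index.handler flag above tells Lambda to call a function named handler inside index.py. A minimal sketch of what that file must contain is shown below (the event payload and message are illustrative only; Lambda itself calls handler() with the invoke payload):

    ```python
    # index.py - minimal handler shape expected by "--handler index.handler"
    import json

    def handler(event, context):
        # event carries whatever JSON payload `aws lambda invoke` passes in
        name = event.get("name", "world")
        return {"statusCode": 200, "body": json.dumps({"message": "hello " + name})}

    # Local smoke test; inside Lambda this line is never needed
    print(handler({"name": "Matillion"}, None))
    ```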

    Lambda - Matillion


    Please Note

    The AWS credentials for the environment used to run this job require the IAM policy AWSLambdaBasicExecutionRole.

    Advantages:

    • No need to worry about servers.
    • Just pay for the compute resource used - there’s no chance of accidentally leaving a server on.

    Disadvantages:

    • Script can’t run for more than 15 minutes.
    • Requires the IAM policy AWSLambdaBasicExecutionRole.
    • Dependencies must be installed locally and packaged with the Lambda function, or added via the console.
    • Cannot reuse the Matillion connection to Redshift to run commands (database cursor).
    • Output from the function is not displayed in the Task History; instead it has to be read from a separate file.
    • Requires AWS Lambda, which has a cost implication.
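    On the output point: the invoke payload is written to the file given on the command line (/tmp/outputfile.txt above), while the --log-type Tail response includes a base64-encoded LogResult field containing the last lines of the function's log. A sketch of decoding it, using a sample response in place of the real CLI output:

    ```shell
    # Sample shape of the `aws lambda invoke --log-type Tail` response (illustrative)
    RESPONSE='{"StatusCode":200,"LogResult":"ZmlsZSBjb252ZXJ0ZWQgc3VjY2Vzc2Z1bGx5"}'
    # Extract the LogResult field and base64-decode it to read the log tail
    echo "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['LogResult'])" | base64 -d
    ```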

    AWS Glue

    AWS Glue can be used to create and run Python Shell jobs. These are Python scripts run as a shell script, rather than the original Glue offering of PySpark only. Any script can be run, providing it is compatible with Python 2.7.

    As with the Lambda function, first an AWS Glue ETL job must be created, as a Python Shell job, and then executed. The AWS Glue job is created by linking to a Python script in S3, granting an IAM role to run the script, and selecting any available connections, such as to Redshift, as in the example below.

    AWS Glue Configuration


    Again, the Glue Job can be created either via the console or the AWS CLI. Below are the steps to create and run the job via the AWS CLI from a Bash script component within Matillion. If the code doesn’t change, we recommend creating this job once and simply running the job from Matillion:

    #script must exist on S3 for glue
    aws s3 cp /tmp/xmltojson.py s3://mtln-laura-ireland/xmltojson.py --region eu-west-1
    aws glue create-job --name LauraPythonTest --role jc-glue-role --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://mtln-laura-ireland/xmltojson.py"}' --region eu-west-1
    

    #output results to a variable to use in Cloudwatch command
    JOBOUTPUT=$(aws glue start-job-run --job-name LauraPythonTest --region eu-west-1)

    #get job id from json
    JOBID=$(echo $JOBOUTPUT| cut -d '"' -f 4)
    echo $JOBID

    #get logs
    sleep 30

    aws logs get-log-events --log-group-name /aws-glue/python-jobs/output --log-stream-name $JOBID --region eu-west-1 > /tmp/output.txt
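    Cutting on quote characters to find the run ID is fragile if the response layout changes. A sketch of a more robust extraction, shown here against a sample of the start-job-run response shape (the run ID value is illustrative):

    ```shell
    # Sample shape of the `aws glue start-job-run` response (illustrative)
    JOBOUTPUT='{"JobRunId":"jr_0123456789abcdef"}'
    # Parse the JSON instead of counting quote-delimited fields
    JOBID=$(echo "$JOBOUTPUT" | python3 -c "import sys, json; print(json.load(sys.stdin)['JobRunId'])")
    echo "$JOBID"
    ```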

    Executing script in Matillion


    In AWS Glue, the output of a job run is written to CloudWatch. If it is required in Matillion, the output of the get-log-events command has to be written to a file on the Matillion server.

    Important Information

    To run this script the Matillion environment requires the below policies:

    • AmazonGlueServiceRole
    • IAMReadOnlyAccess

    Things to consider when using Glue:

    • Can specify a script timeout.
    • In the glue script, parameters can be defined and referenced in the Python. When the script is run, values for the parameters can be provided. This can be done by using Matillion variables in the Bash script. This means Glue scripts can be re-usable scripts and avoid having to push a new script with each execution.
    • Must be Python 2.7 or above.
    • Limited libraries - but can upload your own.
    • Python script needs to be on S3.
    • Cannot reuse Matillion connection into Redshift to run commands on Redshift (Database Cursor).
    • Requires AWS Glue which has a cost implication.

    Example with Variables

    It is often the case that part of the Python script needs to be dynamic. This is achieved using variables in Matillion. For example, in the SSH example discussed earlier, the S3 bucket and file names were hard-coded, but it is unlikely these will always stay the same.

    It is also unrealistic to edit the Python script every time the file or bucket names change. Instead, job variables can be set up in the Matillion job and referenced in the heredoc, which pushes the variable values to the remote server:

    Manage Job Variables


    ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON

    import xmltodict, json, boto3

    s3 = boto3.resource('s3')
    s3.meta.client.download_file('$s3Bucket', '$xmlFile', '/tmp/$xmlFile')

    with open('/tmp/$xmlFile', 'r') as myfile:
        xml_data = myfile.read()

    o = xmltodict.parse(xml_data)
    open('/tmp/$xmlFile.json', 'w').write(json.dumps(o))
    s3.meta.client.upload_file('/tmp/$xmlFile.json', '$s3Bucket', '$xmlFile.json')
    print 's3://$s3Bucket/$xmlFile.json'

    END_OF_PYTHON

    Please Note

    • The script has also been edited to return the bucket and filename created.
    • The output from the script run can also be pushed to an output file, which can then be read by another Matillion component and passed into another variable value, if required.
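    The second note can be sketched as follows: redirect the script's output to a file, then read it back into a shell variable that a later component can pick up (the file name and S3 path here are placeholders, not from the article):

    ```shell
    # Stand-in for the SSH command's output; in practice you would append
    # "> /tmp/python_output.txt" to the ssh command itself
    echo "s3://example-bucket/example.xml.json" > /tmp/python_output.txt

    # Read the result back so it can be passed into a Matillion variable
    RESULT=$(cat /tmp/python_output.txt)
    echo "$RESULT"
    ```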

    Generating Python File

    Some of the above options require that the Python to run is available as a file on the Matillion server. However, the user may wish to write the file from within Matillion. Matillion offers two components for doing this.

    Python Script Component

    The natural place to write the Python code is the Python Script component. Here you can take advantage of the colour coding of the code, and also use this approach to convert Python scripts already written into files to be run externally.

    Python Script Component


    The approach uses a Bash Script component to export the job containing the Python script using the Matillion API, parse out the relevant Python, and write the result to a file in /tmp. It requires a number of inputs to make the API call and identify the correct script.


    Export job using Bash Script


    When the script has run, it will either report that it has successfully created the file, throw an error explaining why the API call was unsuccessful, or throw an error saying the name of the Python Script component was incorrect. Please note all variables are case sensitive.

    Bash Component

    A different approach is to write the Python code directly into a Bash Script component and use this to generate a file. This can be done with a heredoc. An example is in the screenshot:

    Bash - Manage Script

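    A minimal sketch of the heredoc approach: quoting the delimiter ('END_OF_PYTHON') stops bash from expanding $ or backticks inside the script body, so the Python is written to disk verbatim:

    ```shell
    # Write the Python script to /tmp without executing it
    cat << 'END_OF_PYTHON' > /tmp/xmltojson.py
    import xmltodict, json
    with open('/tmp/example.xml') as f:
        print(json.dumps(xmltodict.parse(f.read())))
    END_OF_PYTHON

    # Confirm the file was written
    head -1 /tmp/xmltojson.py
    ```

    Leave the delimiter unquoted instead if you want Matillion variables like ${s3Bucket} to be substituted into the file as it is written.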

    The advantage of this approach is that it avoids the API call and is faster to create the file. It also means users aren’t tempted to "Run" the code from within Matillion.


    Conclusion

    We recommend considering the following points before developing Python scripts in Matillion:

    • Be clear on the purpose of the script and what it will do.
    • Consider running Python scripts that require considerable resources remotely.
    • Using the remote options discussed ensures the script can be initiated from Matillion but run in isolation.
    • The result of the script can easily be fed back to Matillion. There are, however, some cases where it is best to run the script directly within Matillion, for example any Jython code.
    • For a large XML file, the Python script can take up to 10 minutes to run.

    Warning

    Running Python scripts against large XML files on the Matillion server will affect the CPU and memory available to run Matillion jobs and could also have an adverse effect on other Matillion users.