We have a use case where we load raw feed data from S3 onto EMR and use Hive queries to create temp tables and transform the data before loading it into Snowflake. I didn't notice any mention of Hive/EMR in the documentation/integration guide. Is there a custom way of doing this?
1 Community Answer
Ian Funnell —
Matillion doesn’t have dedicated components for defining or running EMR jobs: its transformation components target Snowflake (or BigQuery, or Redshift).
It sounds like what you’ll need to do is:
Continue with your current EMR jobs to process the S3 input data
Have the EMR jobs write their output to S3, then load those files into Snowflake using an S3 Load component. Alternatively, if the EMR output lands on HDFS, you could use an S3 Put Object component to copy the files into S3 yourself.
Use Snowflake to perform data merging, aggregation and further transformations, inside Matillion Transformation jobs.
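For reference, the load step above boils down to a Snowflake COPY INTO statement against an external stage (this is essentially what an S3 Load component generates for you). A minimal sketch, where the table name, bucket path, and file format are all assumptions for illustration:

```sql
-- Hypothetical names: raw_feed_table, my-bucket/emr-output, and the
-- CSV settings are placeholders; substitute your own EMR output details.
COPY INTO raw_feed_table
FROM 's3://my-bucket/emr-output/'
CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
```

Once the raw data is in a Snowflake table, the merging and aggregation can be expressed as ordinary SQL inside Matillion Transformation jobs.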