Fixed Schema Definitions With MongoDB and DynamoDB
In this article we will see how we can force the MongoDB and DynamoDB Query components to resolve Data sources (collections) and available columns from predefined schema-definition files (RSDs) rather than use automatic schema discovery to determine these properties.
The process involves generating a bunch of RSDs which define each collection, its constituent columns and their datatypes. Once generated, the component can be (optionally) forced to use these schema definitions instead of scanning a sample dataset to determine this information during design time.
Generating schema files
The MongoDB and DynamoDB components can generate these RSDs automatically by adding certain connection options.1. Go to: Project → Manage API Profiles.
2. Create a new profile and name it appropriately, e.g. 'MongoDB_DBName'. Remember this <name> as we will be using it exactly in the following steps.
3. Close the API Profiles screen.
4. Create a new Orchestration job and Configure a new MongoDB or DynamoDB component (or copy a preconfigured component from another job). Ensure it is pointing at the database you are interested in generating the schema for.
5. Add the following connection options:
⚬ Location: /usr/share/tomcat8/api_profiles/<name>
⚬ GenerateSchemaFiles: OnStart
6. Run the component (Right-click + Run Component)
7. Remove the Connection Options Location and GenerateSchemaFiles to avoid accidentally generating them again. Optionally, delete this job if you no longer need it.
The above run will generate a set of RSD files in folder: /usr/share/tomcat8/api_profiles/<name>
You may use any other location on the matillion server. The API Profile editor defaults to this location when editing profiles and we can use the editor to review/edit these schema definitions as explained below.
Reviewing/Editing the Generated schema
The API Profile editor is a simple text editor in matillion that is typically used to manage Profiles defined for the API Query component. By default it looks for available profiles in folder - /usr/share/tomcat8/api_profiles. Choosing to use the same location helps us use the editor to manage the schema definitions we generate.
To review or edit the generated schema files,
1. Go To Project → Manage API Profiles
2. Edit profile <name>.
You will now see a bunch of schema definitions to the left. You may change these as desired or create new definitions by copying from existing. Please note that the name of the Data-Source (RSD) should match the name of your collection in MongoDB/DynamoDB.
Using generated schema definitions
The Location connection option determines if a component should look for an existing schema definition or query MongoDB/DynamoDB for schema related information.
To force a component to use schema definitions, add the Location Connection Option and set it to the appropriate folder - in our example, its /usr/share/tomcat8/api_profiles/<name>.
Next time you click on the Data Source button, it will only list the profiles found in that location and any columns defined in those files. The component will no longer query for a list of tables/collections nor try to determine the columns and their data types.
Dealing with rare columns
One of the challenges with a schemaless databases like MongoDB and DynamoDB is that not all columns may be present in all documents. This may lead to certain columns (rare) missing a column-definition for in the RSD that is generated.
Typically, the component will analyse ~100 rows from each collection to identify the available columns and the most-suitable data type. If the rare column is not see in the sample then a definition will not be created.
The RowScanDepth Connection option may be used to adjust the number of rows scanned from each collection. If the rare column exists in this now-larger sample then a definition definition will be created.
Alternatively, you may generate the basic definition and then edit it to manually add the column definition yourself as outlined in the section above.