
Intro

Data Selection allows you to load only the batches updated since the last successful execution, or you can specify the batches to be loaded based on their creation date or partition name. You can customize your approach to your organization’s specific needs. Learn about configuration and best practices for partitioning and other DataSet update methods in Magic ETL. This article covers how to configure Data Selection in your DataFlow and provides a use case for a custom expression.

Configure Data Selection for Input DataSet

Magic ETL provides three data selection methods when configuring your input DataSet for a DataFlow: All, New, and Filter by batch. These methods are described in the process below.
Follow these steps to set up initial data processing in Magic ETL:
  1. Navigate to the Data Center and locate the DataSet for which you want to set up data processing in Magic ETL.
  2. Select Open With > Magic ETL. The Magic ETL canvas opens with your DataSet as the input DataSet.
  3. Select the input DataSet to open the tile editor below the canvas. The editor has four tabs.
  4. In the Configuration tab, use the Data Selection dropdown to choose the method that best suits your needs.
    The methods include:
    • All — This is the default method. Selecting this method loads all rows from your DataSet. To learn more about automatic optimization, see Magic ETL DataFlow Auto Append Processing.
    • New — Select this method to load only the batches that have been updated since the last successful execution, including restated partitions.

      If you use this method and your DataSet has a Replace operation, the next DataFlow execution loads all batches from the Replace and any batches appended since the Replace. Your outputs continue to perform the action that you have saved for the DataFlow (such as Append or Replace). Learn more about update methods in Magic ETL.
      Note: If you occasionally replace all batches in your input DataSet and always append to your output DataSet within the DataFlow, there may be duplicate data in your output DataSet.
    • Filter by batch — Select this method if you want to specify which batches to load based on their creation date or, in the case of partitions, their name. If you use this method, you must also complete the section of the editor labeled Filter By Batch, described below.

Filter by Batch

In the Import data when dropdown, choose one of the available options:
  • Data processed — Load data using a specific processing date. Then define the logic you want in the Date is in fields. Example: Date is in Last 5 Days.
  • Custom expression — Load data using a custom expression that you write in the freeform field. Custom expressions filter on batch metadata values, such as the partition name (batch.name), as shown in the example below and in the use case at the end of this article.

    Partition filtering is based on the DataSet’s metadata rather than the values within the data itself. When you define a subset of the DataSet, you are selecting a subset of the metadata that describes the data, not specific values within the data.
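    For example, a minimal expression that loads only the partitions from the last 7 days might look like the following. This is a sketch that assumes, as in the use case at the end of this article, that each partition name (batch.name) is a date.

    DATE(batch.name) >= CURRENT_DATE() - 7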
  5. Make any changes to the Data Handling settings.
  6. Save your DataFlow. You can run a preview to test your configuration.

Use Case

The formula shown below takes advantage of the fact that the input DataSet is partitioned by date. Because the data is split into partitions by date, the formula can use date-related functions to filter it. Learn more about partitioning.
This formula creates a filter that compares the partition name (which is a date) to the current date. If the current date is the first day of the month, the filter allows all partitions from the past year to pass through. On any other day, the filter only allows the last 30 days’ worth of partitions to pass through. This strategy lets you process less data on most days while still refreshing a full year’s worth of data once per month. You can use this approach in various scenarios, and we recommend that you experiment with it.
DATE(batch.name) >= CURRENT_DATE() - (CASE WHEN DAYOFMONTH(CURRENT_DATE()) = 1 THEN 365 ELSE 30 END)
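As a sketch of how you might adapt this pattern (again assuming the partition name is a date), the following variation widens the daily window to 90 days while still reloading the full year on the first day of each month. The specific window sizes are illustrative, not recommendations.

DATE(batch.name) >= CURRENT_DATE() - (CASE WHEN DAYOFMONTH(CURRENT_DATE()) = 1 THEN 365 ELSE 90 END)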