
Intro

Magic ETL has four possible DataSet update methods in the Output DataSet tile: Replace, Append, Partition, and Upsert. The best update method for your DataFlow depends on your use case.

Update Methods

This table gives a basic definition of each method; see the corresponding section below for more detail. For update methods outside of Magic ETL, see our article on general DataSet Update Methods.

Replace: Each execution replaces the output DataSet’s contents with its latest results.

Append: Each execution appends its results to the output DataSet as a new batch.

Partition: Each execution groups rows into batches according to a name, which is the value of a specific column. Each unique batch name in a DataSet is called a partition. These partitions are all appended to the DataSet, and any pre-existing partitions with matching names are replaced. If a partition filter expression is specified, all partitions are evaluated against it, and any that do not pass are deleted.

Upsert: Each execution uses a defined key (a value from a column, or combination of columns, that identifies a unique row in the DataSet) to compare the rows resulting from the execution with the rows already in the output DataSet. Rows with keys that match existing rows are updated in the output DataSet; rows with keys not already present are appended to the output DataSet.

Replace

Replace is the default update method for Magic ETL. It removes all existing data in the DataSet and replaces it with the new batch of data from the execution of a Magic ETL DataFlow. This method is good if your output DataSet is relatively small or if all the data needs to be updated every time the DataFlow executes.
[Image: Replace update method animation]

Append

Append takes all data resulting from the execution of a Magic ETL DataFlow and adds it to the output DataSet. This method does not consider any of the existing data in the output DataSet, so duplicates are neither detected nor removed. Append is a good choice if each execution of your DataFlow produces net new rows to add to the existing output DataSet. However, this update method can be brittle: a re-run execution duplicates its rows, and a missed execution leaves a gap in the data.
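The difference between Replace and Append can be sketched in a few lines of Python. This is an illustrative analogy, not Domo's implementation; the rows and column names are made up.

```python
# Illustrative sketch of Replace vs. Append semantics.
# The rows and columns here are hypothetical, not a real Domo DataSet.
existing = [{"id": 1, "sales": 100}]
new_batch = [{"id": 1, "sales": 120}, {"id": 2, "sales": 80}]

replaced = list(new_batch)             # Replace: prior contents are discarded
appended = existing + list(new_batch)  # Append: rows added as-is; id 1 is now duplicated
```

Note how Append keeps both versions of the `id 1` row, which is exactly the duplication risk described above.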

Partition

Partitioning is a way of organizing your data into groups to make data updates faster and more efficient. The graphic below demonstrates replacing and appending data with partitions.
[Image: Partition graphic]

Partition Basics

A common way to understand partitions is to think of a physical encyclopedia: a collection of books that are part of the same set but individually different. If one book is updated and republished, you can buy a new copy of that book without replacing the entire set. Partitions divide your data into separate collections, or subsets, so that when you replace some data, you do not need to replace all of it. Appending data when using partitions is like buying a new volume of the encyclopedia and adding it to the end of your collection.

When defining a partition in Magic ETL, you must identify or create a column that divides the data into logical groupings. A good example is a date column in a transactions table: each date in the column represents a partition, and each partition contains multiple rows (transactions).

After partitions are defined on a DataSet, new data is evaluated based on the partition column. If data with the same partition name already exists in the output DataSet, the new data replaces ALL of the old data under that partition name. If no data for that partition name exists in the output DataSet, the new data is appended to the output DataSet as a new partition. For example, if your input DataSet contains data for March 1st, March 2nd, and March 3rd (and your partitions are defined on that date), any existing data for March 1st, 2nd, or 3rd is replaced in the output DataSet by the new, incoming data.
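The replace-or-append behavior described above can be sketched in Python by modeling the output DataSet as a mapping from partition name to rows. This is a simplified analogy, not Domo internals; the dates and rows are hypothetical.

```python
# Output DataSet modeled as {partition name: rows} (illustrative only).
dataset = {
    "2024-03-01": [{"txn": "A", "amount": 10}],
    "2024-03-02": [{"txn": "B", "amount": 20}],
}
incoming = [
    {"date": "2024-03-02", "txn": "C", "amount": 25},  # partition exists -> replaced
    {"date": "2024-03-03", "txn": "D", "amount": 30},  # new partition -> appended
]

# Group incoming rows into batches by the partition column ("date").
batches = {}
for row in incoming:
    batches.setdefault(row["date"], []).append(row)

# Each incoming batch replaces the whole matching partition, or adds a new one.
for name, rows in batches.items():
    dataset[name] = rows
```

After this runs, the March 1st partition is untouched, the March 2nd partition holds only the new rows, and March 3rd exists as a new partition.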

Partition Best Practices

A good partition needs to be large enough to take advantage of partitioning’s enhanced processing speed and granular enough to meet your specific needs. For most situations, a date-based, day-level partition strikes a good balance between these two elements. However, if you need to keep more than five years’ worth of day-based partitions in a single DataSet, you may want to consider using a week-level or month-level partition instead.

Some organizations need partitions that are more granular than day level. If that is the case for your organization, you can create a custom partition by combining two or more columns in your DataFlow. For example, if you have eight store locations, you might want to create a partition that combines the store ID and the date.
Important: There is a 3000-partition limit for each DataSet. If you create too many partitions, you may reach this limit and experience problems. In the example from the previous paragraph, if you create a partition key using store ID + date, you can only store 375 days’ worth of data (3000 partitions divided by 8 stores).

Configure Partitioning

Follow the steps below to configure partitioning as the update method for your DataSet in Magic ETL:
Note: These instructions are valid for creating a net new DataFlow with all new partitions.
  1. Drag an output DataSet to the Magic ETL canvas and select it to expand the editor at the bottom of the screen.
  2. In the Configuration tab of the editor, choose Partition as the update method.
  3. From the Partition name column dropdown, select the column you want to use to divide the data.
  4. (Optional) If you want to configure a retention policy for which partitions should be kept in your output DataSet on an ongoing basis, toggle the switch labeled Specify which partitions to keep.
  5. (Conditional) If you enable the Specify which partitions to keep option, a field appears where you can enter an expression. Partitions matching the expression are kept; all other partitions are deleted. Note: This field only validates the partition.name expression.
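Conceptually, the retention expression in step 5 acts as a keep-filter over partition names. Domo's actual expression syntax operates on partition.name; the Python predicate below is only an analogy, with made-up dates.

```python
# Output DataSet modeled as {partition name: rows} (illustrative only).
dataset = {
    "2024-02-28": [{"txn": "A"}],
    "2024-03-01": [{"txn": "B"}],
    "2024-03-02": [{"txn": "C"}],
}

# Stand-in for a partition filter expression: keep March partitions only.
def keep(partition_name):
    return partition_name >= "2024-03-01"

# Partitions that fail the expression are deleted from the DataSet.
dataset = {name: rows for name, rows in dataset.items() if keep(name)}
```

This mirrors the ongoing retention behavior: every execution re-evaluates all partitions against the expression, not just the incoming ones.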

Upsert

Upsert allows you to update rows in your output DataSet, based on a unique row-level key, without reprocessing the full DataSet. The graphic below demonstrates replacing and appending data with Upsert.
[Image: Upsert (merge) graphic]

Upsert Basics

With Upsert, each DataFlow execution neither replaces the entire DataSet nor blindly appends every incoming row. Instead, each incoming row is either appended or used to replace an existing row, depending on whether it is new or an update to a row that already exists in the output DataSet. Upsert uses a key to determine whether a record resulting from a DataFlow execution is new or already exists. If a row’s key value matches that of an existing row, the incoming row replaces the existing row. If no match is found, the incoming row is treated as a new row and is appended to the output DataSet.
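The match-then-update-or-append behavior can be sketched by modeling the output DataSet as a dict keyed on the Upsert key. This is an analogy, not Domo internals; the id column and rows are hypothetical.

```python
# Output DataSet modeled as {key: row}; "id" is a hypothetical Upsert key.
existing = {
    1: {"id": 1, "status": "open"},
    2: {"id": 2, "status": "open"},
}
incoming = [
    {"id": 2, "status": "closed"},  # key matches -> existing row is replaced
    {"id": 3, "status": "open"},    # new key -> row is appended
]

# One pass covers both cases: assignment updates a matching key or adds a new one.
for row in incoming:
    existing[row["id"]] = row
```

After this runs, row 1 is untouched, row 2 reflects the incoming update, and row 3 has been appended.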

Upsert Best Practices

An Upsert key can be a value in a single column or a combination of values from multiple columns, but the key must uniquely identify each row: a column (or combination of columns) containing duplicate or null values cannot be used as an Upsert key. Upsert is a good update method if you have occasional updates to individual records spread throughout your DataSet, with relatively few records updating each time. It is not a good fit if you replace many or most records in a DataSet each time; in those cases, Replace gives better performance. The availability of Upsert as an update method in a Magic ETL DataFlow should remove the need to create recursive DataFlows to update output DataSets.
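The uniqueness and non-null rules above can be restated as a small validation check. This helper is hypothetical, not part of Magic ETL; it simply encodes the doc's rules for what can serve as an Upsert key.

```python
def valid_upsert_key(rows, key_cols):
    """Return True only if the key columns contain no nulls and no
    duplicate key values, per the Upsert key rules described above."""
    seen = set()
    for row in rows:
        key = tuple(row.get(col) for col in key_cols)
        if any(value is None for value in key):
            return False  # a null in any key column disqualifies the key
        if key in seen:
            return False  # duplicate key values disqualify the key
        seen.add(key)
    return True
```

For example, `valid_upsert_key([{"id": 1}, {"id": 1}], ["id"])` is False because the key value repeats, while a store ID + date combination can qualify even when neither column is unique on its own.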

Upsert Configuration

Follow these steps to configure Upsert as the update method for your DataSet in Magic ETL:
  1. Drag an output DataSet to the Magic ETL canvas and select it to expand the editor at the bottom of the screen.
  2. In the Configuration tab of the editor, choose Upsert from the Update Method dropdown.
  3. From the Upsert Key dropdown, select the column(s) you want to use to identify the Upsert key for your data.
    [Image: Upsert output configuration]