
Intro

There are three tiles available in the DataSets tab of the Magic ETL interface:
  • Input DataSet
  • Output DataSet
  • Writeback
There is an additional tile currently available only with Domo on Databricks:
  • Query DataSet
All Magic ETL DataFlows require at least one input tile and an Output DataSet tile. These tiles represent actual DataSets.
  • Input tiles — Rather than always using an Input DataSet tile in the first position, you can substitute a SQL tile, which lets you write a custom SELECT statement, or a Query DataSet tile, which goes further by adding functionality for subdividing a query result set into smaller loads.
  • Output DataSet tile — An Output DataSet tile is required at the end of the DataFlow to power Domo cards and apps.
This article describes how to configure the Input DataSet, Output DataSet, Writeback, and Query DataSet tiles. Learn more about the SQL tile.

DataFlows and Personalized Data Permissions (PDP)

Important: Input DataSets in a DataFlow can’t be restricted by PDP policies—all available rows must pass through the DataFlow. This means you must apply PDP policies to the output DataSets of a DataFlow, rather than to the input DataSets. When you build a DataFlow using an input DataSet with PDP policies in place, the DataFlow breaks unless at least one of the following criteria applies:
  • You have an Admin security profile or a custom role with the Manage DataFlows grant.
  • You are the DataSet owner.
  • You are part of the DataSet’s All Rows policy. This gives you access to all of the rows in the DataSet.
To learn how to use PDP with DataFlows, see PDP and DataFlows/DataFusions.


Add an Input DataSet to a DataFlow

Input_DataSet_Tile.png
You must add at least one input DataSet to a DataFlow. Note that Virtual DataSets and Federated DataSets are treated differently in Magic ETL: you can use a Virtual DataSet as an input, but you can’t use a Federated DataSet. Follow the steps below:
  1. Open Magic ETL.
  2. From the left rail under DataSets, drag an Input DataSet tile to the canvas, opening the tile editor below the canvas.
  3. In the tile editor, select Choose DataSet to select the input DataSet for the DataFlow.
    Note: The maximum number of columns allowed in Magic ETL is 1,500.
  4. In the Select a DataSet modal, find and choose the DataSet. To add more than one input DataSet, you must repeat steps 2–4 for every input DataSet.
    choose dataset.jpg
  5. (Optional) In the tile editor, go to the Data tab to see a preview of the input data.

Configure Input DataSet Tiles

View DataSet details: The Details tab of the tile editor includes information about the DataSet like its name, owner, update dates, and column names and types.
Follow these steps to configure an Input DataSet tile:
  1. Drag an Input DataSet tile onto the canvas to open the tile editor.
  2. In the tile editor, select Choose DataSet. In the modal that displays, select either an existing DataSet or Create New to use a new DataSet.
    choose dataset.jpg
  3. In the Configuration tab of the tile editor, use the Data Selection dropdown to choose one of the following:
    data selection dropdown.jpg
    • All (default) — Loads all rows from the DataSet.
    • New — Loads new rows added to the DataSet since its last successful execution. Note: If you choose this option, you must set the data update method to Append, not Replace.
    • Filter by batch — Allows you to choose which batches to load based on creation date or name (for partitions). Choosing this option displays the Filter by batch area. Choose a) whether batches are filtered based on when the data was processed or a custom expression and b) the date parameters. Example: Data processed (in the) last 7 days.
  4. (Optional) If your input DataSet is partitioned, you can expose the partition name using the Partition Column Name field.
  5. In the Data Handling area, make any changes to column data types. This prevents any schema changes to the input data from breaking your DataFlow.
    change data type.jpg
  6. Use the Transform settings to set special handling for each column, including Bad values and Null values.
    data settings.jpg
    Learn how to automate DataFlow execution with Advanced DataFlow Triggering.

Output DataSet Tile

Output_DataSet_Tile.png
You must have an output DataSet in a DataFlow. Use the output DataSet to power cards (or other DataFlows).

Configure Output DataSet Tiles

Follow these steps to configure an Output DataSet tile:
  1. Make sure all the tiles in your DataFlow are connected and configured.
  2. In the left panel under DataSets, drag an Output DataSet tile to the canvas and connect it.
  3. In the tile editor, rename the tile. This is the name of the output DataSet that will display in the Data Center.
  4. Use the Update Method dropdown to choose how to handle updated data. Learn about update methods in Domo.
    Tip: You can preview the data in the output DataSet by running a preview, selecting the output DataSet tile, and then selecting the Preview tab.
    preview output.jpg
    View DataSet details: After the DataFlow has run at least once, the Details tab of the tile editor includes all information about the output DataSet.
  5. Select where to write your data by using the Add DataSet to Cloud dropdown.
  6. Enter the name of the Database where you want to store the data in your cloud.
  7. Enter the name of the Schema where you want to store the data in your database.
  8. Enter the name of the Table you want to use inside your schema.
Notes:
  • The option to Add DataSet to Cloud is only available if you are using Cloud Amplifier and have an integration that is configured for writeback.
  • The Database, Schema, and Table fields are available if your Cloud Amplifier integration supports location parameters inside your cloud (currently Snowflake only, with support for other engines coming soon).

Query DataSet Tile

Important: Currently, this tile is only available for Domo on Databricks.
You can use the Query DataSet tile to write a query against a DataSet as part of the input configuration process. This can increase efficiency when loading large tables accessed through Cloud Amplifier, which may otherwise be delayed by resource limitations in the underlying Cloud Data Warehouse (CDW). Proper use of this tile may resolve error messages like the following:
  • socket timeout
  • exceeding maximum heap size of 16 GB
Although the Query DataSet tile can be used with various data sources, we strongly recommend it for Databricks. In addition to a custom SELECT statement (an example query appears after this list), users can specify values for two additional fields:
  • Query partition key
  • # loads
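For illustration, a query written in the tile might look like the following sketch; the table and column names are hypothetical placeholders, and the statement follows the tile’s constraints (a single SELECT with no CTEs):

SELECT order_id,
       order_date,
       customer_id,
       order_total
FROM sales.orders
WHERE order_date >= '2024-01-01';

In this hypothetical example, a column such as order_date could also be a reasonable candidate for the Query partition key, provided every row has a non-NULL value.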

Configure the Query DataSet Tile

Follow these steps to configure the Query DataSet tile:
  1. In Magic ETL, expand the DataSets section in the left sidebar and drag a Query DataSet tile to the canvas. Note: This is only available for Domo on Databricks users.
  2. In the tile configuration panel below the canvas, use the Choose DataSet dropdown to select a Databricks DataSet.
  3. Write the query that identifies the data you want to bring into the Magic ETL DataFlow.
    Note: This query must be a SELECT query only and cannot use CTEs (common table expressions).
  4. Configure the Query Partition Key. The recommended process for determining this for Databricks tables appears below.
    Warning: The partition key column should not contain NULL values; these will be ignored. Choose only a column with exclusively non-NULL values as the partition key.
  5. Configure the # of Loads section.

Choose the Partition Key (Recommended Process for Databricks)

When choosing the partition key, the more evenly distributed the row count per column value, the better. For Databricks, we recommend choosing the partition key and defining the number of loads so that the largest batch is no larger than 4GB. Follow the steps below to do this:
  1. Find the overall size of a table in the Databricks UI or by querying as follows: DESCRIBE DETAIL <catalog>.<schema>.<table>
  2. Convert the sizeInBytes to GB by dividing by 2^30 = 1024^3.
  3. Divide by the recommended 4GB to obtain a preliminary value for the number of necessary divisions, as follows: divisions = (sizeInBytes / 2^30 ) / 4 Note: This is only an intermediate calculation, which should not be interpreted as the desired number of loads.
  4. Find the table’s total row count.
  5. Estimate how many table rows make 4GB, the limit for the largest batch, using the following equation: max_rows = row_count / divisions
  6. You can now begin evaluating which column to use as the partition key. Remember that the column must not contain any NULL values. Find the number of distinct values present in a candidate column using the following query:

    SELECT COUNT(DISTINCT <column> ) AS distinct_count
    FROM <table>;

    You’ll use the value for distinct_count in a later calculation.
  7. Find out how skewed or lopsided the table would be if partitioned according to this candidate column by examining the largest row counts using this query:

SELECT <column>, COUNT(*) AS row_count
FROM <table>
GROUP BY <column>
ORDER BY row_count DESC;

A batch consists of all rows associated with N distinct partition key values. In the worst case, the largest batch would consist of the partition key values with the N largest row counts. The running total is represented by this query:

WITH grouped_counts AS (
  SELECT <column>, COUNT(*) AS row_count
  FROM <table>
  GROUP BY <column>
)
SELECT <column>, row_count,
  SUM(row_count) OVER (
    ORDER BY row_count DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_row_count
FROM grouped_counts
ORDER BY row_count DESC;

For example, let’s say a date <column> is being evaluated as a partition key candidate:

<column>      row_count   cumulative_row_count
2024-10-13    100         100
2024-02-21     80         180
2024-05-17     60         240
2024-08-19     50         290
2024-03-06     30         320
2024-02-10     20         340
2024-11-30     10         350
Let’s go a step further. How many distinct values for this <column> could be included in a batch while remaining below the maximum row count (based on 4GB) that we calculated earlier? (In the example above, if max_rows were 300, the answer would be 4, because the cumulative row count reaches 290 after four values and then jumps to 320.) Here is the query we need:

WITH grouped_counts AS (
  SELECT <column>, COUNT(*) AS row_count
  FROM <table>
  GROUP BY <column>
),
ranked_counts AS (
  SELECT <column>, row_count,
    SUM(row_count) OVER (
      ORDER BY row_count DESC
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_row_count
  FROM grouped_counts
)
SELECT COUNT(*) AS max_column_values
FROM ranked_counts
WHERE cumulative_row_count < max_rows;

Replace max_rows with the value calculated in step 5, above. Then execute the preceding query to obtain an integer result: max_column_values.
  8. If you decide to use this candidate column as the partition key, use the results from steps 6 and 7 to calculate the minimum number of loads required to ensure that the largest load remains below 4GB, rounding up: “# loads” >= distinct_count / max_column_values. Note that the number of loads necessary to remain below 4GB depends on the choice of column for the partition key. To get a smaller number of loads, choose a column whose values are more uniformly distributed across the table. A worked example with illustrative numbers appears below.
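The following sketch walks through the calculation with made-up numbers (a 12 GB table with 90 million rows and a date column with 365 distinct values); it is illustrative only, so substitute the values from your own table:

-- Illustrative numbers only; substitute the values from your own table.
-- DESCRIBE DETAIL reports sizeInBytes = 12884901888, so 12884901888 / 2^30 = 12 GB.
-- divisions = 12 / 4 = 3
-- Total row count = 90,000,000, so max_rows = 90,000,000 / 3 = 30,000,000.
-- Suppose the candidate column has distinct_count = 365 and the cumulative-count
-- query (with max_rows = 30000000) returns max_column_values = 90. Then, rounding up:
SELECT CEIL(365 / 90) AS min_loads;  -- returns 5, so set “# loads” to at least 5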

Writeback Tile

Writeback_Tile.png
You can use the Writeback tile to output the transformed data as a DataSet and send the data to another system.

Configure Writeback Tile

Follow these steps to configure a Writeback tile:
  1. Drag a Writeback tile to the canvas in place of an Input DataSet tile.
  2. Rename the Writeback tile and then specify the DataSet name you want to output by selecting Edit and entering a name and a description.
  3. (Optional) Select the Additional Options menu and select Update Method to either replace (default) or append data.
  4. Select Configure Writeback, then select the Writeback Connector you want to use.
  5. Connect other tiles to the Writeback tile to complete your DataFlow and end with an Output DataSet tile.
    View Writeback tile details: After the DataFlow has run at least once, the Details tab of the tile editor includes all information about the Writeback output DataSet.

Fixed Input Tile

You can use this tile to validate transformations, prototype logic, demonstrate outputs with meaningful placeholder data, or build DataFlows when your input DataSet isn’t available. Find the Fixed Input tile under Datasets in the left panel of the Magic ETL authoring canvas.
Note: To use AI features in the Fixed Input tile, your instance must have AI Services enabled. If AI Services are not enabled, you can still enter data manually, paste values, or upload files.

Input Data Manually into Your Fixed Input Tile

To manually input data, follow these steps:
  1. Drag the Fixed Input tile from the Datasets section to your canvas.
  2. Enter column names in the New Column header fields.
  3. Select the small text under the left side of each column name to set the data type.
  4. Enter data directly into the table.
  5. (Optional) Copy and paste data from another source into the rows and columns.

Upload a File to Your Fixed Input Tile

To upload a file, follow these steps:
  1. Drag the Fixed Input tile from Datasets to your canvas.
  2. Select the Generate data / Upload button in the lower-right corner.
  3. In the modal, go to the Upload File tab.
  4. Drag and drop your file or select the Upload File button.
  5. Specify whether the file includes header rows, then select Import. Your data appears in the configuration panel. You can add columns, edit values, or use it as-is.
Important: You can upload up to 10,000 rows and 1,000 columns per file. Supported formats include .csv and .xlsx.

Use AI to Generate Sample Data for Your Fixed Input Tile

To generate sample data using AI, follow the steps below. Alternatively, you can generate AI data for a single column by selecting the sparkle icon in that column’s header. Enter a short description of the values and the number of rows. This method is faster than generating an entire DataSet but may still take a few seconds.
  1. Drag the Fixed Input tile to your canvas.
  2. Select the Generate data / Upload button.
  3. In the Generate Data tab, enter a short prompt describing the data (for example: “50 customer transactions with name, date, and amount”).
  4. Choose the number of rows to generate (recommended: 50).
  5. Select Generate Data.
  6. Wait a few moments for the table to populate.

Save and Use the Data in Your Fixed Input Tile

When your Fixed Input data is ready, connect it to any downstream transformation tile in your DataFlow. The data is stored in the DataFlow definition—not as a standalone DataSet. To create a separate DataSet, connect the tile to an Output DataSet tile. When the DataFlow runs, Domo creates a new DataSet from that output. Learn how to configure Output DataSet Tiles.

FAQ

What precision and scale does Domo use for fixed decimal columns?
Domo uses a precision/scale of (19,5) for fixed decimals, meaning that fixed decimals will only ever have 5 digits to the right of the decimal point. You can request that your DataSet be configured to use a higher precision/scale.

How much data can I add to a Fixed Input tile?
Uploaded files are limited to 10,000 rows and 1,000 columns. There’s no defined limit for manually entered or pasted data, though future limits may apply.

Does the Fixed Input tile create a standalone DataSet?
No. Data stays in the DataFlow definition. To create a standalone DataSet, connect the tile to an Output DataSet tile.