Data Structure

Files Location

Each table will contain files under a path with the format: <table_name>/year=yyyy/month=mm/day=dd/hour=hh/<file_name>.parquet. For example: transactions/year=2024/month=01/day=01/hour=01/part-00000-2bfb9dac-fdd5-4025-b8fa-174eedde9d45-c000.snappy.parquet.

Each hourly location will contain one or more Parquet files, each holding a portion of that hour's incremental data. Do not rely on a specific number of files; the count can vary.
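As a minimal sketch, the hourly partition prefix described above can be built from a timestamp like this (the function name and the UTC assumption are illustrative, not part of the product):

```python
from datetime import datetime, timezone

def hourly_prefix(table: str, ts: datetime) -> str:
    """Build the hourly partition prefix for a table, following the
    <table_name>/year=yyyy/month=mm/day=dd/hour=hh/ layout from the docs."""
    return (
        f"{table}/year={ts.year:04d}/month={ts.month:02d}/"
        f"day={ts.day:02d}/hour={ts.hour:02d}/"
    )

# Matches the example path shown above (hypothetical use of UTC timestamps).
print(hourly_prefix("transactions", datetime(2024, 1, 1, 1, tzinfo=timezone.utc)))
# transactions/year=2024/month=01/day=01/hour=01/
```

When listing this prefix, treat every .parquet file found under it as part of that hour's increment, since the file count can vary.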

Incremental Data Content

Due to the nature of a Change Data Capture (CDC) pipeline, the data we transmit is incremental. This means that every change to a row—whether an INSERT, UPDATE, or DELETE—is treated as a separate incremental event. Each change results in the full row being written to the corresponding file. For DELETE operations, the row is transmitted with NULL values for all columns except the identifier (id) column.
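A minimal sketch of replaying these incremental events into a local copy, using hypothetical rows (the column names other than id and cdc_operation_type are invented for illustration):

```python
# Hypothetical incremental events, as rows might appear after reading the Parquet files.
events = [
    {"id": 1, "amount": 100, "cdc_operation_type": "INSERT"},
    {"id": 1, "amount": 150, "cdc_operation_type": "UPDATE"},
    {"id": 2, "amount": None, "cdc_operation_type": "DELETE"},  # non-id columns are NULL
]

state = {}
for row in events:
    if row["cdc_operation_type"] == "DELETE":
        # DELETE carries only the id; drop the row from the local copy.
        state.pop(row["id"], None)
    else:
        # INSERT and UPDATE both carry the full row, so upsert works for either.
        state[row["id"]] = row

print(state)  # only id 1 remains, with its latest values
```

Because INSERT and UPDATE both deliver the full row, an upsert handles both cases; this also absorbs the duplicate rows and UPDATE-before-INSERT cases described in the notes below.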

note

Because the application is dynamic and constantly updating, you may occasionally receive the same row more than once, even if the visible data hasn't changed. This can happen, for example, when internal columns—those not configured to be included—are updated.

In such cases, you'll receive "duplicate rows".

note

An INSERT event is not guaranteed to be the first operation you receive for a given row. In certain tables, the association with a specific client is established only after the initial write to the database. Until that association is in place, the row is not eligible for transmission. As a result, the first event you receive for a row may be an UPDATE rather than the initial INSERT. This behavior is expected and depends on the table's design and data flow.

note

In some tables, the same row can be updated multiple times within the same timestamp (transaction). To determine the latest state of the row, use the cdc_change_seq column, described in the Special Columns section below.

Special Columns

To track changes in the continuous data stream, each table includes special columns that identify the operation type, the change sequence, and the commit timestamp. The three columns below are added to every table:

  1. cdc_operation_type - Operation type executed for the row (UPDATE, INSERT or DELETE).
  2. cdc_change_seq - A unique incrementing number. Use it for ordering the results.
  3. cdc_commit_timestamp - The commit timestamp.
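A minimal sketch of using cdc_change_seq to resolve multiple changes to the same row, including changes that share a commit timestamp (the id and status fields are hypothetical):

```python
# Hypothetical events: the same id changed twice with an identical
# cdc_commit_timestamp; cdc_change_seq disambiguates the order.
events = [
    {"id": 7, "status": "settled", "cdc_change_seq": 102,
     "cdc_commit_timestamp": "2024-01-01T01:00:00"},
    {"id": 7, "status": "pending", "cdc_change_seq": 101,
     "cdc_commit_timestamp": "2024-01-01T01:00:00"},
]

latest = {}
# Sort by cdc_change_seq so later writes overwrite earlier ones per id.
for row in sorted(events, key=lambda r: r["cdc_change_seq"]):
    latest[row["id"]] = row

print(latest[7]["status"])  # the row with the highest cdc_change_seq wins
```

Sorting by cdc_change_seq rather than cdc_commit_timestamp is the safe choice here, since the timestamp alone cannot order changes within one transaction.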

Data Freshness

Data transfers occur hourly and contain the changes from the previous hour. For each table, any change will appear under the corresponding Files Location path.

note
  • Data updates may experience delays of up to 4 hours.
  • Column deletions occur automatically and without notice; the column will disappear from future files.
  • New columns and tables will be introduced over time as part of ongoing product development and enhancements.
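Because files may arrive up to 4 hours late, a consumer may want to re-scan recent hourly partitions rather than only the latest one. A minimal sketch, where the 4-hour lookback window is an assumption to tune:

```python
from datetime import datetime, timedelta, timezone

def partitions_to_scan(table: str, now: datetime, lookback_hours: int = 4) -> list:
    """Hourly partition prefixes to re-check. The lookback covers the
    up-to-4-hour delay mentioned in the docs (window size is an assumption)."""
    prefixes = []
    for h in range(1, lookback_hours + 1):
        ts = now - timedelta(hours=h)
        prefixes.append(
            f"{table}/year={ts.year:04d}/month={ts.month:02d}/"
            f"day={ts.day:02d}/hour={ts.hour:02d}/"
        )
    return prefixes

# Hypothetical poll at 05:00 UTC re-checks hours 04, 03, 02, and 01.
print(partitions_to_scan("transactions", datetime(2024, 1, 1, 5, tzinfo=timezone.utc)))
```

Track which files have already been ingested (e.g. by file name) so re-scanning a partition does not double-apply events.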

Reloading Data

Our internal export process uses a checkpointing mechanism to track which data has already been delivered, ensuring efficient and consistent incremental updates. If you ever need to re-export data from a specific point in time, please contact the Unit team.

note

The schema of each table can be inferred from the Parquet files themselves. If you need a table-to-column mapping, please reach out to your Unit contact.