Skip to main content

xls2yml

Synopsis

starlake xls2yml [options]

Description

Generate Starlake YAML configuration files from Excel spreadsheets. This command is designed for teams where business analysts define data schemas in Excel (a familiar format), and data engineers generate the corresponding Starlake YAML files automatically.

The Excel workbook must follow a specific structure with predefined sheets and columns. Starlake reads the workbook and produces:

  • A _config.sl.yml file per domain
  • A {table_name}.sl.yml file per schema/table
  • Optionally, IAM policy tags and ACL/RLS policies

See also: yml2xls (reverse operation), xls2ymljob (job/transform variant).

Parameters

ParameterCardinalityDescription
--files <value>RequiredList of Excel files (.xlsx or .xls) describing domains & schemas. Can also be a directory (recursively finds all .xlsx files).
--iamPolicyTagsFile <value>OptionalPath to an Excel file for generating IAM PolicyTags YAML.
--outputDir <value>OptionalPath for saving the resulting YAML files. Defaults to the Starlake domains path.
--policyFile <value>OptionalExcel file for centralizing ACL & RLS definitions.

Excel Workbook Structure

The Excel file must contain the following sheets:

Sheet: _domain

Defines the domain (one row). Maps to _config.sl.yml.

ColumnRequiredDescription
_nameYesDomain name (maps to a database schema/dataset)
_pathNoDirectory path for incoming files
_ackNoAcknowledgment file pattern
_descriptionNoDomain description
_renameNoRename the domain in the target warehouse
_tagsNoComma-separated tags
_dagRefNoDAG reference for orchestration
_frequencyNoSchedule/frequency (e.g., daily, cron expression)
_freshnessNoFreshness rules as key=value pairs (e.g., warn=24h,error=48h)

Sheet: _schemas

Defines tables/schemas (one row per table). Each table listed here must have a corresponding attribute sheet.

ColumnRequiredDescription
_nameYesTable/schema name (must match a sheet name in the workbook)
_patternYesRegex pattern for matching incoming files (e.g., orders.*.csv)
_writeNoWrite strategy: OVERWRITE, APPEND, ERROR_IF_EXISTS, SCD2
_formatNoFile format: DSV, POSITION, XML, JSON
_headerNoWhether the file has a header row (true/false)
_delimiterNoField separator character (e.g., ,, |, ;)
_delta_columnNoTimestamp column for merge/upsert operations
_merge_keysNoComma-separated merge key columns
_descriptionNoTable description
_encodingNoFile encoding (default: UTF-8)
_partitioningNoComma-separated partition columns
_clusteringNoComma-separated clustering columns
_primary_keyNoComma-separated primary key columns
_tagsNoComma-separated tags
_renameNoRename the table in the target warehouse
_policyNoAccess policy name (references the _policies sheet)
_presqlNoPre-SQL statements (separate multiple with ###)
_postsqlNoPost-SQL statements (separate multiple with ###)
_dagRefNoDAG reference
_frequencyNoSchedule/frequency
_freshnessNoFreshness rules as key=value pairs

Sheet: _policies (optional)

Defines ACL and RLS policies.

ColumnRequiredDescription
_nameYesPolicy name
_predicateNoWHERE clause for RLS (defaults to TRUE)
_grantsYesComma-separated list of groups/users
_descriptionNoPolicy description

Sheet: _iam_policy_tags (optional)

Defines IAM policy tags for column-level access control.

ColumnRequiredDescription
_policyTagYesPolicy tag identifier
_membersYesComma-separated members
_roleYesIAM role

Attribute Sheets (one per table)

For each table listed in _schemas, create a sheet with the exact same name as the _name value. Each row defines one attribute/column.

ColumnRequiredDescription
_nameYesAttribute/column name. Use dot notation for nested structs (e.g., address.street).
_typeYesData type: string, integer, long, double, boolean, date, timestamp, struct, etc.
_renameNoRename the column in the target warehouse
_requiredNoWhether the field is required (true/false)
_privacyNoPrivacy transformation: MD5, SHA1, INITIALS, HIDE, or SQL:expression
_metricNoMetric type: CONTINUOUS, DISCRETE
_defaultNoDefault value
_scriptNoTransformation script (e.g., current_date())
_descriptionNoAttribute description
_position_startNoStart position for POSITION format (1-indexed)
_position_endNoEnd position for POSITION format (1-indexed)
_trimNoTrim strategy: LEFT, RIGHT, BOTH
_ignoreNoIgnore this attribute (true/false)
_foreign_keyNoForeign key reference (e.g., other_domain.other_table.column)
_tagsNoComma-separated tags

Example

Given an Excel file starbake.xlsx with:

  • A _domain sheet defining domain "starbake"
  • A _schemas sheet listing tables "customers" and "orders"
  • A customers sheet with columns (id, name, email, signup_date)
  • An orders sheet with columns (order_id, customer_id, amount, order_date)
starlake xls2yml --files starbake.xlsx

This generates:

metadata/load/starbake/
├── _config.sl.yml # Domain definition
├── customers.sl.yml # customers table schema + attributes
└── orders.sl.yml # orders table schema + attributes