> For the complete documentation index, see [llms.txt](https://upsolver.gitbook.io/content/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://upsolver.gitbook.io/content/reference-1/sql-commands/iceberg-tables/upsolver-managed-tables/create-iceberg-table.md).

# CREATE ICEBERG TABLE

## Syntax

{% code overflow="wrap" %}

```sql
CREATE ICEBERG TABLE <table_identifier>
([ <column_name> <column_type> [, ...]])
    { PARTITION | CLUSTER } BY <column_name> [, ...]
    PRIMARY KEY <column_name> [, ...]
    -- Specify sort order (default is ASC)
    ORDER BY <column_name> [ASC | DESC] [, ...] -
    [{ table_options }];
```

{% endcode %}

#### Jump to

* [Table identifier](#table-identifier)
* [Column types](#column-types)
* [Partition/Cluster clause](#partition-cluster-clause)
* [Primary key clause](#primary-key-clause)
* [Order by clause ](#order-by-clause)
* [Table options](#table-options)

## Table identifier

Table identifiers are provided in the following format:

```sql
<catalog_name>.<schema_name>.<table_name>
```

Note that only metastore connection types are accepted for the catalog name.

Valid table names match the following identifier format:

```sql
identifier = "([^"]|"")*"|[A-Za-z_][A-Za-z0-9_]*;
```

## Column types

```sql
COLUMN_TYPE = { DATE 
              | TIMESTAMP 
              | BIGINT 
              | DOUBLE 
              | STRING 
              | BOOLEAN }
```

## Partition/Cluster clause

```sql
{ PARTITION | CLUSTER } BY <column_name> [, ...]
```

Upsolver enhances Iceberg's core functionality by supporting two data organization strategies: **Partitioning** and **Clustering**. Partitioning is a native feature of Iceberg, allowing efficient data organization to significantly improve query performance by optimizing how data is stored and accessed. Clustering, an additional feature provided by Upsolver, further optimizes performance for high-cardinality datasets by organizing related data rows within larger files.

**Note**: You can choose either **PARTITION BY** or **CLUSTER BY** for your table, allowing you to select the strategy that best fits your data's structure and usage patterns.

General

* Partitions\Clusters can be defined at table creation or later with ALTER command.&#x20;
* Supported column types include `string`, `bigint`, `Boolean`, and `date`.&#x20;
* For staging tables ingesting external data, partition columns must match incoming field names exactly and be at the root level.
* If this is a target table that you intend on writing an aggregated output to, the partition column should be one of the aggregation key columns and not an aggregated column.

**Partitioning**

```
PARTITION  BY <column_name> [, ...]
```

Each unique value within the specified columns creates a separate partition in the underlying storage, allowing queries to read only the files associated with a specific partition. This reduces the need to scan the entire dataset.

For more information, see [Working with partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html).

**Clustering**

```
CLUSTER BY <column_name> [, ...]
```

Adaptive clustering merges small files into a larger, non-partitioned file, reducing the overall file count while maintaining an organized structure. Clustering is ideal for high-cardinality datasets or when multiple partitions are being written to in parallel. Periodically, merged files may be moved into appropriate partitions if they represent a significant portion of data.

For details on usage and benefits, refer to our [Adaptive Clustering](https://docs.upsolver.com/content/articles-1/data/iceberg-adaptive-clustering) article.

## Primary key clause

```sql
PRIMARY KEY <column_name> [, ...]
```

A table's primary key column(s) contains the values that uniquely identify each row.

When using an [`INSERT`](/content/reference-1/sql-commands/jobs/create-job/transformation/insert.md) job to write to a table, values are inserted or updated based on the primary keys. This means that when writing to a table with no primary keys defined, the job only ever appends the new data that arrives; no existing values are updated.

In order to use a table in a [`MERGE`](/content/reference-1/sql-commands/jobs/create-job/transformation/merge.md) job, at least one primary key column should be defined.

Additionally, if this is a staging table that you intend on using to ingest external data with `COPY FROM`, you should not define any primary key columns.

## Order by clause

The data within a partition can be sorted by column(s) in order to gain performance. To do this, include the `ORDER KEY` clause, followed by the name of the column(s).&#x20;

{% code overflow="wrap" %}

```sql
-- Specify sort order (default is ASC)
ORDER BY <column_name> [ ASC | DESC ] [, ...] 
```

{% endcode %}

For more information, please refer to the [Apache Iceberg documentation](https://iceberg.apache.org/spec/#sorting).

The `ORDER BY` clause allows Upsolver to sort data within partitions by columns to gain performance.&#x20;

The sort order is defined by a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field has a sort direction, that can only be either `ASC` or `DESC`.&#x20;

## Table options

```sql
[ COMMENT = '<comment>' ]
[ STORAGE_CONNECTION = <connection_identifier>
  STORAGE_LOCATION = '<storage_location>' ]
[ COMPUTE_CLUSTER = <cluster_identifier> ]
[ DISABLE_COMPACTION = { TRUE | FALSE } ]
[ RETENTION_COLUMN = <column_name> ]
[ RETENTION_DURATION = <integer> DAYS ]
[ iceberg.<some.iceberg.property> = '<property_value>' ]
```

**Jump to**

* [COMMENT](#comment-editable)
* [STORAGE\_CONNECTION](#storage_connection)
* [STORAGE\_LOCATION](#storage_location)
* [COMPUTE\_CLUSTER](#compute_cluster-editable)
* [DISABLE\_COMPACTION](#disable_compaction-editable)
* [RETENTION\_C](#retention_date_partition)[OLUMN](#retention_column)
* [RETENTION\_DURATION](#retention_duration-editable)
* [ICEBERG\_TABLE\_PROPERTIES](#iceberg_table_properties)&#x20;

#### `COMMENT` — editable

Type: `text`

(Optional) A description or comment regarding this table.

#### `STORAGE_CONNECTION`&#x20;

Type: `identif`i`er`

Default: Default storage connection configured for the metastore connection this table is created under

(Optional) The storage connection associated with the [`STORAGE_LOCATION`](#storage_location) for the table's underlying files.&#x20;

Only a storage type connection can be used here (e.g. S3, Blob storage, GCS, Oracle object storage), and it should match the catalog's metastore. For example, if Glue is used as the metastore, only S3 is allowed as a storage connection.

When set, `STORAGE_LOCATION` must be configured as well to provide a path to store the data.

#### `STORAGE_LOCATION`

Type: `text`

Default: Default storage location configured for the metastore connection this table is created under

(Optional) The storage location for the table's underlying files.&#x20;

For S3, it should be provided in the format `s3://bucket_name/path_to_data`.&#x20;

This option is required when [`STORAGE_CONNECTION`](#storage_connection) is set.&#x20;

When set, `STORAGE_CONNECTION` must be configured as well to provide a connection with access to write to the specified storage location.

#### `COMPUTE_CLUSTER` — editable&#x20;

Type: `identif`i`er`

Default: The sole cluster in your environment

(Optional) The compute cluster that processes the table.

This option can only be omitted when there is just one cluster in your environment. You must specify which one to use when you have more than one compute cluster.

#### `DISABLE_COMPACTION` — editable

Type: `Boolean`

Default: `false`

(Optional) When `true`, disables the compaction process.

#### **`RETENTION_COLUMN`**

Type: `identifier`

(Optional) Data will be deleted from the table based on values in this column.

The column type must be `DATE`, `TIMESTAMP`, `TIMESTAMPTZ`, `LONG`, or `INT`.

For performance and cost reasons, it is preferable that the `RETENTION_COLUMN` is part of the partition columns defined for the table. However, in some use cases, you may want to select a retention column that is not part of your partition columns. In such cases, deletion occurs at the file level. Therefore, data will be deleted only once all rows in the file have passed the [`RETENTION_DURATION`](#retention_duration-editable).

If you select a column of type `LONG` or `INT`, Upsolver will automatically detect the unit (seconds, milliseconds, microseconds, minutes, days) in which the data is stored.

This option is stored in the Iceberg table properties under the key `upsolver.retention.column`.

#### **`RETENTION_DURATION` — editable**

Value: `<integer> DAYS`

(Optional) Data must be at least this many days old to be deleted. Number of days can range between 1 and 9999.

This option functions similarly to lifecycle policies on common blob storage services, such as Amazon S3, and is designed to save storage costs. While it does not delete data immediately upon surpassing the defined threshold, it ensures that data will be deleted at some point in the future. Once data passes the retention period, it will be scheduled for deletion, and although the exact time may vary, Upsolver aims to complete this process within a reasonable timeframe.

This option is stored in the Iceberg table properties under the key `'upsolver.retention.duration'`.

#### **`ICEBERG_TABLE_PROPERTIES`**

You can configure any Iceberg table property as documented in the [Apache Iceberg table properties documentation](https://iceberg.apache.org/docs/latest/configuration/).&#x20;

`iceberg.<some.iceberg.property> = '<property_value>'`

Note:

* The prefix `iceberg.` should be added before the name of the property.
* Properties which include the character `"-"` should be written within double quotes. For example: `iceberg.read.split."planning-lookback" = '10'`
* Iceberg table properties are strings, so the `<property_value>` should be enclosed in apostrophes.&#x20;

## Examples

### Minimum example

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.my_iceberg_table();
```

### Partitioned table example

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.my_iceberg_table()
    PARTITIONED BY $event_date;
```

### Basic example

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.orders_data
(
    order_id string,
    order_date date,
    customer_email string,
    net_total bigint, 
    num_items bigint,
    customer_address string,
    customer_city string,
    customer_state string,
    customer_zipcode bigint
)
    PARTITIONED BY order_date
    RETENTION_COLUMN = order_date
    RETENTION_DURATION = 7 DAYS
    PRIMARY KEY order_id
    ORDER BY net_total DESC, order_date ASC
    STORAGE_CONNECTION = s3_connection
    STORAGE_LOCATION = 's3://bucket/storage_location'
    COMPUTE_CLUSTER = "my cluster"
    DISABLE_COMPACTION = false
    iceberg.read.split."planning-lookback" = '10'
    COMMENT = 'Orders table';
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://upsolver.gitbook.io/content/reference-1/sql-commands/iceberg-tables/upsolver-managed-tables/create-iceberg-table.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
