> For the complete documentation index, see [llms.txt](https://upsolver.gitbook.io/content/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://upsolver.gitbook.io/content/articles-1/data/schema-evolution/iceberg-schema-evolution.md).

# Iceberg Schema Evolution

Schema evolution is critical for systems that need to handle data that can change over time. The rules outlined below provide a structured approach to managing varying types and dimensions of data for a field, as it evolves from one data type to another.&#x20;

The guiding principle is to have all data saved to storage without losing precision, enabling users to identify cases where their data types differ for a given field. This approach ensures that data integrity is maintained while accommodating changes in data types and structures, and is essential in environments where the schema is dynamic.

## Guiding Principles

1. All data is written to the table.
2. Data doesn't lose precision when it written to the table.
3. When data is ingested with conflicting data types, it is written in a way that enables users to identify and fix changes. Essentially, Upsolver provides full flexibility to manage dynamic changes.

## End Result

Assuming the data contains a field with varying types, Upsolver attempts to define a more suitable type. However, when this fails, Upsolver adds an extra field with the type suffix to identify the evolved type. You can use the [Written Data Statistics](/content/reference-1/monitoring/datasets/ingested-data.md#written-data-statistics) report in [Datasets](/content/reference-1/monitoring/datasets.md) to understand your table. For example:

<figure><img src="/files/fUokJkr7ahc5qPf0FR54" alt=""><figcaption><p>Written Data Statistics in Datasets displays evolved columns</p></figcaption></figure>

In this case, the column **my\_field** was initially discovered with numeric values and, later on, string values were found. Therefore, a new field named **my\_field\_string** was defined and will be populated with all future values. The original field, **my\_field**, will be populated when the values are numeric or, more specifically, whole numbers.

When this situation arises, you can decide how to handle the varying types. For instance, one solution would be to modify queries on this table to use a `COALESCE` function:

```sql
COALESCE(my_field_string, CAST(my_field AS VARCHAR)) AS my_field
```

## How Schema Evolution Behaves

1. **Initial Field Creation**: When a field is encountered for the first time, the system uses the widest type and largest dimension. For example, if the first data type for field `a` is `[][][]`, this becomes the main representation. Please note that when you create a table in Upsolver you can define fields as part of the table creation. That definition always precedes any fields Upsolver might automatically detect in the data.
2. **Handling Narrower Types**: If a narrower data type is later encountered for an existing field, it is converted to fit into the existing field. For example, if `a::long` is discovered when `a::string` is already defined in the table, the long data is converted to a string. See the [Reference](#reference-calculating-the-widest-type-for-primitives) below describing how this works.
3. **Wider or Non-Convertible Types**: If a wider or non-convertible type is encountered, a new field is created for this type. For example, if `a::boolean` already exists and `a::date` is now added, new fields for `a_string::string` and `a_date::date` are created. The reasoning behind these two fields is:
   1. We do not want to alter the existing field, so it should remain as it is and will be populated when Boolean values are found in the data.
   2. `a_string::string` is added as a common denominator field into which all types of primitives can be written.
   3. `a_date::date` is added only for date values, as specific type fields are always added to make it easier to address any potential data cleaning.
4. **Records Vs. Arrays Vs. Primitives**: The order of precedence when deciding which type receives the original name is as follows:
   1. Regardless of type, an existing field always retains its name. This means that if in a task time of the related job we see data with one type, and in a later task time we encounter another type, the first data type will already be used for the column that is created in Iceberg. It remains as is, and the next data type might trigger the creation of an additional field.
   2. Records.
   3. Arrays - max dimension and widest type.
   4. Primitives - widest type (a private case of an array with 0 dimensions).
5. **Field Naming Conventions**: The naming convention includes a suffix indicating the type and dimension. For instance, `coordinates[][][]:bigint` becomes `coordinates_array3_long`.
6. **Names Must be Non Empty.** Although the Iceberg spec allows empty names (an empty string), many query engines will fail if you have such fields. Therefore, Upsolver will drop any such fields.
7. **Field Documentation**: When a field is added with a type suffix, Upsolver also creates a field documentation in the form of `upsolver_evolved_from:<field_name>`. For instance, let's say we first see a field named **size**, with the value **4**. We would define it with a type of `bigint`, which translates to `long` in Iceberg. Then, during the next task time, the data contains the same field with a value of **2.3**. Upsolver then adds a field named **size\_double** with type `double`, and the field documentation, **upsolver\_evolved\_from:size**. Upsolver uses this documentation to distinguish between cases where a type suffix exists because that is how the field was defined, e.g. **start\_date** of type `date`, and cases such as the above example where it was added due to schema evolution.
8. **Routing of Values**: Values are routed to fields with the exact type and to all other compatible fields. This means converting to a wider type as well as an array with a greater dimension. So, for example, if we have the following types for the same field:

   1. `long`
   2. `date`
   3. `string`
   4. `date[]`
   5. `string[]`

   Then all values can, and will, be routed to the last type of `string[]`.
9. **Edge Case #1: Handling Double and Long Types**: When a new field is encountered for the first time and has both `double` and `long` types, only the `double` field is used, and no extra field for long values is needed. This is a special exception to the other rules.
10. **Edge Case #2: Handling Timestamp and Long Types**: Contrary to the general rule of maintaining precision, `long` values can be converted to `timestamps`. In practice, this will happen when a field first appears as a timestamp and then as a long. Please note, this only refers to routing of data, meaning long data can be written into timestamp fields when they exist for the same field. When you have the two types together within the same task time for the same field, you will still get two fields, unlike the case of long and double. You also end up with two fields if the timestamp field is found before the long field, but not vice versa. For example, task time `t` has the field **started** as a `timestamp` field. Then task time `t+1` has it as `long`. A new field named **started\_long** will be defined, but **started** will also be populated.

## Reference - Calculating the Widest Type for Primitives

### Guiding Principles

The general principle is that types are converted to wider types so precision is not lost. An exception to this rule is allowing `bigint` types to be converted into `timestamp` values, because this conversion is ubiquitous.

Furthermore, for some type combinations, such as `date` and `Boolean,` neither of the types can be considered wider than the other. In such cases, a `varchar` type is used as the common denominator.

| Original Type    | Wider Than           | Narrower Than                   | Incompatible With |
| ---------------- | -------------------- | ------------------------------- | ----------------- |
| Boolean          | Nothing              | Long, Double, String            | Date, Timestamp   |
| Long (Bigint)    | Boolean, Timestamp\* | Double, String                  | Date, Timestamp   |
| Double           | Boolean, Long        | String                          | Date, Timestamp   |
| Date             | Nothing              | Long, Double, Timestamp, String | Boolean           |
| Timestamp        | Date                 | Long, Double, String            | Boolean           |
| String (Varchar) | Everything else      | Nothing                         | Nothing           |

\*While a `long` value is not narrower than a `timestamp`, it can be converted into a `timestamp`.

## Examples

### Example 1

In the first interval of data, the field `a[]::long` exists. Then, in a later interval, the data contains the field `a::string`.

The widest type between `long` and `string` is `string`

The largest dimension between **0**, for the primitive `a::string,` and **1** for the array `a[]::long`, is **1**.

Therefore two new fields are defined:

1. `a_string::string` - this will contain future values that have a single dimension. Because `string` is the widest primitive type, this means that even if the data suddenly contains `a::date` or `a::boolean`, for example, the Boolean will be converted into string and stored in this field.
2. `a_array_string[]::string` - all future values that are either primitive or arrays of any primitive type with a dimension of **1** will be converted and stored into this field.

Furthermore, the original field still exists and continues to be populated with values.

Future values that can be converted to `long` and are contained in an array with a dimension of **1**, or a primitive that has a type that is convertible to long, will be stored in the original field `a[]::long`.

### Example 2

All of the following fields arrive together for the first time:

```sql
a.b::string
a.b[]::string
a.b[][]::string
a.b[][][]::string
```

The following fields will be defined, and the largest dimension acquires the original name:

```sql
a.b_string::string
a.b_array_string[]::string
a.b_array2_string[][]::string
a.b[][][]::string
```

Explanation: the largest dimension takes the original name, but a field for each discovered type is defined.

### Example 3

The following fields arrive together for the first time:

```
a.b::boolean
a.b::long
a.b::double
```

The following fields will be defined, and the widest type can accept all values:

```
a.b::boolean
a.b::double
```

Explanation: we define a field for each type discovered, except when finding a `long` and a `double`, in which case only a `double` is defined.

### Example 4

Data arrives with the following fields:

```
a.b.c::string
a.b[].x::string
a.b[][].y::string
a.b[][][].z::string
```

The following fields are defined:

```
a.b[][][].c::string
a.b[][][].x::string
a.b[][][].y::string
a.b[][][].z::string
a.b_record.c::String
a.b_array_record[].x::string
a.b_array2_record[][].y::string
```

Explanation: the largest dimension receives all nested fields. Smaller dimensions then have an extra field specific to where the data appeared originally when ingested by Upsolver.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://upsolver.gitbook.io/content/articles-1/data/schema-evolution/iceberg-schema-evolution.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
