> For the complete documentation index, see [llms.txt](https://upsolver.gitbook.io/content/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://upsolver.gitbook.io/content/articles-1/get-started/core-concepts/schema-detection-and-evolution.md).

# Schema Detection and Evolution

## Schema detection <a href="#uvts462fnmr" id="uvts462fnmr"></a>

### Data ingestion <a href="#n0e3y2cvsnmk" id="n0e3y2cvsnmk"></a>

When you create a table in Upsolver to stage your data, you have the option of defining column names with specified types, or you can create a table with no columns.

Regardless, the ingestion job dynamically expands the staging table's column list as new columns are encountered during ingestion and Upsolver automatically infers the column type.

The primary difference between the explicitly defined columns and the dynamically added columns lies in how Upsolver treats its data types when new data arrives in an unexpected type.

See [Schema evolution](#id-94c1ptj58hfe).

Note that column names are case-insensitive. Therefore if you have multiple columns with the same name in different cases (e.g. **column\_name**, **Column\_Name**, **COLUMN\_NAME**, etc.), it is non-deterministic as to which column's data ends up in your table.

### Data transformation <a href="#id-965fn6zhik1p" id="id-965fn6zhik1p"></a>

For Upsolver transformation jobs writing to data lake tables, there is a job option named `ADD_MISSING_COLUMNS` that allows you to configure whether or not the target table's column list should be dynamically expanded based on the data being written.

By default, this option is **FALSE**, meaning that any column written by the job must already be defined within the target table; otherwise, the job fails. Furthermore, you cannot have a `SELECT *` statement within your transformation job. This applies to jobs writing to Snowflake tables, where `ADD_MISSING_COLUMNS` is not an available job option.

However, when `ADD_MISSING_COLUMNS` is **TRUE**, the target table's columns list expands as new columns are encountered in the same way as data ingestion.

Note that column names are case-insensitive. This means that if you have multiple columns with the same name in different cases (e.g. **column\_name**, **Column\_Name**, **COLUMN\_NAME**, etc.), it is non-deterministic as to which column's data ends up in your table.

## Schema evolution <a href="#id-94c1ptj58hfe" id="id-94c1ptj58hfe"></a>

### Data ingestion <a href="#ovultg3g1shi" id="ovultg3g1shi"></a>

#### Unexpected data type <a href="#jskoitqrd65v" id="jskoitqrd65v"></a>

There are times when you may have unexpected values arrive in your raw data; for example, the values in your staging table's **nettotal** column have historically been doubles, but suddenly the **nettotal** from a new event that arrives is a string.

If you had explicitly defined the **nettotal** column type to be a double when creating the staging table, then the column type does not change and the new event's **nettotal** shows up as null when you query the data.

However, Upsolver stores all of your data in its original type; this means that you are able to recover the null value by creating a transformation job that casts the **nettotal** column to a string.

If this **nettotal** column had not been explicitly defined and was instead added dynamically through Upsolver's schema detection where the data type was inferred, in this case, the **nettotal** column is automatically upcast to be a string instead of a double.

The order of Upsolver's upcast is as follows:

`Boolean` → `date` → `timestamp` → `bigint` → `decimal` → `double` → `string`

Note that to have a decimal type column, it must be defined when creating the table; Upsolver does not infer data to be of that type.

Additionally, if a column contains Boolean values as well as date or timestamp values, then the column is upcast directly to a string as there is no appropriate cast between them.

#### Column name change <a href="#lroutmxweszu" id="lroutmxweszu"></a>

As mentioned earlier, column names in Upsolver are case insensitive. Therefore if you have multiple columns with the same name in different cases (e.g. **column\_name**, **Column\_Name**, **COLUMN\_NAME**, etc.), it is non-deterministic as to which column's data ends up in your table.

However, this also means that for any column name changes in your raw data, so long as it is only a case-wise change (e.g. from **column\_name** to **Column\_Name**), then your staging table remains unaffected.

If the change is beyond a case change (e.g. from **column\_name** to **columnname**), then it is added as a new column in your staging table.

Note that you can easily combine these old and new columns by using `COALESCE` in a transformation job.

### Data transformation <a href="#dapcy3stzqjk" id="dapcy3stzqjk"></a>

#### Unexpected data type <a href="#kh20kqzd5r4" id="kh20kqzd5r4"></a>

When performing a simple `SELECT` statement in your transformation job (meaning that a column is only selected without having any additional transformations applied), data that arrives in an unexpected type is handled the same way as staging jobs.

Take for example the case where values in your **nettotal** column have historically all been doubles, but suddenly the **nettotal** from a new event arrives as a string.

If you had explicitly defined the **nettotal** column type to be double when creating the target table, then the column type does not change and the new event's **nettotal** shows up as null when you query the data.&#x20;

However, Upsolver stores all of your data in its original type, meaning that you are able to recover the null value by casting the **nettotal** column to a string in another transformation job.

If this **nettotal** column had not been explicitly defined and was instead dynamically added through Upsolver's schema detection where the data type was inferred, in this case, the `nettotal` column is automatically upcast to be a string instead of a double.

The order of Upsolver's upcast is as follows:

`Boolean` → `date` → `timestamp` → `bigint` → `decimal` → `double` → `string`

Note that to have a decimal type column, it must be defined when creating the table; Upsolver does not infer data to be of that type.

Additionally, if a column contains Boolean values as well as date or timestamp values, then the column is upcast directly to a string as there is no appropriate cast between them.

However, in the case where columns that receive unexpected values have additional transformations applied on top of them, Upsolver's behavior differs slightly. This is due to the fact that certain transformations expect certain data types.

Looking again at our **nettotal** column example, let's say that the transformation job contains `ABS(`**`nettotal`**`)`. Even if the **nettotal** column type had not been explicitly defined (meaning that Upsolver would typically upcast it to a string due to the new data that arrived), since the `ABS` function requires a numeric input, Upsolver continues to cast the data as a double in order to perform the transformation.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://upsolver.gitbook.io/content/articles-1/get-started/core-concepts/schema-detection-and-evolution.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
