> For the complete documentation index, see [llms.txt](https://upsolver.gitbook.io/content/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://upsolver.gitbook.io/content/how-to-guides-1/jobs/advanced-use-cases/enriching-data-amazon-s3-to-clickhouse.md).

# Enriching Data - Amazon S3 to ClickHouse

A common use case for creating a transformation job is the need to enrich an existing analytics table with data from production systems or third parties. In this guide, we will use ClickHouse for our target database, but the process for creating a [transformation](/content/reference-1/sql-commands/jobs/create-job/transformation.md) job is almost identical for other data platforms. &#x20;

ClickHouse is an OLAP engine and is the fastest and most efficient open-source database for real-time analytics. Users gain the most benefit when data is loaded into one of its specialized table engines, and ClickHouse Cloud can process, store, and analyze billions of rows in seconds.

While you can use query federation and the ClickHouse native S3 integration to query data, you won't benefit from the performance gains to be made by loading the data directly into ClickHouse, which can return queries in milliseconds. An alternative would be to use the S3 or S3Cluster table functions, which are easy enough to implement, but this ease is generally offset by the manual tuning that is required - along with in-depth knowledge of the data structure - to avoid mis-configuration. In addition, users must create and administer a scheduled process to refresh the data.

Upsolver removes all the manual intervention and complexity, and the need to "figure things out" by offering a simple command to ingest data from S3 into your ClickHouse table.&#x20;

## What you will learn

In this guide, you will learn how to connect to your source and target data. Then you will discover how to create a transformation job that loads the data into the destination table in ClickHouse ready for you to analyze. Once loaded into the table, you will benefit from the performance gains delivered by ClickHouse's table engines. &#x20;

***

## Prerequisites

Upsolver only runs in your AWS Virtual Private Cloud (VPC). Learn how to [Start Your Free Trial](/content/get-started/start-your-free-trial.md) and [Deploy Upsolver on AWS](/content/how-to-guides-1/setup/deploy-upsolver-on-aws.md).

When you first deploy Upsolver in the customer VPC, you create an Identity and Access Management (IAM) role that gives you access to any AWS resource you might need to build data pipelines. See [AWS Role Permissions](/content/how-to-guides-1/setup/deploy-upsolver-on-aws/aws-role-permissions.md) for more information.&#x20;

In this guide, the cluster has permission to read and write to the AWS Glue Data Catalog and to S3.

The steps for enriching an existing table in ClickHouse are as follows:

1. Connect to Amazon S3
2. Connect to ClickHouse
3. Create a transformation job to load the data into ClickHouse&#x20;

***

## Step 1

### Create a connection to your data sources in S3

This connection enables you to configure the [AWS IAM](/content/how-to-guides-1/setup/deploy-upsolver-on-aws/aws-role-permissions.md) credentials that Upsolver needs to access the data. Use your Access Key ID and your Secret Access Key to connect to your specific data lake.

**Here’s the code:**

```sql
CREATE S3 CONNECTION my_s3_connection
    AWS_ROLE = 'arn:aws:iam::111111111111:role/<upsolver-role-*'
    EXTERNAL_ID = '12345678'
    READ_ONLY = TRUE;
```

For future reference, your `AWS_ROLE` can be copied from your [AWS IAM](/content/how-to-guides-1/setup/deploy-upsolver-on-aws/aws-role-permissions.md) user page. After you complete this step, you should find the **my\_s3\_connection** connection in your navigation tree.

Every connection you create is visible and available to all users within your organization, and always on, meaning you don't need to recreate or reconnect when you create subsequent jobs.

## Step 2

### Create a connection to ClickHouse

The next step is to create a connection to your target ClickHouse database.&#x20;

**Here's the code:**

```sql
CREATE CLICKHOUSE CONNECTION my_clickhouse_connection
    CONNECTION_STRING = 'http://x.us-east-1.aws.clickhouse.cloud:8123/sales_db'
    USER_NAME = 'my_username'
    PASSWORD = 'my_password'
    COMMENT = 'Connection to Sales database';
```

In the above example, a connection named **my\_clickhouse\_connection** is created to access the **sales\_db** database. An optional comment is added to inform other users as to what this connection does and this is visible in the [connections system table](/content/reference-1/monitoring/system-catalog/information-schema/connections.md).  &#x20;

## Step 3

### Create a transformation job

Now that you have a connection to both your source and target systems, you can create a job to transform and load the data into ClickHouse.

**Here's the code to create a transformation job:**&#x20;

```sql
CREATE SYNC JOB transform_and_load_to_clickhouse
  START_FROM = BEGINNING
  RUN_INTERVAL = 1 MINUTE
  RUN_PARALLELISM = 20
AS INSERT INTO CLICKHOUSE my_clickhouse_connection.sales_db.target_tbl 
  MAP_COLUMNS_BY_NAME
    SELECT 
      orders,
      MD5(customer_email) AS customer_id,
      ARRAY_SUM(data.items[].quantity * data.items[].price) AS net_total
    FROM default_glue_catalog.lake_db.raw_order_events
    WHERE TIME_FILTER(); 
```

In the example above, we create a job named **transform\_and\_load\_to\_clickhouse**, which we then customize using some of the available job options. The first option, `START_FROM`, instructs Upsolver on what point in time to start ingesting the data from. We have specified **BEGINNING**, so all data will be loaded into ClickHouse. You can also specify a date, or **NOW** to only ingest data starting from when the job starts running.&#x20;

Next, the `RUN_INTERVAL` option tells the job to load changes into the target every **1 MINUTE**, though this is a highly configurable option, and can be set to your exact requirements.

The `RUN_PARALLELISM` option controls how many jobs run in parallel to process each minute of data from the source table, and has been set to **20**.

Then, the `INSERT INTO` statement defines where we want to load the data; in our case we want the **target\_tbl** table in the **sales\_db** database, using the **my\_clickhouse\_connection** connection we created in [Step 2](#step-2).

The second half of the query defines the data we want to transform and load. The `MAP_COLUMNS_BY_NAME` keyword specifies that the columns in the `SELECT` list must map to the same column names in the target. If we excluded this keyword, columns would be mapped on their ordinal position.

The `SELECT` statement defines the columns we want to load into the **target\_tbl** table. To protect sensitive data, the **customer\_email** column has been masked using the `MD5()` hashing function, so the value will not be readable in the target. The `ARRAY_SUM()` function has been used to derive the **net\_total** value of the order.&#x20;

Finally, in the `WHERE` clause, we include the `TIME_FILTER()` function to specify the window over which the aggregation is calculated. As we have not passed a value in to the function, it takes the default specified in the `RUN_INTERVAL` option, in our case, **1 MINUTE**.&#x20;

This is a simple example to demonstrate how to create a transformation job to enrich your target table, however, Upsolver's library includes an extensive list of [Functions](/content/reference-1/functions-and-operators/functions.md) and [Operators](/content/reference-1/functions-and-operators/operators.md) that can be applied to your jobs to ensure your data lands in your target exactly how you want it.

***

## Conclusion

In this guide you learned that, in some cases, it is best to load data into ClickHouse to benefit from the super fast query performance offered by a database specifically geared up for high-scale analytics. You learned how to create a connection to your Amazon S3 source, a connection to your target ClickHouse database, and then how to create a transformation job to continuously load the data.  &#x20;

***

## Try it yourself

To create a transformation job to load your data into ClickHouse:

1. Create a connection to your [Amazon S3](/content/reference-1/sql-commands/connections/create-connection/amazon-s3.md) source
2. Create a connection to your [ClickHouse](/content/reference-1/sql-commands/connections/create-connection/clickhouse.md) database
3. Write a [transformation job](/content/reference-1/sql-commands/jobs/create-job/transformation/job-options/clickhouse.md) to copy data from S3 to ClickHouse using Upsolver's  [Functions](/content/reference-1/functions-and-operators/functions.md) and [Operators](/content/reference-1/functions-and-operators/operators.md) to transform your data
