IMPORT (Experimental)

Warning:
CockroachDB v1.1 is no longer supported as of April 12, 2019. For more details, refer to the Release Support Policy.

The IMPORT statement imports tabular data (e.g., CSVs) into a single table.

Warning:
This is an experimental feature. To enable it, run the following statement:
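
> SET CLUSTER SETTING experimental.importcsv.enabled = true;
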
Note:
For details about importing SQL dumps, see Import Data.

Glossary

Import file: The tabular data file you want to import.

Processing node: The single node that processes the IMPORT statement.

Temp directory: A location where the processing node stores the CockroachDB-compatible key-value data it converts from the import file. This directory must be available to all nodes using the same address (i.e., it cannot use the processing node's local file storage).

Functional Overview

Because importing data is a complex task, it can be useful to have a high-level understanding of the process.

  1. A single node receives the IMPORT request and becomes the processing node.
  2. The processing node streams the import file, converting its contents into CockroachDB-compatible key-value data.
  3. As the key-value data is generated, the node stores it in the temp directory.
  4. Once the entire import file has been converted to key-value data, relevant nodes import key-value data from the temp directory.

After the import has completed, you should delete the converted key-value files from your temp directory.

Preparation

Before using IMPORT, you should have:

  • The schema of the table you want to import.
  • The tabular data you want to import (e.g., CSV), preferably hosted on cloud storage.
  • A location to store data before it is fully imported into all your nodes (referred to in this document as a "temp" directory). This location must be accessible to all nodes using the same address (i.e., cannot use a node's local file storage).

    For ease of use, we recommend using cloud storage. However, if that isn't readily available to you, we also have a guide on easily creating your own file server.

Details

Import Targets

Imported tables must not exist and must be created in the IMPORT statement. If the table you want to import already exists, you must drop it with DROP TABLE.
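
For example, using a hypothetical customers table:

> DROP TABLE customers;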

You can only import a single table at a time.

Create Table

Your IMPORT statement must include a CREATE TABLE statement (representing the schema of the data you want to import) using one of the following methods:

  • A reference to a file that contains a CREATE TABLE statement
  • An inline CREATE TABLE statement

We also recommend specifying all secondary indexes you want to use in the CREATE TABLE statement. It is possible to add secondary indexes later, but it is significantly faster to specify them during import.

Object Dependencies

When importing tables, be mindful of the following rules, because IMPORT creates only a single, new table that must not already exist:

  • Objects that the imported table depends on must already exist
  • Objects that depend on the imported table can only be created after the import completes
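
A minimal sketch of this ordering, using a hypothetical import_db database and customers table:

> CREATE DATABASE IF NOT EXISTS import_db; -- the object the table depends on (its database) must exist first

-- ...run the IMPORT statement for import_db.customers (see the examples below)...

> CREATE VIEW import_db.customer_names AS SELECT name FROM import_db.customers; -- dependent objects only after the import completes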

Operational Requirements & Concerns

Because IMPORT has a number of moving parts, there are several operational concerns when executing the statement, the most important of which is ensuring that the processing node can execute IMPORT successfully.

Choose Node to Process Request

Because of IMPORT's current implementation, the entire task is executed on a single node. If your deployment is not entirely symmetric, sending the request to a random node might have undesirable effects. Instead, we recommend bypassing any load balancers, connecting to a machine directly, and running the IMPORT statement on it.

It's important to note, though, that after the single machine creates the CockroachDB-compatible key-value data, the process of importing the data is distributed among nodes in the cluster.

Note:
Future versions of IMPORT will let you distribute the entire process among many nodes.

Available Storage Requirements

The node's first-listed/default store directory must have available storage equal to or greater than the size of the file you're importing.

If you set --max-disk-temp-storage when running cockroach start, its value must also be greater than the size of the file you're importing.

For example, if you're importing approximately 10GiB of data, the node that ends up running the IMPORT command must have at least 10GiB of available storage in its store directory.

Import File Location

You can store the tabular data you want to import using either a node's local storage or remote cloud storage (Amazon S3, Google Cloud Platform, etc.).

For simplicity's sake, we highly recommend using cloud/remote storage for the data you want to import.

However, if you do want to store the file locally to import it, there are a number of things to understand.

Importing Data From Local Storage

Note:
Because you must have remote/cloud storage available to complete the IMPORT process, we recommend using it instead of local file storage.

If you do not have access to cloud storage, you can easily create a file server using this guide.

Because CockroachDB is designed as a distributed system, using locally stored files successfully requires some understanding. Though we do not recommend this process, if you do want to use a locally stored file, the following procedure is likely to cause you the fewest headaches:

  1. Ensure the node you want to use has available storage space at least 2x the size of the data you want to import; 1x for the file itself, and 1x for the converted key-value data.

    For example, if you want to import 10GiB of data, your node needs 20GiB of available storage.

  2. Upload the tabular data file to a single node, and then connect to that node.

  3. Execute the IMPORT statement, referencing the locally stored file with the nodelocal prefix, e.g., nodelocal://backup.csv (see the sketch after this list).

    However, the "temp" directory you choose must use a location available to all nodes in the cluster (i.e., you cannot use local file storage). You will need to use either cloud storage, a custom HTTP server, or NFS connected to all nodes in the cluster.

Temp Directory

To distribute the data you want to import to all nodes in your cluster, the IMPORT process requires that the CockroachDB-compatible key-value data be stored in a location that is accessible to all nodes in the cluster using the same address. To achieve this, you can use:

  • Cloud storage, such as Amazon S3 or Google Cloud Platform
  • Network file storage mounted to every node
  • HTTP file server
Note:
If you do not currently have any of these options available, you can easily create a file server.

The temp directory must have at least as much storage space as the size of the data you want to import.

Temp Directory Cleanup

After completing the IMPORT process, you must manually remove the key-value data stored in the temp directory.

Table Users and Privileges

Imported tables are treated as new tables, so you must GRANT privileges to them.
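
For example, assuming a hypothetical maxroach user and an imported customers table:

> GRANT SELECT, INSERT ON TABLE customers TO maxroach;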

Performance

Currently, IMPORT uses a single node to convert your tabular data into key-value data, which means the node's CPU and RAM will be partially consumed by the IMPORT task in addition to serving normal traffic.

Later steps of the import process distribute work among many nodes and have less impact on the nodes' resources.

Synopsis

IMPORT TABLE table_name
    { CREATE USING create_table_file | ( table_elem_list ) }
    CSV DATA ( file_to_import [, ...] )
    WITH kv_option [, ...]

Required Privileges

Only the root user can run IMPORT.

Parameters

table_name: The name of the table you want to import/create.

create_table_file: The URL of a plain text file containing the CREATE TABLE statement you want to use (see this example for syntax).

table_elem_list: The table definition you want to use (see this example for syntax).

file_to_import: The URL of the file you want to import.

WITH kv_option: Control your import's behavior with these options. The temp option (which represents the temp directory's URL) is required.
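
For reference, here is a sketch mapping each parameter onto a concrete statement (reusing the hypothetical Azure URLs from the examples below), with each part labeled in a comment:

> IMPORT TABLE customers -- table_name
CREATE USING 'azure://acme-co/customer-create-table.sql?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co' -- create_table_file
CSV DATA ('azure://acme-co/customer-import-data.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co') -- file_to_import
WITH
    temp = 'azure://acme-co/temp/?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co' -- kv_option (temp is required)
;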

Import File & Temp Directory URLs

URLs for the file you want to import and your temp directory must use the following format:

[scheme]://[host]/[path]?[parameters]
Location        scheme     host                   parameters
Amazon S3       s3         Bucket name            AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
Azure           azure      Container name         AZURE_ACCOUNT_KEY, AZURE_ACCOUNT_NAME
Google Cloud 1  gs         Bucket name            N/A
HTTP 2          http       Remote host            N/A
NFS/Local 3     nodelocal  File system location   N/A

Considerations

  • 1 GCS connections use Google's default authentication strategy and only support instance auth.

  • 2 You can easily create your own HTTP server with Caddy or nginx.

  • 3 Because CockroachDB is a distributed system, you cannot meaningfully store import files "locally" on nodes. The entire import file must be stored in a single location, so attempts to store it locally must point to an NFS drive to be useful. If using NFS for your temp directory, each node in the cluster must have access to the NFS using the same URL.

  • The location parameters often contain special characters that need to be URI-encoded. Use Javascript's encodeURIComponent function or Go language's url.QueryEscape function to URI-encode the parameters. Other languages provide similar functions to URI-encode special characters.

Import Options

You can control the IMPORT process's behavior using any of the following key-value pairs as a kv_option.

temp

A directory accessible by all nodes, which is used to store the CockroachDB-compatible key-value data before all nodes import the data.

Required? Yes
Key temp
Value The URL of the temp directory
Example WITH temp = 'azure://acme-co/import-temp?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co'

delimiter

If you are not using a comma as your column delimiter, you can specify another Unicode character as the delimiter.

Required? No
Key delimiter
Value The Unicode character that delimits columns in your rows
Example To use tab-delimited values: WITH temp = '...', delimiter = e'\t'

comment

Do not import rows that begin with this character.

Required? No
Key comment
Value The Unicode character that identifies rows to skip
Example WITH temp = '...', comment = '#'

nullif

Convert values to SQL NULL if they match the specified string.

Required? No
Key nullif
Value The string that should be converted to NULL
Example To use empty columns as NULL: WITH temp = '...', nullif = ''

Examples

Use Create Table Statement from a File

> IMPORT TABLE customers
CREATE USING 'azure://acme-co/customer-create-table.sql?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co'
CSV DATA ('azure://acme-co/customer-import-data.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co')
WITH
    temp = 'azure://acme-co/temp/?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co'
;

Use Create Table Statement from a Statement

> IMPORT TABLE customers (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name TEXT,
        INDEX name_idx (name)
)
CSV DATA ('azure://acme-co/customer-import-data.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co')
WITH
    temp = 'azure://acme-co/temp/?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co'
;

Import a Tab-Separated File

> IMPORT TABLE customers (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name TEXT,
        INDEX name_idx (name)
)
CSV DATA ('azure://acme-co/customer-import-data.tsv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co')
WITH
    temp = 'azure://acme-co/temp/?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',
    delimiter = e'\t'
;

Skip Commented Lines

> IMPORT TABLE customers (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name TEXT,
        INDEX name_idx (name)
)
CSV DATA ('azure://acme-co/customer-import-data.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co')
WITH
    temp = 'azure://acme-co/temp/?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',
    comment = '#'
;

Use Blank Characters as NULL

> IMPORT TABLE customers (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name TEXT,
        INDEX name_idx (name)
)
CSV DATA ('azure://acme-co/customer-import-data.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co')
WITH
    temp = 'azure://acme-co/temp/?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',
    nullif = ''
;
