Creating an external table in Databricks can be done using various methods: via the Databricks UI, using Databricks SQL, or through APIs. Below, we explore each method to help you get started efficiently.
1. Creating an External Table via Databricks UI
The Databricks UI offers a straightforward way to create external tables without needing to write SQL code manually. Here’s how you can do it:
Step-by-Step Process:
Open the Databricks Workspace:
- Navigate to your Databricks workspace and select the Data tab on the left-hand menu.
Click on “Create Table”:
- Choose the option to Create Table, and select From Data Lake. This allows you to select files directly from Azure Data Lake Storage (ADLS), Amazon S3, or other supported cloud storage services.
Select Your Data Source:
- Choose the cloud storage location where your files are stored (e.g., Azure Data Lake).
- Browse to your data files, or enter the path to them manually.
Define the Schema:
- Databricks will automatically infer the schema based on the selected data. You can manually adjust column types if needed.
Configure Table Options:
- Assign a name to your table, select the format (CSV, Parquet, JSON, etc.), and set additional options like whether the file has headers.
Confirm and Create:
- Click Create Table. Databricks will create the table and display it under the Data tab.
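Once the table appears, it is worth confirming that Databricks registered it as an external table pointing at your storage path. A minimal check you can run in a notebook or the SQL Editor (my_external_table stands in for whatever name you assigned in the UI):
-- Shows the table's location, provider, and whether its Type is EXTERNAL
-- ('my_external_table' is a placeholder for the name you chose)
DESCRIBE TABLE EXTENDED my_external_table;

-- Quick sanity check that the underlying files are readable
SELECT * FROM my_external_table LIMIT 10;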
For further details, you can also refer to Microsoft’s official documentation on external tables.
2. Creating an External Table Using Databricks SQL
For those who prefer SQL commands, Databricks allows you to create external tables using SQL directly in a notebook or Databricks SQL Editor.
Example Code:
CREATE TABLE external_table_example (
  id INT,
  name STRING,
  purchase_date DATE,
  amount DOUBLE
)
USING parquet
OPTIONS (
  path 'abfss://your-container@your-storage-account.dfs.core.windows.net/data-folder/'
);
Steps:
- Open a Databricks Notebook or SQL Editor.
- Adjust the path and other options to match your storage location and file format (a CSV variant is sketched below).
- Run the SQL command to create your external table.
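Note that some options only apply to certain formats: header and inferSchema are meaningful for text-based sources such as CSV, whereas Parquet files carry their own schema. As a rough sketch, a CSV-backed external table could be defined like this (the container, storage account, and folder names are placeholders):
CREATE TABLE external_csv_example
USING csv
OPTIONS (
  -- placeholder path; point this at your own container and folder
  path 'abfss://your-container@your-storage-account.dfs.core.windows.net/csv-folder/',
  header 'true',      -- first row of each file holds column names
  inferSchema 'true'  -- let Databricks infer column types from the data
);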
Tip: Refresh the table regularly with:
REFRESH TABLE external_table_example;
This ensures any newly added files are recognized when you query the table.
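If the files in storage are organized into partition folders (for example purchase_date=2024-01-01/), partition directories added after the table was created may also need to be registered before they show up in query results. A sketch, assuming the table above were partitioned:
-- Scan the table's storage location and register any partition
-- directories the metastore does not yet know about
MSCK REPAIR TABLE external_table_example;

-- Equivalent Spark SQL form
ALTER TABLE external_table_example RECOVER PARTITIONS;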
3. Creating an External Table Using Databricks APIs
The Databricks REST API is useful when table creation needs to be automated, for example from a deployment pipeline. One way to do this is to submit the same CREATE TABLE statement through the SQL Statement Execution API against a running SQL warehouse, as in the Python sketch below (the workspace URL, access token, warehouse ID, and storage path are placeholders you need to supply):
import requests

# Placeholders: replace with your workspace URL, a personal access token,
# and the ID of a running SQL warehouse.
databricks_instance = "https://your-databricks-instance"
api_token = "your-api-token"
warehouse_id = "your-warehouse-id"

# The same kind of CREATE TABLE statement used in the SQL example above
create_statement = """
CREATE TABLE IF NOT EXISTS default.external_orders_data
USING parquet
OPTIONS (
  path 'abfss://your-container@your-storage-account.dfs.core.windows.net/orders-data/'
)
"""

# Submit the statement to the SQL Statement Execution API
response = requests.post(
    f"{databricks_instance}/api/2.0/sql/statements/",
    headers={
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    },
    json={
        "warehouse_id": warehouse_id,
        "statement": create_statement,
        "wait_timeout": "30s",
    },
)

if response.status_code == 200:
    state = response.json().get("status", {}).get("state")
    print(f"Statement finished with state: {state}")
else:
    print(f"Error: {response.status_code} - {response.text}")
Replace the placeholders (workspace URL, API token, SQL warehouse ID, and storage path) with your own values before running the script.
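Whichever method you use, a quick query confirms that the new table is registered and readable (the schema and table names match the API example above):
-- Confirm the table exists in the target schema
SHOW TABLES IN default LIKE 'external_orders_data';

-- Preview a few rows read from the external location
SELECT * FROM default.external_orders_data LIMIT 10;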
Conclusion
Databricks offers several ways to create and manage external tables, giving you flexibility for different workflows. Whether you prefer the UI, SQL, or the REST API, each approach provides a scalable, efficient way to work with large datasets stored in cloud storage such as Azure Data Lake Storage or Amazon S3, while the underlying files remain in your own storage account.