Introduction
When working with Databricks, dbutils commands provide an easy interface for interacting with the file system, managing secrets, executing notebooks, and handling widgets. Coupled with PySpark and SQL, they form a powerful combination for managing and processing large-scale data. In this blog, we’ll cover the most useful dbutils commands and best practices for using PySpark and SQL in Databricks notebooks.
1. Databricks dbutils Cheat Sheet
File System (FS) Commands:
dbutils.fs commands allow you to interact with the file system, such as listing directories, creating directories, and reading files.
# List files and directories:
dbutils.fs.ls("/mnt/datasets")

# Create a directory:
dbutils.fs.mkdirs("/mnt/my-directory")

# Remove a directory or file:
dbutils.fs.rm("/mnt/my-directory", recurse=True)

# Read the first bytes of a file:
dbutils.fs.head("/mnt/my-file.txt", maxBytes=1024)

# Put data into a file:
dbutils.fs.put("/mnt/my-file.txt", "This is my data", overwrite=True)
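Note that dbutils.fs.ls() returns a list of FileInfo objects, so the results can also be used programmatically. A minimal sketch, where the path and the size threshold are just illustrative:

# Each entry returned by ls() has path, name, and size attributes.
for file_info in dbutils.fs.ls("/mnt/datasets"):
    # Print only files larger than 1 MB (threshold is arbitrary).
    if file_info.size > 1024 * 1024:
        print(file_info.path, file_info.size)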
Widgets Commands:
dbutils.widgets commands are used to create input forms in notebooks, making parameterization easy.
# Create a text widget:
dbutils.widgets.text("input", "default_value", "Input Widget")

# Create a dropdown widget:
dbutils.widgets.dropdown("dropdown", "Option1", ["Option1", "Option2"], "Select an Option")

# Retrieve the value of a widget:
dbutils.widgets.get("dropdown")

# Remove a widget:
dbutils.widgets.remove("dropdown")
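For instance, here is a minimal sketch of using a widget to parameterize a query; the widget name table_name and the default table it points to are hypothetical:

# Hypothetical example: drive a query from a widget value.
dbutils.widgets.text("table_name", "temp_table", "Table Name")

# Read the current widget value and use it in a Spark SQL query.
table_name = dbutils.widgets.get("table_name")
spark.sql(f"SELECT * FROM {table_name} LIMIT 10").show()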
Notebook Commands:
dbutils.notebook commands allow you to chain and organize your notebooks into modular workflows.
# Run another notebook:
dbutils.notebook.run("/Users/your-notebook", 60, {"param": "value"})

# Exit a notebook with a return value:
dbutils.notebook.exit("Success")
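When chaining notebooks, it helps to handle failures from the child notebook explicitly, since dbutils.notebook.run() raises an exception if the child fails or times out. A minimal sketch, assuming a hypothetical child notebook path and parameter:

# Hypothetical example: call a child notebook and react to its return value.
try:
    # Run the child notebook with a 60-second timeout and one parameter.
    result = dbutils.notebook.run("/Users/child-notebook", 60, {"run_date": "2024-01-01"})
    print(f"Child notebook returned: {result}")
except Exception as e:
    # run() raises if the child notebook fails or times out.
    print(f"Child notebook failed: {e}")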
Secrets Commands:
dbutils.secrets commands allow secure retrieval of credentials, such as API keys and passwords, stored in secret scopes.
# Get a secret from a scope:
dbutils.secrets.get(scope="my-scope", key="my-key")

# List all secrets in a scope:
dbutils.secrets.list("my-scope")
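A common pattern is to pass a secret straight into a connection option instead of hard-coding it. The sketch below assumes a hypothetical scope, key, and JDBC connection; secret values are redacted if printed in a notebook.

# Hypothetical example: use a secret as a JDBC password (never hard-code credentials).
jdbc_password = dbutils.secrets.get(scope="my-scope", key="db-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://my-host:5432/my_db")
      .option("dbtable", "public.my_table")
      .option("user", "my_user")
      .option("password", jdbc_password)
      .load())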
Summary of dbutils Commands
The table below offers a comprehensive cheat sheet for dbutils, a utility in Databricks that simplifies various operations within a workspace. Each category of commands is broken down to facilitate easy reference for users looking to enhance their productivity and streamline workflows.
File System Commands (fs): These commands allow users to interact with the Databricks file system (DBFS), enabling tasks such as listing files, creating directories, and removing unwanted files. For example, dbutils.fs.ls(path) lists all files and directories at the specified path, making it easier to navigate the file system.

Widgets: This section covers commands for creating interactive input fields in notebooks. Widgets such as text inputs and dropdowns enhance user interactivity and allow for dynamic data entry. An example command, dbutils.widgets.text(name, defaultValue, label), creates a text input for users to enter data.

Notebook Commands: Users can execute other notebooks programmatically using dbutils.notebook.run(path, timeoutSeconds, arguments), which is essential for modularizing workflows and improving code reusability.

Secrets Management: This category includes commands for securely handling sensitive information, such as API keys and passwords. For instance, dbutils.secrets.get(scope, key) retrieves secret values, ensuring that sensitive data remains protected.

Utilities: The utilities section features commands for managing libraries within Databricks. For example, dbutils.library.install(path) installs a library for the current notebook session, which is useful for data processing tasks (note that dbutils.library is deprecated on newer Databricks Runtime versions, where %pip install is the recommended replacement).
Overall, this cheat sheet serves as a valuable resource for Databricks users, providing quick access to essential commands that facilitate effective data management and workflow optimization.
Category | Command | Description | Example |
---|---|---|---|
File System (fs) | dbutils.fs.ls(path) | Lists files and directories in a specified path. | dbutils.fs.ls("/databricks-datasets") |
File System (fs) | dbutils.fs.mkdirs(path) | Creates a directory at the given path. | dbutils.fs.mkdirs("/mnt/my-directory") |
File System (fs) | dbutils.fs.rm(path, recurse=True) | Removes the file or directory at the specified path. Use recurse=True for directories. | dbutils.fs.rm("/mnt/my-directory", recurse=True) |
File System (fs) | dbutils.fs.head(path, maxBytes) | Displays the first maxBytes of the file at the specified path. | dbutils.fs.head("/mnt/my-file.txt", maxBytes=100) |
File System (fs) | dbutils.fs.put(path, data, overwrite=True) | Writes the specified data to a file at the given path. Overwrites if overwrite=True. | dbutils.fs.put("/mnt/my-file.txt", "Hello Databricks!", overwrite=True) |
Widgets | dbutils.widgets.text(name, defaultValue, label) | Creates a text input widget. | dbutils.widgets.text("input_text", "default_value", "Input Text") |
Widgets | dbutils.widgets.dropdown(name, defaultValue, choices, label) | Creates a dropdown widget with predefined choices. | dbutils.widgets.dropdown("dropdown", "Option1", ["Option1", "Option2"], "Select an Option") |
Widgets | dbutils.widgets.get(name) | Gets the current value of the widget. | dbutils.widgets.get("dropdown") |
Widgets | dbutils.widgets.remove(name) | Removes a widget by name. | dbutils.widgets.remove("dropdown") |
Notebook | dbutils.notebook.run(path, timeoutSeconds, arguments) | Runs a notebook located at the given path. | dbutils.notebook.run("/Users/my-notebook", 60, {"arg1": "value"}) |
Notebook | dbutils.notebook.exit(value) | Exits the current notebook and returns the specified value. | dbutils.notebook.exit("Completed") |
Secrets | dbutils.secrets.get(scope, key) | Retrieves the secret value for a given key in the specified scope. | dbutils.secrets.get("my-scope", "my-key") |
Secrets | dbutils.secrets.list(scope) | Lists all secrets within the specified scope. | dbutils.secrets.list("my-scope") |
Utilities | dbutils.library.install(path) | Installs a library from a DBFS path for the current notebook session (deprecated on newer runtimes; prefer %pip). | dbutils.library.install("dbfs:/FileStore/jars/my-library.jar") |
Utilities | dbutils.library.restartPython() | Restarts the Python process for the current notebook. | dbutils.library.restartPython() |
Utilities | dbutils.help() | Displays help for all available dbutils commands. | dbutils.help() |
2. PySpark Best Practices Cheat Sheet
Using PySpark within Databricks can be incredibly powerful, but it’s important to follow best practices to ensure code scalability and performance.
Basic PySpark Operations:
# Create a DataFrame from a CSV file:
df = spark.read.csv("/mnt/datasets/sample.csv", header=True, inferSchema=True)

# Display the first few rows of the DataFrame:
df.show(5)

# Select columns from a DataFrame:
df.select("column1", "column2").show()
Best Practice: Avoid using collect()
Avoid using collect(), as it brings all of the data to the driver node, which can cause memory issues.
# Bad practice - using collect() to bring all data to the driver:
data = df.collect()

# Better practice - use show() or take() instead:
df.show(5)
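If you genuinely need rows on the driver, pull back only a bounded sample. A minimal sketch (the row counts are arbitrary):

# Fetch a small, bounded number of rows to the driver.
sample_rows = df.take(5)

# If a pandas DataFrame is needed, limit the rows first.
sample_pdf = df.limit(1000).toPandas()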
Repartitioning DataFrames:
Repartition your DataFrames to optimize performance when dealing with large datasets.
# Repartition the DataFrame based on a column:
df = df.repartition("column_name")
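Depending on the workload, you can also repartition to a target partition count, or reduce partitions more cheaply with coalesce(). A brief sketch, where the partition counts shown are just illustrative:

# Repartition to a fixed number of partitions (triggers a full shuffle).
df = df.repartition(200)

# Repartition by both a target count and a column (full shuffle).
df = df.repartition(200, "column_name")

# Reduce the number of partitions without a full shuffle, e.g. before writing out.
df = df.coalesce(10)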
3. SQL Best Practices Cheat Sheet
SQL is widely used in Databricks for data querying and transformation. Below are some best practices to keep your queries optimized.
Common SQL Operations:
# Create a temporary view from a DataFrame:
df.createOrReplaceTempView("temp_table")

# Run a SQL query against the view:
spark.sql("SELECT * FROM temp_table").show()
Best Practice: Use LIMIT when previewing data
Avoid fetching large datasets during development. Use LIMIT to preview data instead.
# Use LIMIT to preview data in SQL:
spark.sql("SELECT * FROM temp_table LIMIT 10").show()
Best Practice: Leverage Caching
Cache intermediate results in memory to optimize performance for iterative queries.
# Cache a DataFrame for future use:
df.cache()
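Keep in mind that cache() is lazy: nothing is stored until an action runs. A minimal sketch of one common pattern, including releasing the memory once the DataFrame is no longer needed:

# cache() only marks the DataFrame; an action materializes it in memory.
df.cache()
df.count()  # triggers the cache to be populated

# ...run further queries against df, then release the memory when done.
df.unpersist()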
Best Practice: Avoid using SELECT * in production
Using SELECT * can lead to unnecessary data transfer and slow performance, especially with large datasets.
# Bad practice - using SELECT *:
spark.sql("SELECT * FROM temp_table")

# Better practice - select only the needed columns:
spark.sql("SELECT column1, column2 FROM temp_table")
Conclusion
With these cheat sheets, you can optimize your use of dbutils commands, PySpark, and SQL in Databricks. These best practices and commands will help ensure that your data workflows are efficient, secure, and scalable. Mastering these commands will significantly enhance your productivity and the performance of your notebooks.