Databricks dbutils, PySpark, and SQL Best Practices Cheat Sheet

Introduction

When working with Databricks, dbutils commands provide an easy interface for interacting with the file system, managing secrets, executing notebooks, and handling widgets. Coupled with PySpark and SQL, they form a powerful combination for managing and processing large-scale data. In this blog, we’ll cover the most useful dbutils commands and best practices for using PySpark and SQL in Databricks notebooks.

1. Databricks dbutils Cheat Sheet

File System (FS) Commands:

dbutils.fs commands let you interact with the Databricks file system (DBFS): listing and creating directories, reading and writing files, and removing paths.

    # List files and directories:
    dbutils.fs.ls("/mnt/datasets")
    
    # Create a directory:
    dbutils.fs.mkdirs("/mnt/my-directory")
    
    # Remove a directory or file:
    dbutils.fs.rm("/mnt/my-directory", recurse=True)
    
    # Read the first bytes of a file:
    dbutils.fs.head("/mnt/my-file.txt", maxBytes=1024)
    
    # Put data into a file:
    dbutils.fs.put("/mnt/my-file.txt", "This is my data", overwrite=True)

Widgets Commands:

dbutils.widgets are used to create input forms in notebooks, making parameterization easy.

    # Create a text widget:
    dbutils.widgets.text("input", "default_value", "Input Widget")

    # Create a dropdown widget:
    dbutils.widgets.dropdown("dropdown", "Option1", ["Option1", "Option2"], "Select an Option")

    # Retrieve the value of a widget:
    dbutils.widgets.get("dropdown")
    
    # Remove a widget:
    dbutils.widgets.remove("dropdown")

Notebook Commands:

dbutils.notebook commands allow you to chain and organize your notebooks into modular workflows.

    # Run another notebook:
    dbutils.notebook.run("/Users/your-notebook", 60, {"param": "value"})
    
    # Exit a notebook with a return value:
    dbutils.notebook.exit("Success")

Secrets Commands:

dbutils.secrets allows secure retrieval of credentials such as API keys, passwords, and connection strings stored in secret scopes (the scopes and secrets themselves are created via the Databricks CLI or REST API).

    # Get a secret from a scope:
    dbutils.secrets.get(scope="my-scope", key="my-key")
    
    # List all secrets in a scope:
    dbutils.secrets.list("my-scope")

Summary of dbutils Commands

The table below consolidates the most useful dbutils commands into a single cheat sheet. Commands are grouped by category so you can quickly find the call you need while working in a Databricks workspace.

  • File System Commands (fs): These commands allow users to interact with the Databricks file system (DBFS), enabling tasks such as listing files, creating directories, and removing unwanted files. For example, dbutils.fs.ls(path) lists all files and directories at the specified path, making it easier to navigate the file system.

  • Widgets: This section covers commands for creating interactive input fields in notebooks. Widgets such as text inputs and dropdowns enhance user interactivity and allow for dynamic data entry. An example command, dbutils.widgets.text(name, defaultValue, label), creates a text input for users to enter data.

  • Notebook Commands: Users can execute other notebooks programmatically using dbutils.notebook.run(path, timeoutSeconds, arguments), which is essential for modularizing workflows and improving code reusability.

  • Secrets Management: This category includes commands for securely handling sensitive information, such as API keys and passwords. For instance, dbutils.secrets.get(scope, key) retrieves secret values, ensuring that sensitive data remains protected.

  • Utilities: The utilities section covers library and environment helpers. For example, dbutils.library.install(library) installs a library on the cluster, though it is deprecated on newer Databricks Runtime versions, where %pip install in a notebook cell is the recommended approach; dbutils.library.restartPython() remains useful for restarting the Python process after installing packages.

Overall, this cheat sheet serves as a valuable resource for Databricks users, providing quick access to essential commands that facilitate effective data management and workflow optimization.

| Category | Command | Description | Example |
| --- | --- | --- | --- |
| File System (fs) | dbutils.fs.ls(path) | Lists files and directories in a specified path. | dbutils.fs.ls("/databricks-datasets") |
| File System (fs) | dbutils.fs.mkdirs(path) | Creates a directory at the given path. | dbutils.fs.mkdirs("/mnt/my-directory") |
| File System (fs) | dbutils.fs.rm(path, recurse=True) | Removes the file or directory at the specified path; use recurse=True for directories. | dbutils.fs.rm("/mnt/my-directory", recurse=True) |
| File System (fs) | dbutils.fs.head(path, maxBytes) | Displays the first maxBytes of the file at the specified path. | dbutils.fs.head("/mnt/my-file.txt", maxBytes=100) |
| File System (fs) | dbutils.fs.put(path, data, overwrite=True) | Writes the specified data to a file at the given path; overwrites if overwrite=True. | dbutils.fs.put("/mnt/my-file.txt", "Hello Databricks!", overwrite=True) |
| Widgets | dbutils.widgets.text(name, defaultValue, label) | Creates a text input widget. | dbutils.widgets.text("input_text", "default_value", "Input Text") |
| Widgets | dbutils.widgets.dropdown(name, defaultValue, choices, label) | Creates a dropdown widget with predefined choices. | dbutils.widgets.dropdown("dropdown", "Option1", ["Option1", "Option2"], "Select an Option") |
| Widgets | dbutils.widgets.get(name) | Gets the current value of the widget. | dbutils.widgets.get("dropdown") |
| Widgets | dbutils.widgets.remove(name) | Removes a widget by name. | dbutils.widgets.remove("dropdown") |
| Notebook | dbutils.notebook.run(path, timeoutSeconds, arguments) | Runs a notebook located at the given path. | dbutils.notebook.run("/Users/my-notebook", 60, {"arg1": "value"}) |
| Notebook | dbutils.notebook.exit(value) | Exits the current notebook and returns the specified value. | dbutils.notebook.exit("Completed") |
| Secrets | dbutils.secrets.get(scope, key) | Retrieves the secret value for a given key in the specified scope. | dbutils.secrets.get("my-scope", "my-key") |
| Secrets | dbutils.secrets.list(scope) | Lists all secrets within the specified scope. | dbutils.secrets.list("my-scope") |
| Utilities | dbutils.library.install(library) | Installs a library on the cluster (deprecated on newer runtimes; prefer %pip install). | dbutils.library.install("com.databricks:spark-csv_2.10:1.2.0") |
| Utilities | dbutils.library.restartPython() | Restarts the Python process for the current notebook. | dbutils.library.restartPython() |
| Utilities | dbutils.help() | Displays help for all available dbutils commands. | dbutils.help() |

2. PySpark Best Practices Cheat Sheet

Using PySpark within Databricks can be incredibly powerful, but it’s important to follow best practices to ensure code scalability and performance.

Basic PySpark Operations:

    # Create a DataFrame from a CSV file:
    df = spark.read.csv("/mnt/datasets/sample.csv", header=True, inferSchema=True)
    
    # Display the first few rows of the DataFrame:
    df.show(5)

    # Select columns from a DataFrame:
    df.select("column1", "column2").show()

Best Practice: Avoid using collect()

Avoid collect() wherever possible: it pulls the entire dataset onto the driver node and can cause out-of-memory failures on large data.

    # Bad practice - using collect() to bring all data to the driver:
    data = df.collect()

    # Better practice - use show() or take() to inspect a small sample:
    df.show(5)
    sample_rows = df.take(5)

Repartitioning DataFrames:

Repartition your DataFrames to spread data evenly across partitions; this reduces skew and improves parallelism when working with large datasets.

    # Repartition the DataFrame based on a column:
    df = df.repartition("column_name")

3. SQL Best Practices Cheat Sheet

SQL is widely used in Databricks for data querying and transformation. Below are some best practices to keep your queries optimized.

Common SQL Operations:

    # Create a temporary view from a DataFrame:
    df.createOrReplaceTempView("temp_table")
    
    # Running a SQL query on the DataFrame:
    spark.sql("SELECT * FROM temp_table").show()

Best Practice: Use LIMIT when previewing data

Avoid fetching large datasets during development. Use LIMIT to preview data instead.

    # Use LIMIT to preview data in SQL:
    spark.sql("SELECT * FROM temp_table LIMIT 10").show()

Best Practice: Leverage Caching

Cache intermediate results in memory to optimize performance when the same DataFrame is reused across multiple queries or iterations.

    # Cache a DataFrame for future use:
    df.cache()
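
cache() is lazy: the data is only materialized when the first action runs, and it is good practice to release the memory once the DataFrame is no longer needed. A minimal sketch:

    # Trigger an action so the cached data is actually materialized
    df.cache()
    df.count()

    # Release the cached data when you are done with it
    df.unpersist()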

Best Practice: Avoid using SELECT * in production

Using SELECT * can lead to unnecessary data transfer and slow performance, especially with large datasets.

    # Bad practice - using SELECT *:
    spark.sql("SELECT * FROM temp_table")

    # Better practice - select only needed columns:
    spark.sql("SELECT column1, column2 FROM temp_table")

Conclusion

With these cheat sheets, you can optimize your use of dbutils commands, PySpark, and SQL in Databricks. These best practices and commands will help ensure that your data workflows are efficient, secure, and scalable. Mastering these commands will significantly enhance your productivity and the performance of your notebooks.
