
Apache Airflow® Quickstart

It's easy to get your pipelines up and running with Apache Airflow®.

This quickstart offers three learning paths. Choose between these popular use cases:

  • Learning Airflow: an introduction to Airflow's lean and dynamic pipelines-as-Python-code
  • ETL: an introduction to modern, enhanced ETL development with Airflow
  • Generative AI: an introduction to generative AI model development with Airflow

Launch your journey with Airflow by signing up for a trial at astronomer.io! You'll be able to deploy your projects to Astro at the end of this tutorial.

Other ways to learn

For more help getting started, also check out our step-by-step Get Started with Airflow tutorial.

Time to complete

This quickstart takes approximately 30 minutes to complete.

Assumed knowledge

To get the most out of this quickstart, you should have an understanding of:

  • Basic Python. The project's pipelines are written as Python code.
  • Basic data pipeline concepts, such as extracting data from an API and loading it into a database.

Prerequisites

To complete this quickstart, you need:

  • The Astro CLI, which you'll use to run Airflow locally.
  • Docker Desktop or a similar container runtime, which the Astro CLI uses to run Airflow.
  • git, which you'll use to clone the project repository.

Step 1: Clone the Astronomer Quickstart repository

  1. Create a new directory for your project and open it:

    mkdir airflow-quickstart-learning && cd airflow-quickstart-learning
  2. Clone the repository and open it:

    git clone -b learning-airflow --single-branch https://github.com/astronomer/airflow-quickstart.git && cd airflow-quickstart/learning-airflow

    Your directory should have the following structure:

    .
    ├── Dockerfile
    ├── README.md
    ├── dags
    │   └── example_astronauts.py
    ├── include
    ├── packages.txt
    ├── requirements.txt
    ├── solutions
    │   └── example_astronauts_solution.py
    └── tests
        └── dags
            └── test_dag_integrity.py

Step 2: Start up Airflow and explore the UI

  1. Start the project using the Astro CLI:

    astro dev start

    The CLI will let you know when all Airflow services are up and running.

  2. In your browser, navigate to localhost:8080 and sign in to the Airflow UI using username admin and password admin.

  3. Unpause the example_astronauts DAG.

  4. Explore the DAGs view (the landing page) and the individual DAG view to get a sense of the metadata available about the DAG, its runs, and their task instances. For a deep dive into the UI's features, see An introduction to the Airflow UI.

    For example, the DAGs view will look like this screenshot:

    Airflow UI DAGs view

    As you start to trigger DAG runs, the graph view will look like this screenshot:

    Example Astronauts DAG graph view

    The Gantt chart will look like this screenshot:

    Example Astronauts DAG Gantt chart view

Step 3: Explore the project

This Astro project introduces you to the basics of orchestrating pipelines with Airflow. You'll see how easy it is to:

  • get data from data sources.
  • generate tasks automatically and in parallel.
  • trigger downstream workflows automatically.

You'll build a lean, dynamic pipeline serving a common use case: extracting data from an API and loading it into a database!

warning

This project uses DuckDB, an in-memory database. Although this type of database is great for learning Airflow, your data is not guaranteed to persist between executions!

For production applications, use a persistent database instead (consider DuckDB's hosted option MotherDuck or another database like Postgres, MySQL, or Snowflake).
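
For a concrete sense of the difference, the way a DuckDB connection is opened determines whether data persists. The snippet below is a minimal sketch; the file path is only an example, not a path used by this project:

    import duckdb

    # In-memory: everything is lost when the process exits.
    ephemeral = duckdb.connect(":memory:")

    # File-backed: data is written to disk and survives between runs.
    # (Example path only; for production, point at a managed database
    # such as MotherDuck, Postgres, MySQL, or Snowflake instead.)
    persistent = duckdb.connect("include/astronauts.db")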

Pipeline structure

An Airflow instance can run any number of DAGs (directed acyclic graphs), which are your data pipelines in Airflow. This project has two:

example_astronauts

This DAG queries the list of astronauts currently in space from the Open Notify API, prints assorted data about the astronauts, and loads data into an in-memory database.

Tasks in the DAG are Python functions decorated using Airflow's TaskFlow API, which makes it easy to turn arbitrary Python code into Airflow tasks, automatically infer dependencies, and pass data between tasks.
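
As a rough illustration of the TaskFlow style (not the project's exact code), a decorated task might look like the sketch below. The Open Notify endpoint and response shape are assumptions based on the API named above:

    from airflow.decorators import task

    import requests


    @task
    def get_astronaut_names(**context) -> list[dict]:
        # Query the Open Notify API for everyone currently in space.
        response = requests.get("http://api.open-notify.org/astros.json")
        response.raise_for_status()
        return response.json()["people"]

Because the function returns a value, Airflow passes it to downstream tasks via XCom, which is how the tasks described below can consume its output.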

  • get_astronaut_names and get_astronaut_numbers make a JSON array and an integer available, respectively, to downstream tasks in the DAG.

  • print_astronaut_craft and print_astronauts make use of this data in different ways. print_astronaut_craft uses dynamic task mapping to create a parallel task instance for each astronaut in the list retrieved from the API (see the sketch after this list). Airflow lets you do this with just two lines of code:

    print_astronaut_craft.partial(greeting="Hello! :)").expand(
        person_in_space=get_astronaut_names()
    ),

    The key feature is the expand() function, which makes the DAG automatically adjust the number of tasks each time it runs.

  • create_astronauts_table_in_duckdb and load_astronauts_in_duckdb create a DuckDB table for some of the data and load the data into it, respectively.
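
For context, here is a hedged sketch of how a mapped task like print_astronaut_craft could be defined and expanded. The "name" and "craft" keys are assumptions about the API response rather than code taken from the project:

    from airflow.decorators import task


    @task
    def print_astronaut_craft(greeting: str, person_in_space: dict) -> None:
        # Each mapped task instance receives exactly one astronaut record.
        name = person_in_space["name"]
        craft = person_in_space["craft"]
        print(f"{name} is currently in space flying on the {craft}! {greeting}")


    # partial() fixes the arguments shared by every mapped instance;
    # expand() creates one task instance per element of the upstream output.
    print_astronaut_craft.partial(greeting="Hello! :)").expand(
        person_in_space=get_astronaut_names()
    )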

example_extract_astronauts

This DAG queries the database you created for astronaut data in example_astronauts and prints out some of this data. Changing a single line of code in this DAG can make it run automatically when the other DAG completes a run.
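
As a rough sketch of what such a query task could look like (the database path and table name below are illustrative assumptions, not the project's actual identifiers):

    from airflow.decorators import task

    import duckdb


    @task
    def print_astronauts_from_table() -> None:
        # Open the (hypothetical) DuckDB file and read the astronauts table.
        conn = duckdb.connect("include/astronauts.db")
        rows = conn.execute("SELECT name, craft FROM astronauts").fetchall()
        for name, craft in rows:
            print(f"{name} is aboard the {craft}.")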

Step 4: Get your hands dirty!

With Airflow, it's easy to create cross-workflow dependencies. In this step, you'll learn how to:

  • use Airflow Datasets to create a dependency between DAGs so that when one workflow ends, another begins. To do this, you'll modify the example_extract_astronauts DAG to use a Dataset to trigger a DAG run when the example_astronauts DAG updates the table that both DAGs query.

Schedule the example_extract_astronauts DAG on an Airflow Dataset

With Datasets, DAGs that access the same data can have explicit, visible relationships, and DAGs can be scheduled based on updates to these datasets. This feature helps make Airflow data-aware and expands Airflow scheduling capabilities beyond time-based methods such as cron. Downstream DAGs can be scheduled based on combinations of Dataset updates coming from tasks in the same Airflow instance or calls to the Airflow API.

  1. Define the get_astronaut_names task as a producer of a Dataset. To do this, pass a Dataset object, encapsulated in a list, to the task's outlets parameter:

    @task(
        outlets=[Dataset("current_astronauts")]
    )
    def get_astronaut_names(**context) -> list[dict]:

    For more information about Airflow Datasets, see: Datasets and data-aware scheduling in Airflow.

  2. Schedule a downstream DAG run using an Airflow Dataset:

    Now that you have defined the get_astronaut_names task in the example_astronauts DAG as a Dataset producer, you can use that Dataset to schedule downstream DAG runs. (A condensed sketch that combines the producer and consumer changes from this step appears after this list.)

    Datasets function like an API to communicate when data at a specific location in your ecosystem is ready for use, reducing the code required to create cross-DAG dependencies. For example, with an import and a single line of code, you can schedule a DAG to run when another DAG in the same Airflow environment has updated a Dataset.

    To schedule the example_extract_astronauts DAG to run when example_astronauts updates the current_astronauts Dataset, add an import statement to make Airflow's Dataset class available:

    from airflow import Dataset
  3. Then, set the DAG's schedule using the current_astronauts Dataset:

    schedule=[Dataset("current_astronauts")],
  4. Rerun the example_astronauts DAG in the UI and check the status of the tasks in the individual DAG view. Watch as the example_extract_astronauts DAG is triggered automatically once example_astronauts updates the current_astronauts Dataset.

    If all goes well, the graph view of the Dataset-triggered DAG run will look like this screenshot:

    Dataset-triggered run graph view

    For more information about Airflow Datasets, see: Datasets and data-aware scheduling in Airflow.
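
To see how the pieces of this step fit together, here is a condensed sketch of the producer and consumer DAGs combined into one snippet. The start_date, the producer's schedule, and the task bodies are illustrative placeholders rather than the project's actual code, and in the real project the two DAGs live in separate files under dags/:

    from airflow import Dataset
    from airflow.decorators import dag, task
    from pendulum import datetime


    @dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
    def example_astronauts():
        # Producer: completing this task successfully marks the
        # "current_astronauts" Dataset as updated.
        @task(outlets=[Dataset("current_astronauts")])
        def get_astronaut_names(**context) -> list[dict]:
            return []  # placeholder: fetch and return astronaut data here

        get_astronaut_names()


    @dag(
        start_date=datetime(2024, 1, 1),
        # Consumer: run whenever the "current_astronauts" Dataset is updated.
        schedule=[Dataset("current_astronauts")],
        catchup=False,
    )
    def example_extract_astronauts():
        @task
        def print_astronauts() -> None:
            print("placeholder: query DuckDB and print astronaut data here")

        print_astronauts()


    example_astronauts()
    example_extract_astronauts()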

Next steps: run Airflow on Astro

The easiest way to run Airflow in production is with Astro. To get started, create an Astro trial. During your trial signup, you will have the option of choosing the same template project you worked with in this quickstart.
