
· 4 min read
Marcin Rudolf

If Rust + Arrow + DuckDB is the new data engineering stack, you can now get a feel for it with dlt. We recently added native loading of Arrow tables (and pandas frames). What does that mean? You can pass an Arrow table to the dlt pipeline.run or pipeline.extract methods, have it normalized, saved to parquet, and loaded to your destination.

Here we achieved a ~30x speedup when loading data from a (local) postgres database using ConnectorX + Arrow, compared to SqlAlchemy + JSON. (Both use dlt as the engine; read the disclaimer at the end!)

Load postgres table with Arrow

We’ll start with the ConnectorX library, which creates Arrow tables from SQL queries on most of the popular database engines.

pip install connectorx

The library has a Rust core, does zero-copy extraction, and is amazingly fast. We’ll extract and normalize 10 000 000 test rows from a local postgresql instance. The table chat_message looks like a Slack messages dump. Messages have a unique autoincrement id, which we use to load in chunks:

import connectorx as cx
import dlt
from dlt.sources.credentials import ConnectionStringCredentials

def read_sql_x(
    conn_str: str
):
    # load in chunks of one million rows
    for _id in range(1, 10_000_001, 1_000_000):
        table = cx.read_sql(conn_str,
            "SELECT * FROM arrow_test_2.chat_message WHERE id BETWEEN %i AND %i" % (_id, _id + 1_000_000 - 1),
            return_type="arrow2",
            protocol="binary"
        )
        yield table

chat_messages = dlt.resource(
    read_sql_x,
    name="chat_messages"
)("postgresql://loader:loader@localhost:5432/dlt_data")
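Before pointing this at a real database, it is easy to sanity-check that the chunking loop produces contiguous, non-overlapping id ranges. The sketch below reproduces just the range arithmetic from `read_sql_x`; no database is needed, and the `chunks` list is purely illustrative:

```python
# Reproduce the chunking loop from read_sql_x and verify the id ranges
# cover 1..10_000_000 without gaps or overlaps.
chunks = []
for _id in range(1, 10_000_001, 1_000_000):
    chunks.append((_id, _id + 1_000_000 - 1))

print(chunks[0])   # (1, 1000000)
print(chunks[-1])  # (9000001, 10000000)
# each chunk starts right after the previous one ends
assert all(chunks[i][0] == chunks[i - 1][1] + 1 for i in range(1, len(chunks)))
```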

In this demo I just extract and normalize the data and skip the loading step.

pipeline = dlt.pipeline(destination="duckdb", full_refresh=True)
# extract first
pipeline.extract(chat_messages)
info = pipeline.normalize()
# print count of items normalized
print(info)
# print the execution trace
print(pipeline.last_trace)

Let’s run it:

$ PROGRESS=enlighten python connector_x_speed.py
Items 10000001 [00:00, 241940483.70/s]
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- chat_messages: 10000000 row(s)

Run started at 2023-10-23T19:06:55.527176+00:00 and COMPLETED in 16.17 seconds with 2 steps.
Step extract COMPLETED in 16.09 seconds.

Step normalize COMPLETED in 0.08 seconds.

Load postgres table with SqlAlchemy

Here’s the corresponding code working with SqlAlchemy. We process 10 000 000 rows, yielding them in 100k-row chunks, and normalize to parquet in 3 parallel processes.

from itertools import islice
import dlt
from sqlalchemy import create_engine

CHUNK_SIZE = 100_000

def read_sql_a(conn_str: str):
    engine = create_engine(conn_str)
    with engine.connect() as conn:
        rows = conn.execution_options(yield_per=CHUNK_SIZE).exec_driver_sql("SELECT * FROM arrow_test_2.chat_message")
        while rows_slice := list(islice(map(lambda row: dict(row._mapping), rows), CHUNK_SIZE)):
            yield rows_slice

chat_messages = dlt.resource(
    read_sql_a,
    name="chat_messages",
    write_disposition="append",
)("postgresql://loader:loader@localhost:5432/dlt_data")

pipeline = dlt.pipeline(destination="duckdb", full_refresh=True)
# extract first
pipeline.extract(chat_messages)
info = pipeline.normalize(workers=3, loader_file_format="parquet")
print(info)
print(pipeline.last_trace)

Let’s run it:

$ PROGRESS=enlighten python sql_alchemy_speed.py
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- chat_messages: 10000000 row(s)

Run started at 2023-10-23T19:13:55.898598+00:00 and COMPLETED in 8 minutes and 12.97 seconds with 2 steps.
Step extract COMPLETED in 3 minutes and 32.75 seconds.

Step normalize COMPLETED in 3 minutes and 40.22 seconds.
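Incidentally, the `islice` batching idiom used in `read_sql_a` works with any iterator, not just a SqlAlchemy cursor. Here it is in isolation, with a generator of dicts standing in for the database rows (the helper name and sample data are illustrative):

```python
from itertools import islice

def in_chunks(iterable, chunk_size):
    # Same pattern as read_sql_a: pull chunk_size items at a time
    # until the underlying iterator is exhausted.
    iterator = iter(iterable)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk

rows = ({"id": i} for i in range(10))
batches = list(in_chunks(rows, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```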

Results

So we see a ~30x overall speedup on the extract and normalize steps (~16 seconds vs ~8 minutes). The extract step is ~13x faster, while normalize is a few thousand times faster. The Arrow normalizer just checks the schemas and moves parquet files around; the JSON normalizer inspects every row, first to infer the schema and then to validate the data.

As the output of both methods is the same (parquet files), the actual load step takes the same time in both cases and is not compared. I could easily push the load packages (parquet files) to any of the supported destinations.


Disclaimers

  • The playing field is not level. The classical (SqlAlchemy) dlt run processes data row by row, inferring and validating the schema along the way; that’s why it is so slow. The Arrow version benefits from the fact that the data is already structured in the source.
  • We load from a local database, so the network roundtrip during extraction is not included. That isolates the Arrow speedup well; with a remote database engine, the speedup will be smaller.
  • You could optimize extraction (both classical and Arrow) by reading data from postgres in parallel, or by using partitions in ConnectorX.

· 8 min read
Adrian Brudaru

tl;dr: You can kick off dbt jobs from Python - either by wrapping dbt Core, or by wrapping the Cloud API. But why should you use one over the other, and how best to do it while keeping things simple?

Outline:

  1. What is dbt, and what’s the use case for Core and Cloud?

    • The Problem dbt Solves
    • What is dbt Core?
    • What is dbt Cloud?
    • When to Use One or the Other
    • Use Cases of dbt Cloud Over Core
  2. What are the use cases for running dbt Core or Cloud from Python?

    • Case 1: Analytics Engineering and Data Engineering Teams
    • Case 2: Real-time Data Processing and Analytics
    • Case 3: Avoiding Library Conflicts
  3. Introducing dlt’s dbt runners - how the Extract and Load steps can trigger the Transform.

    • The Cloud runner
    • The Core runner
  4. A short demo on how to do that with dlt’s dbt runner.

    • dbt Cloud Runner Demo
    • dbt Core Runner Demo

1. What is dbt, and what’s the use case for Core and Cloud?

dbt (data build tool) is an open-source software that plays a crucial role in the data transformation process. It empowers data analysts and engineers to create, manage, and document data transformation workflows using SQL (Structured Query Language). dbt primarily focuses on solving the transformation aspect in ELT (Extract, Load, Transform) data processing.

The Problem dbt Solves

dbt addresses the challenge of efficient data transformation, streamlining the 'Transform' stage in ELT workflows. Traditionally, transforming raw data into a structured, analyzable format has been complex and laborious. dbt simplifies and automates this process, allowing users to define data transformations through SQL queries.

What is dbt Core?

dbt Core is the fundamental open-source version of dbt. It provides the essential features and functionalities for developing and running data transformation workflows using SQL scripts. dbt Core offers local execution capabilities, making it suitable for small to medium-scale projects run within a user's environment.

What is dbt Cloud?

dbt Cloud is a cloud-based platform provided by Fishtown Analytics, the company behind dbt. dbt Cloud offers a managed environment for running dbt, providing additional features and capabilities beyond what dbt Core offers. It is hosted on the cloud, providing a centralized, collaborative, and scalable solution for data transformation needs.

When to Use One or the Other?

The choice between dbt Core and dbt Cloud depends on various factors, including the scale of your data transformation needs, collaboration requirements, and resource constraints.

  • Use dbt Core:
    • For small to medium-sized projects.
    • When you prefer to manage and execute dbt locally within your environment.
    • If you have specific security or compliance requirements that necessitate an on-premises solution.
  • Use dbt Cloud:
    • For larger, enterprise-scale projects with significant data transformation demands.
    • When you require a managed, cloud-hosted solution to reduce operational overhead.
    • If you value collaborative features, centralized project management, and simplified access control.

But, dbt Core is free and open source, where dbt Cloud is paid. So let’s look into why we would use the paid service:

Use Cases of dbt Cloud Over Core

We could summarize this as: Cloud is the best solution if your analytics engineering team wants analytics-engineer-specific tooling and does not want to concern itself with data-engineer-specific tooling.

  1. Scalability and Performance: dbt Cloud provides seamless scalability to handle large-scale data transformation workloads efficiently.
  2. Collaboration and Team Management: dbt Cloud offers centralized project management and collaboration features, enhancing team productivity and coordination.
  3. Automated Task Scheduling: dbt Cloud allows for automated scheduling of dbt jobs, streamlining data transformation processes.
  4. Easy Integration with Cloud Data Warehouses: dbt Cloud integrates seamlessly with various cloud data warehouses, facilitating simplified setup and configuration.

So dbt Cloud is kind of like a standalone orchestrator, IDE and more.

2. What are the use cases for running dbt Core or Cloud from Python?

Case 1: You have an Analytics engineering team and a data engineering team that work with different tools.

This is a normal case in enterprise teams, where there is a clear separation of responsibilities and tooling based on team preferences and competencies.

In this case, the Analytics Engineering team will use dbt Cloud for its convenient features, making them more effective.

However, the Data Engineers will want to ensure that the dbt models run only after new data has been loaded - not before, and not at all if the data did not load. So how do we coordinate this?

To avoid race conditions, or dbt starting despite a broken loading pipeline, the data engineer needs to be able to trigger the dbt run and wait for it.

Of course, this is a case for the dbt Cloud runner.

Case 2: Real-time Data Processing and Analytics

In scenarios where you require real-time or near real-time data processing and analytics, integrating dbt with Python allows for dynamic and immediate transformations based on incoming data.

If you only refresh data once a day, you do not need the runners - you can set the loads to start at midnight, and the transforms to start at 7 AM. The hours in between are typically more than enough for loading to happen, and so you will have time to deliver the transformed data by 9 AM.

However, if you want to refresh data every 5, 15, 60 minutes or something similar, you will want to have fine grained control over calling the transform after loading the new increment.

Thus, we have to be able to kick off the dbt job and wait for it before starting the next refresh cycle.

Here, both the dbt Cloud and Core runners would fit.
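The "load, then transform, then wait" coordination described above can be sketched as a simple cycle. In this sketch the two stub functions are placeholders, not real dlt or dbt APIs; they stand in for the pipeline run and the dbt trigger respectively:

```python
def load_increment():
    # placeholder for the EL step, e.g. pipeline.run(...) on the latest increment
    return {"rows_loaded": 100}

def run_transform_and_wait():
    # placeholder for triggering dbt (Cloud or Core) and blocking until done
    return "success"

def refresh_cycle():
    load_info = load_increment()
    if load_info["rows_loaded"] == 0:
        return "skipped"  # nothing new loaded: don't run dbt at all
    status = run_transform_and_wait()
    if status != "success":
        raise RuntimeError("transform failed - do not start the next cycle")
    return "transformed"

# each cycle loads first, then transforms; only then may the next cycle start
print(refresh_cycle())  # transformed
```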

Case 3: Avoiding library conflicts between dbt Core and the run environment.

If you are running dbt from some orchestrators, such as Airflow, you might find that you cannot, because installing dbt causes library conflicts with the base environment.

In such cases, you would want to create a venv or run the job outside the orchestrator.

Thus, both the Cloud runner and the Core runner (with a virtual env) fit well here.

3. Introducing the dbt runners we have created in open source

Here at dlt we solve the EL in the ELT - so naturally we want to kick off dbt to solve the T.

dlt is an open source library made for easily building data pipelines for Python first people.

The dlt library auto cleans data and generates database-agnostic schemas before loading - so regardless of which database we use, our schema is the same. This provides a unique opportunity to standardise dbt packages on top using cross db macros.

So let’s look at the 2 runners we offer:

The Cloud runner

Docs link: dbt Cloud runner docs.

The Cloud runner we support can do the following:

  • Start a dbt job in your dbt Cloud account, optionally wait for it to finish.
  • Check the status of a dbt job in your account.

Code example:

from dlt.helpers.dbt_cloud import run_dbt_cloud_job

# Trigger a job run with additional data
additional_data = {
    "git_sha": "abcd1234",
    "schema_override": "custom_schema",
    # ... other parameters
}
status = run_dbt_cloud_job(job_id=1234, data=additional_data, wait_for_outcome=True)
print(f"Job run status: {status['status_humanized']}")

Read more about the additional data dbt accepts in their docs.

The core runner

Docs link: dbt Core runner docs.

The core runner does the following:

  • Run dbt core from a local or repository package path.
  • Set up the running:
    • Optionally install a venv.
    • Install dbt if it is not already installed.
    • Copy over the remote package.
    • Inject credentials from dlt (which can be passed via env, vaults, or directly).
    • Execute the package and report the outcome.

Code example:

# Create a transformation on a new dataset called 'pipedrive_dbt'
# we created a local dbt package
# and added pipedrive_raw to its sources.yml
# the destination for the transformation is passed in the pipeline
pipeline = dlt.pipeline(
    pipeline_name='pipedrive',
    destination='bigquery',
    dataset_name='pipedrive_dbt'
)

# make or restore venv for dbt, using latest dbt version
venv = dlt.dbt.get_venv(pipeline)

# get runner, optionally pass the venv
dbt = dlt.dbt.package(
    pipeline,
    "pipedrive/dbt_pipedrive/pipedrive",
    venv=venv
)

# run the models and collect any info
# If running fails, the error will be raised with full stack trace
models = dbt.run_all()

# on success print outcome
for m in models:
    print(
        f"Model {m.model_name} materialized " +
        f"in {m.time} " +
        f"with status {m.status} " +
        f"and message {m.message}"
    )

4. A short demo on how to do that with dlt’s dbt runner.

dbt Cloud runner

In this example, we start from the Pokemon API, load some data with dlt, and then kick off the dbt run in our dbt Cloud account.

GitHub repo: dbt Cloud runner example.

dbt Core runner

In this example, we copy GA4 events data from BigQuery into DuckDB, and run a dbt package to calculate metrics.

Article: BQ-dlt-dbt_core-MotherDuck.

Accompanying GitHub repo: dbt Core runner example.

In conclusion

Running dbt from Python is an obvious necessity for a data team that also uses Python for ingestion, orchestration, or analysis. Having the 2 options to run Cloud or Core versions of dbt enables better integration between the Transform component and the rest of the data stack.

Want more?

· 9 min read
Adrian Brudaru

In this article I will not discuss the best data warehouse you could in theory build. Instead, I will describe how data warehousing projects pragmatically start in order to have an easy time building and improving without running into early limits.

Building a data warehouse is a complex endeavor, often too intricate to navigate flawlessly in the initial attempt. In this article, we'll provide insights and pointers to guide you in choosing the right stack for your data warehouse.

hard coded dashboard

The Business requirements

Understanding the business's performance is the initial priority, and achieving this necessitates a comprehensive understanding of the business model and its various intricacies. Tracking key processes and Key Performance Indicators (KPIs) is fundamental as they provide insights into the business's health and performance across various aspects such as sales, marketing, customer engagement, operational efficiency, and financial health.

Collaboration with different departments is crucial to comprehensively grasp their unique perspectives and priorities. Engaging with stakeholders ensures that the data warehouse is designed to cater to a wide array of informational needs, aligning with the organizational goals and objectives.

Furthermore, identifying pivotal business drivers is essential. Beyond explicit feedback, it's crucial to recognize the primary business levers often represented by cross-departmental data. These drivers shed light on the core aspects that significantly impact the business's success. For instance, in an e-commerce business, the main levers might focus on increasing customer lifetime value, improving conversion rates, and optimizing ad spend to align with the customer's worth.

The Tech stack

Orchestration

Orchestration functions as the central control mechanism, overseeing and coordinating the execution of diverse data workflows.

For your first data warehouse build, opting for a managed solution often proves pragmatic. Major cloud platforms provide managed versions of orchestrators like Airflow, ensuring reliability and relieving the burden of self-hosting. While this convenience comes at some cost, the investment is justified considering the potential intricacies and management efforts associated with self-hosting, which could easily outweigh server expenses. Keep in mind that cloud vendors like GCP only charge for the rented services/hardware, so their managed Airflow is priced the same as one you would manage yourself.

The most well-known orchestrator is Airflow, which is open source and maintained by a large community.

There are many newer orchestrators that improve on Airflow’s design and shortcomings, with varying features and approaches. Prefect, Dagster, Mage, and Kestra stand out as prominent contenders, introducing unique features and approaches that push the boundaries of orchestration.

Besides the standards, you can always go for simplicity by looking out of the box - GitHub Actions is actually an orchestrator, and while not particularly feature-rich, it is often sufficient for a basic load-transform setup.

Ingestion

Future-proofing your data warehouse cannot be done by relying on the hope that vendors will fulfill your requirements. While it is easy to start with a SaaS pipeline solution, they are generally expensive and end up vendor-locking you to their schema, creating migration pains if you want to move and improve. There are also reasons to use SaaS, such as not having an in-house Python team or deciding to accept the cost and outsource the effort.

But one way or another, you end up building custom pipelines for reasons like:

  • SQL pipelines are simple to create but cost a ton on SaaS services.
  • The vendor does not have all the endpoints and too few customers asked for it for them to care.
  • You start using a new service the vendor doesn’t offer.

So, to have a clean setup, you would be better off standardizing on a custom ingester. Here, you can write your own, or use the dlt library, which is purpose-made and will generate database-agnostic schemas, enabling migration between databases at the flip of a switch and making your test setups even easier.

If you do write your own, choose a common interchange format and create it to load from that (such as json) and have all your extractors output json.

You could also consider customizable solutions like Airbyte or Meltano. However, they follow their own paradigms, which ultimately create difficulties when trying to maintain or keep a stable, robust stack.

Transforming Data

Transforming raw data into a structured, analytical format is a pivotal step in the data pipeline. In this domain, dbt stands out as a robust solution with widespread adoption, extensive documentation, and now, a standard tool. However, it's not the only player. Alternatives like SQLMesh are evolving this space, introducing enhancements tailored to specific use cases. For instance, SQLMesh's innovation in achieving database agnosticism through the use of sqlglot under the hood sets it apart.

When it comes to data modeling, star schemas emerge as the preferred choice for many due to their advantages, including efficient and clear code and support for ROLAP tools. However, it's crucial to note that the transformation code is both voluminous and complex, making adherence to best practices imperative for maintenance and scalability.

Reverse ETL

While implementing Reverse ETL might not be an initial priority, it's worth demystifying the process. For those new to pushing data via an API, it may seem intimidating. Let's simplify: sending data to an API endpoint for loading or updating an object is not much harder than making a GET request. Here's a straightforward example in Python:

import json
import requests

# assume we have a table of contacts we want to push to Pipedrive
data_table = [{'name': 'abc', 'email': 'abc@d.com'},]

# Post each row to this endpoint
# (YOUR_API_TOKEN and PIPELINE_ID are placeholders for your own values)
API_URL = f'https://api.pipedrive.com/v1/persons?api_token={YOUR_API_TOKEN}&pipeline_id={PIPELINE_ID}'
headers = {'Content-Type': 'application/json'}
for row in data_table:
    response = requests.post(API_URL, headers=headers, data=json.dumps(row))

For those seeking tools, Census and Hightouch are prominent players in this space.

Dashboards & their usage paradigms

When it comes to dashboards, each tool follows its own distinctive paradigm. For example, Tableau and PowerBI are good for analysts building polished dashboards for business users, while Metabase offers more simplicity and self-service for more technically able business users.

If you're uncertain about your choice, starting with something simple and rooted in ROLAP (Relational Online Analytical Processing) is a sound approach. ROLAP plays a pivotal role and should not be underestimated—it's the linchpin for leveraging star schemas.

But what exactly is ROLAP? ROLAP lets you define links between tables, allowing the tool to present data as if it's pre-joined, performing actual joins only when needed.

Essentially, ROLAP transforms a star schema into what appears to be a single table for the end user. This setup empowers users to navigate and explore data seamlessly using a "pivot-table" like interface commonly found in BI tools.

By using ROLAP, we are able to maintain single versions of dimension tables, and reduce maintenance efforts while increasing flexibility and velocity for the end user.
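To make this concrete, here is the kind of join a ROLAP layer generates behind the scenes when a user drags a dimension next to a measure. The star schema, table names, and values below are made up; the point is that the user never writes the join, the tool does, at query time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'DE'), (2, 'US');
    INSERT INTO fact_orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# The user just picks "country" and "amount" in a pivot-table UI; the
# ROLAP layer knows the customer_id link and joins only when needed.
rows = conn.execute("""
    SELECT c.country, SUM(f.amount)
    FROM fact_orders f JOIN dim_customer c USING (customer_id)
    GROUP BY c.country ORDER BY c.country
""").fetchall()
print(rows)  # [('DE', 75.0), ('US', 40.0)]
```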

Data stack Governance

This section sheds light on strategies for efficient management of your data stack. Here are key tips to get you started:

  • Version control is essential: Version control, like using Git, is akin to having a safety net for your code. It ensures you can track changes, collaborate seamlessly, and revert to previous versions if needed.

  • Early alert setup: Implementing alert mechanisms from the get-go is like having a diligent watchdog for your data. It helps you catch issues early, preserving trust in your data. Check out this guide on using dlt to send alerts to Slack.

  • Streamlined workflows and CI/CD: Streamlining your workflows and embracing CI/CD is like putting your data operations on an express lane. It speeds up development, minimizes errors, and ensures a smoother deployment process. If you're using Airflow on GCP, this simple setup guide is your friend.

  • Assumption testing: Adding comprehensive tests is akin to having a safety net beneath a trapeze artist. It gives you the confidence to make changes or additions without fearing a crash.

  • Goal-oriented KPI definition: When defining KPIs, always keep the end goal in mind. Tailor your KPIs to what matters most for each business function. Marketing may dance to the tune of signups, Finance to contracts, and Operations to active users.

  • Implement lineage for faster Troubleshooting: Implementing lineage is like having a well-organized toolbox. It helps you trace and understand the journey of your data, making troubleshooting and model iteration a breeze.

These foundational practices form the cornerstone of effective data stack governance, ensuring a sturdy structure that grows with your data needs.

In Conclusion: a simple beginning, a challenging growth

Initiating a data warehouse project doesn't have to be an uphill struggle. In fact, starting with simplicity can often be the wisest path. With minimal effort, you can accomplish a great deal of what a data team requires in the initial stages.

The true test lies in scaling—the journey from a streamlined beginning to a comprehensive, organization-wide data infrastructure. This evolution is where most of the challenge happens - adoption, stakeholder education and culture change happen in this step too. However, it's worth noting that having an entire team of data experts right at the start of this journey is a rarity. Therefore, while scaling is a critical aspect, delving into the intricacies of extensive team and organizational scaling ventures beyond the scope of this article.

Resources

If you're building on Google Cloud Platform (GCP), here are some tutorials and resources that can aid you in your data warehouse setup:

  1. Deploy Cloud Composer with CI/CD from GitHub Repo Tutorial Link: Deploy Cloud Composer with CI/CD

  2. Deploy DLT to Cloud Composer Tutorial Link: Deploy dlt to Cloud Composer

  3. Deploy dbt to Cloud Composer Tutorial Link: Deploy dbt to Cloud Composer

  4. Setting up Alerts to Slack Tutorial Link: Setting up Alerts to Slack. For integrating it into on-failure callbacks, refer to the Apache Airflow documentation

  5. Example ROLAP Definition on Holistics Tool Tutorial Link: Example ROLAP Definition on Holistics Tool

Want to discuss dlt and data lakes or warehouses?

· 7 min read
Adrian Brudaru

We often see people talk about data products or data as a product, and they usually tackle the topics of:

  • Concepts and how to think about data products.
  • Users and producers: Roles, responsibilities and blame around data products.
  • Data: Data quality and governance that is part of those products, data as a product.
  • Code: The code or technology powering the pipelines.
  • Infra: the infrastructure data products are run on.

What we do not see is practical advice or examples of how to implement these products. While the concepts often define data products as something with a use case, they fail to discuss the importance of a user manual, or documentation.

The role of the user manual

So what is a data product?

A data product is a self-contained piece of data-powered software that serves a single use case. For example, it could be a pipeline that loads Salesforce data to Snowflake, or it could be an ML model hosted behind an API. Many talk about data products as some kind of inter-company exchange - like one company produces it and another reuses it. However, the prevalent case is a team building it and another team using it - just like a "production backend", these internal data tools help the business run its processes and are an integral part of the company and its product.

Always consider the use case for the description of the product, but the entire technical stack as part of the product - so, the code and data responsible for enabling the use case are part of the product.

Examples of data products:

  • Lead ranking algorithm that helps the sales team prioritise their leads based on rules and maybe data.
  • ROI calculator that enables the marketing team to optimise profits or expansion via better bidding and reinvestment efforts.
  • Data pipeline that creates a report that is core for the finance team, containing things the finance team defines and wants.
  • A data contract that alerts if salesforce leads do not have a corresponding company in the production system.
  • A series of calculations that segment customers by various features we can use for targeting.
  • A data mart that enables the CRM team to select subsets of users by ad-hoc defined behavior.
  • A data pipeline that provides externals with data.
  • A report which we offer via api for external consumption.
  • An api endpoint that produces a content recommendation for a particular website slot.
  • A dashboard that enables the Account Management team to prioritise who they should reach out to, to enable them to reach their goals.

What makes a data pipeline a data product?

The term product assumes more than just some code. A "quick and dirty" pipeline is what you would call a "proof of concept" in the product world and far from a product.

Who the duck wrote this garbage??? Ah nvm… it was me…

To create a product, you need to consider how it will be used, by whom, and enable that usage by others.

A product is something that you can pick up and use and is thus different from someone’s python spaghetti.

For example, a product is:

  • Reusable: The first thing needed here is a solid documentation that will enable other users to understand how to use the product.
  • Robust: Nothing kills the trust in data faster than bad numbers. To be maintainable, code must be simple, explicit, tested, and documented :)
  • Secure: Everything from credentials to data should be secure. Depending on their requirements, that could mean keeping data on your side (no 3rd party tools), controlling data access, using SOC2 compliant credential stores, etc.
  • Observable: Is it working? How do you know? You can automate a large part of this question by monitoring the volume of data and schema changes, or whatever other important run parameters or changes you might have.
  • Operationalizable: Can we use it? Do we need a rocket scientist, or can little Bobby Tables use it? That will largely depend on the docs and the product itself.

So what is a data product made of?

Let’s look at the high-level components:

  1. Structured data: A data product needs data. The code and data are tightly connected - an ML model or data pipeline cannot be trained or operate without data. Why structured? Because our code expects structured input, so the data is either explicitly structured upfront ("schema on write") or structured implicitly on read ("schema on read").
  2. Code.
  3. Docs for usage - Without a user manual, a complex piece of code is next to unusable.

And which docs are needed?

We will need top level docs, plus some for each of the parts described above.

  1. Top level: Purpose of existence for the data product. The code describes the what and how - So focus the readme on the “why” and top level “what”. Similar to a problem description, this document explains what problem your product solves and enables the reader to understand the cost and impact it might have to use your product.
  2. Structured data:
    1. A data dictionary enables users to gain literacy on the dataset.
    2. Maintenance info: information about the source, schema, tests, responsible person, how to monitor, etc.
  3. Code & Usage manual: This one is harder. You need to convey a lot of information effectively, and depending on who your user is, you need to convey that information in a different format. According to the best practices on the topic of docs, these are the 4 relevant formats you should consider. They will enable you to write high-quality, comprehensive and understandable docs that cover the user’s intention.
    • learning-oriented tutorials;
    • goal-oriented how-to guides;
    • understanding-oriented discussions;
    • information-oriented reference material.

Some examples from dlt

dlt is a library that enables us to build data pipelines. By building with dlt, you benefit from simple declarative code and accessible docs for anyone maintaining it later. Assuming you use dlt or your own loading approach in your data platform, you will want to document both the tool used, to enable people to modify things, and the pipelines themselves, to describe semantically what is being loaded. Here are some examples of how you could do that:

  • Top level: Here is our attempt for dlt itself - the intro doc. You could describe the problem or use case that the pipeline solves.
  • Data dictionary: Schema info belongs to each pipeline and can be found here. To get sample values, you could write a query. We plan to enable its generation in the future via a “describe” command.
  • Maintenance info: See how to set up schema evolution alerts. You can also capture load info such as row counts to monitor loaded volume for abnormalities.
  • Code and usage: We are structuring all our docs to follow the best practices around the 4 types of docs, generating a comprehensive, recognisable documentation. We also have a GPT assistant on docs, and we answer questions in Slack for conversational help.

In conclusion

Stop thinking about data, code and docs in isolation - they do not function independently; they are different parts of the same product. To produce quality documentation, focus on the why, let the code show the what and how, and use standard formats for teaching complex tooling.

Want to create data products with dlt? What are you waiting for?

· 7 min read
info

TL;DR I set up a data pipeline that automatically extracts data from PDF invoices in my Gmail and puts it all in a single Google Sheet where I can easily keep track of it. I did this using the Python library dlt, which uses LangChain and LLMs to read PDF data and convert it into structured tables.

I am Anna, co-founder & COO of dltHub. As an ops lead with many years of running SMB-size startups, I find myself juggling a myriad of tasks, from administration, finance, and people to customer support or sales. These tasks come with their own data, all of which are crucial for making decisions. This creates a huge scope for automation, but unfortunately getting engineering support is not always easy. Whether it's integrating tools with APIs or managing data efficiently, the waiting game can be frustrating.

So, I often end up doing manual tasks such as budgeting, cost estimation, updating CRM, or preparing audits. I have been dreaming about automating these processes.

For example, I need to analyze expenses in order to prepare a budget estimation. I get numerous PDFs daily in a dedicated Gmail group inbox. I was wondering to what extent dlt can help fulfill my automation dream. I decided to work with Alena from our data team on an internal project.

invoice flow chart

Use Case

Imagine this scenario: your team receives numerous invoices as email attachments daily. You need to extract and analyze the data within these invoices to gain insights crucial to your operations. This is where the data load tool (dlt) steps in.

Alena created a pipeline using dlt that automates the process of parsing invoices received as email attachments in a specific Google email group and stores the extracted data in a database (for example, BigQuery or DuckDB).

As a non-coder working in tech startups for a long time, I finally got a chance to learn how to use the terminal and run a simple pipeline.

Here's a summary of how it works.

Let’s get started

In this article, I will show you an example of loading structured data from invoices received by email into BigQuery. For more details, check the README.md in the GitHub repository.

Step 1. Preparation

Make sure that you have all you need:

  • Make sure you have Python 3.x installed on your system.
  • Use a virtual environment (more details on how to set up the environment).
  • Install the dlt library by using pip install "dlt[bigquery]".
  • Create a project folder on your laptop. I called mine “unstructured_data_pipeline”.
  • We will need access to an LLM; LangChain uses OpenAI models by default, so we also used an OpenAI API token.
  • Using an editor like Visual Studio Code makes it easier.

Step 2. Initiate the pipeline

To create the pipeline, we will use the dlt verified source unstructured_data, which includes the verified source inbox.

  • Init the pipeline by using dlt init unstructured_data bigquery.
  • Install necessary requirements pip install -r requirements.txt.

Step 3. Set up your credentials

The dlt init command creates a .dlt folder in your project directory and clones the source code from the verified-sources repository.

  • Open .dlt/secrets.toml file on your laptop.

  • Enter the OpenAI secrets:

    [sources.unstructured_data]
    openai_api_key = "openai_api_key"
  • Enter your email account secrets in the same section [sources.unstructured_data]:

    host = 'imap.example.com'
    email_account = "example@example.com"
    password = 'set me up!'

    Check here how to configure the inbox source.

  • Enter the BigQuery secrets:

    [destination.bigquery]
    location = "US"
    [destination.bigquery.credentials]
    project_id = "set me up!"
    private_key = "set me up!"
    client_email = "set me up!"

Read more about dlt credentials and BigQuery credentials.

Step 5: Define your queries

This is the part where you can define what you’d like to see as an outcome.

Queries example:

INVOICE_QUERIES = {
    "recipient_company_name": "Who is the recipient of the invoice? Just return the name. If you don't know, then return None",
    "invoice_amount": "What is the total amount of the invoice? Just return the amount as decimal number, no currency or text. If you don't know, then return None",
    "invoice_date": "What is the date of the invoice? Just return the date. If you don't know, then return None",
    "invoice_number": "What is the invoice number? Just return the number. If you don't know, then return None",
    "service_description": "What is the description of the service that this invoice is for? Just return the description. If you don't know, then return None",
}

Customize the INVOICE_QUERIES dictionary in the unstructured_data/settings.py file if you want to extract other information, or if your invoices have a different structure.
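
For instance, adding one more field is just one more key in the dictionary. The sketch below is hypothetical: the "currency" key and its question are made up for illustration; each value is simply a natural-language question the LLM answers per invoice.

```python
# Hypothetical example of customizing the queries: the "currency" key
# and its question text are made up for illustration.
INVOICE_QUERIES = {
    "invoice_amount": "What is the total amount of the invoice? Just return the amount as decimal number, no currency or text. If you don't know, then return None",
    "invoice_date": "What is the date of the invoice? Just return the date. If you don't know, then return None",
}

# Add a new question for a field you care about:
INVOICE_QUERIES["currency"] = (
    "What currency is the invoice issued in? Just return the three-letter "
    "currency code. If you don't know, then return None"
)
```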

Step 6: Run the pipeline!

And now the magic happens. Use the following command to run the pipeline:

python unstructured_data_pipeline.py

In the next step, dlt will save all processed structured data to the database (in my case, BigQuery).

Step 7: Check the outcome in BigQuery

If you load it to BigQuery like I did in my example, then you can look at your data using BigQuery UI or export it directly to a Google sheet.

Step 8: Deploy

Now you can deploy this script with GitHub Actions as we did, so that it checks your incoming email every day and processes invoices automatically.
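
To sketch what such a schedule could look like, here is a hypothetical GitHub Actions workflow. The file name, cron schedule, Python version, and secret name are all illustrative; dlt also ships a deploy command that can generate a workflow like this for you.

```yaml
# .github/workflows/run_pipeline.yml - a hypothetical sketch, not the
# exact workflow used in this project.
name: run invoice pipeline
on:
  schedule:
    - cron: '0 6 * * *'  # every day at 06:00 UTC
  workflow_dispatch: {}  # allow manual runs too
jobs:
  run_pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python unstructured_data_pipeline.py
        env:
          # dlt can read secrets.toml values from environment variables
          SOURCES__UNSTRUCTURED_DATA__OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```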

Outcome:

Here’s how the result looks in BigQuery:

screenshot 1

…and as a Google Sheet. You can easily export this table from BigQuery to Google Sheets using the Export button in the top right corner.

screenshot 2

Bonus: In order to have a Google Sheet with live updates, you can go to the Data tab in your Spreadsheet → Data Connectors → BigQuery → choose your database and voila, your data will be updated automatically.

screenshot 3

Conclusion:

This worked well:

  • dlt was good at extracting the data I needed, and it indeed worked in real-time. I needed some support from Alena when running the pipeline for the first time, but that’s because I had never coded before. 😊
  • I was able to see the details that are relevant to my work around budgeting.

This did not work well:

  • Some PDFs don’t get transformed correctly; some details were missing or misspelled. That depends on the LLM, which extracts structured data from the raw text, and also on the invoice structure.
  • It only worked well with digital PDFs, not with JPG/scanned versions. Luckily, 99% of all the invoices are of the former kind. However, you can set up this process for other kinds of documents by making modifications to unstructured.io.

Where to go next?

It was definitely a great start, and we will test it further. And I already have many other use cases where dlt pipelines could help with ops automation processes. E.g.:

  • In creating a list of all contracts based on PDFs in a Google Drive folder (super useful for audits).
  • In moving specific data to CRM (e.g. invoice related information about the customers).

This specific example illustrates just one way in which Operations Leads can harness the power of dlt to analyze data efficiently without relying on engineers for extensive support. By automating data processes and enabling real-time insights, dlt empowers small startups to make informed decisions and stay competitive in their respective markets.

In the startup world where time is of the essence, dlt has a chance to be the key to unlock data's full potential and accelerate operational efficiency. I’m looking forward to saying goodbye to endless waiting and hello to a world where Operations Leads can take control of their data needs, all thanks to dlt.

· 19 min read
Zaeem Athar
info

TL;DR: A modern analytics stack with dlt and Holistics to ingest unstructured production data from MongoDB into flat tables in BigQuery and transform it for self-service analytics.

If you’re a CTO, then you probably love MongoDB: it’s scalable, production-ready, and a great dump for unstructured and semi-structured data. If you’re however a data scientist or data analyst and you need to run analytics on top of MongoDB data dumps, then you’re probably not a fan. The data in MongoDB needs to be transformed and stored in a data warehouse before it is ready for analytics. The process of transforming and storing the data can become quite tedious due to the unstructured nature of the data.

In this blog, we will show you how you can combine dlt and Holistics and create a modern data stack that makes the process of extracting unstructured data from MongoDB, and running self-service analytics on the data simple and straightforward. We will use dlt to ingest the Movie Flix Dataset into BigQuery from MongoDB and use Holistics to transform the data and run self-service analytics.

An Overview of the MongoDB Modern Analytics Stack

Diagram illustrating the inner workings of our Modern Analytics Stack

| Tool | Layer | Why it’s awesome |
| --- | --- | --- |
| MongoDB | Production | Sometimes used as a data dump by CTOs. Often stores unstructured, semi-structured production data that stakeholders want to access. |
| dlt | Data Ingestion | Mongo is great, but then others struggle to analyze the data. dlt extracts data from MongoDB, creates a schema in BigQuery, and loads normalized MongoDB data into BigQuery. |
| BigQuery | Data Warehouse | Because of its pricing model, it’s a good data warehouse choice to store structured MongoDB data so it can be used by BI tools like Holistics for self-service analytics. |
| Holistics | Data Modeling for Self-Service Analytics | Holistics makes it easy for data teams to set up and govern an end-user self-service analytics platform using DevOps best practices. |

In our stack, dlt resides in the data ingestion layer. It takes in unstructured data from MongoDB, normalizes it, and loads it into BigQuery.

In the data modeling layer, Holistics accesses the data from BigQuery, builds relationships, transforms the data, and creates datasets to access the transformations. In the reporting layer, Holistics allows stakeholders to self-service their data by utilizing the created datasets to build reports and create visualizations.

MongoDB is loved by CTOs, but its usage creates issues for stakeholders.

NoSQL databases such as MongoDB have gained widespread popularity due to their capacity to store data in formats that align more seamlessly with application usage, necessitating fewer data transformations during storage and retrieval.

MongoDB is optimized for performance and uses BSON (Binary JSON) under the hood instead of plain JSON. This allows MongoDB to support custom and more complex data types, such as geospatial data, dates, and regex, as well as additional character encodings.
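
To see why the extra types matter, consider a minimal example using only Python's standard library: plain JSON simply cannot encode a date value, while BSON stores dates natively.

```python
import datetime
import json

# Plain JSON has no native date type: encoding a document that contains
# a datetime fails, while BSON (what MongoDB uses internally) can store
# dates, binary data, and other extra types directly.
doc = {
    "title": "The Great Train Robbery",
    "released": datetime.datetime(1903, 12, 1),
}

try:
    json.dumps(doc)
    json_can_encode = True
except TypeError:
    json_can_encode = False

print("plain JSON can encode a datetime:", json_can_encode)  # False
```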

All these benefits enable MongoDB to be a faster and better database, but the flexibility offered by MongoDB is sometimes abused by developers and CTOs who use it as a dump for all types of unstructured and semi-structured data. This makes the data inaccessible to stakeholders and unfit for analytics purposes.

Moreover, the unique nature of MongoDB with its BSON types and its usage as a data dump in current times mean that additional hurdles must be crossed before data from MongoDB can be moved elsewhere.

How does our Modern data stack solve the MongoDB problem?

In the data ingestion layer, dlt utilizes the MongoDB verified source to ingest data into BigQuery. Initializing the MongoDB verified source sets up the default code needed to run the pipeline. We just have to set up the credentials and specify the collections in MongoDB to ingest into BigQuery. Once the pipeline is run, dlt takes care of all the steps: extracting unstructured data from MongoDB, normalizing it, creating the schema, and loading the data into BigQuery.

Getting your data cleaned and ingested into a data warehouse is just one part of the analytics pipeline puzzle. Before the data is ready to be used by the entire organization the data team must model the data and document the context of data. This means defining the relationships between tables, adding column descriptions, and implementing the necessary transformations. This is where Holistics shines. With analytics-as-code as first-class citizens, Holistics allows data teams to adopt software engineering best practices in their analytics development workflows. This helps data teams govern a centralized curated set of semantic datasets that any business users can use to extract data from the data warehouse.

Why is dlt useful when you want to ingest data from a production database such as MongoDB?

Writing a Python-based data ingestion pipeline for sources such as MongoDB is quite a tedious task as it involves a lot of overhead to set up. The data needs to be cleaned before it is ready for ingestion. Moreover, MongoDB is a NoSQL database meaning it stores data in a JSON-like data structure. So if you want to query it with SQL natively, you will need to transform this JSON-like data structure into flat tables. Let's look at how this transformation and cleaning can be done:

  • Create a Data Model based on the MongoDB data we intend to ingest.
  • Create tables in the data warehouse based on the defined Data Model.
  • Extract the data from MongoDB and perform necessary transformations such as Data Type conversion (BSON to JSON), and flattening of nested data.
  • Insert the transformed data into the corresponding SQL tables.
  • Define relationships between tables by setting up primary and foreign keys.

Using the dlt MongoDB verified source we can forgo the above-mentioned steps. dlt takes care of all the steps from transforming the JSON data into relational data, to creating the schema in the SQL database.

To get started with dlt we only need to set some basic configurations, while everything else is automated. The workflow for creating such a data pipeline in dlt looks something like this:

  • Initialize a MongoDB source to copy the default code.
  • Set up the credentials for the source and destination.
  • Define the MongoDB collection to ingest (or default to all).
  • Optionally configure incremental loading based on source logic.
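
The incremental-loading step in the list above can be sketched as a toy example. This is not dlt's actual API, just the underlying idea: remember the highest cursor value seen so far, and on each run only fetch rows newer than it.

```python
# A toy sketch of incremental loading (not dlt's actual API): track the
# highest "id" seen and only return rows above it on subsequent runs.
def fetch_new_rows(all_rows, state):
    new = [row for row in all_rows if row["id"] > state["last_value"]]
    if new:
        state["last_value"] = max(row["id"] for row in new)
    return new

state = {"last_value": 0}
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

first_run = fetch_new_rows(rows, state)   # both rows are new
second_run = fetch_new_rows(rows, state)  # nothing new to load
print(len(first_run), len(second_run))    # 2 0
```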

What is useful about Holistics in this project?

Holistics is a Business Intelligence platform with the goal of enabling self-service analytics for entire organizations. Holistics works by connecting to an SQL data warehouse. This allows it to build SQL queries and execute them against the data warehouse. In essence, Holistics utilizes the storage and processing capabilities of the data warehouse and the data never leaves the data warehouse.

To enable self-service Holistics introduces a modeling layer. The data teams use this layer to define table relationships, data transformations, metrics, and data logic. The entire organization can utilize these metrics and data logic defined in this layer to self-service their data needs.

In addition to the transformation layer, Holistics provides advanced features such as defining models using code through Holistics’ analytics-as-code languages (AMQL) and utilizing Git version control systems to manage code changes. Moreover, data teams can integrate with dbt to streamline the data transformations.

The overall Holistics workflow looks something like this:

Holistics Overview

  • Connect Holistics to an existing SQL data warehouse.
  • Data teams use Holistics Data Modeling to model and transform analytics data. This model layer is reusable across reports & datasets.
  • Non-technical users can self-service explore data based on predefined datasets prepared by data teams. They can save their explorations into dashboards for future use.
  • Dashboards can be shared with others, or pushed to other platforms (email, Slack, webhooks, etc.).

Code Walkthrough

In this section, we walk through how to set up a MongoDB data pipeline using dlt. We will be using the MongoDB verified source you can find here.

1. Setting up the dlt pipeline

Use the command below to install dlt.

pip3 install -U dlt

Consider setting up a virtual environment for your projects and installing the project-related libraries and dependencies inside the environment. For best installation practices visit the dlt installation guide.

Once we have dlt installed, we can go ahead and initialize a verified MongoDB pipeline with the destination set to Google BigQuery. First, create a project directory and then execute the command below:

dlt init mongodb bigquery

The above command will create a local ready-made pipeline that we can customize to our needs. After executing the command your project directory will look as follows:

.
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── mongodb
│   ├── README.md
│   ├── __init__.py
│   └── helpers.py
├── mongodb_pipeline.py
└── requirements.txt

The __init__.py file in the mongodb directory contains the dlt functions, called resources, that yield the data from MongoDB. The yielded data is passed to a dlt.pipeline that normalizes the data and forms the connection to move the data to your destination. To get a better intuition of the different dlt concepts, have a look at the docs.

As the next step, we set up the credentials for MongoDB. You can find detailed information on setting up the credentials in the MongoDB verified sources documentation.

We also need to set up the GCP service account credentials to get permissions to BigQuery. You can find detailed explanations on setting up the service account in the dlt docs under Destination Google BigQuery.

Once all the credentials are set add them to the secrets.toml file. Your file should look something like this:

# put your secret values and credentials here. do not share this file and do not push it to github
[sources.mongodb]
connection_url = "mongodb+srv://<user>:<password>@<cluster_name>.cvanypn.mongodb.net" # please set me up!
database = "sample_mflix"

[destination.bigquery]
location = "US"
[destination.bigquery.credentials]
project_id = "analytics" # please set me up!
private_key = "very secret can't show"
client_email = "<org_name>@analytics.iam.gserviceaccount.com" # please set me up!

The mongodb_pipeline.py at the root of your project directory is the script that runs the pipeline. It contains many functions that provide different ways of loading the data. The selection of the function depends on your specific use case, but for this demo, we try to keep it simple and use the load_entire_database function.

def load_entire_database(pipeline: Pipeline = None) -> LoadInfo:
    """Use the mongo source to completely load all collections in a database"""
    if pipeline is None:
        # Create a pipeline
        pipeline = dlt.pipeline(
            pipeline_name="local_mongo",
            destination='bigquery',
            dataset_name="mongo_database",
        )

    # By default the mongo source reflects all collections in the database
    source = mongodb()

    # Run the pipeline. For a large db this may take a while
    info = pipeline.run(source, write_disposition="replace")

    return info

Before we execute the pipeline script, let's install the dependencies for the pipeline from the requirements.txt file.

pip install -r requirements.txt

Finally, we are ready to execute the script. In the main function uncomment the load_entire_database function call and run the script.

python mongodb_pipeline.py

If you followed the instructions correctly the pipeline should run successfully and the data should be loaded in Google BigQuery.

2. The result: Comparing MongoDB data with the data loaded in BigQuery.

To get a sense of what we accomplished let's examine what the unstructured data looked like in MongoDB against what is loaded in BigQuery. Below you can see the sample document in MongoDB.

{
  "_id": {
    "$oid": "573a1390f29313caabcd42e8"
  },
  "plot": "A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.",
  "genres": [
    "Short",
    "Western"
  ],
  "runtime": {
    "$numberInt": "11"
  },
  "cast": [
    "A.C. Abadie",
    "Gilbert M. 'Broncho Billy' Anderson",
    "George Barnes",
    "Justus D. Barnes"
  ],
  "poster": "https://m.media-amazon.com/images/M/MV5BMTU3NjE5NzYtYTYyNS00MDVmLWIwYjgtMmYwYWIxZDYyNzU2XkEyXkFqcGdeQXVyNzQzNzQxNzI@._V1_SY1000_SX677_AL_.jpg",
  "title": "The Great Train Robbery",
  "fullplot": "Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.",
  "languages": [
    "English"
  ],
  "released": {
    "$date": {
      "$numberLong": "-2085523200000"
    }
  },
  "directors": [
    "Edwin S. Porter"
  ],
  "rated": "TV-G",
  "awards": {
    "wins": {
      "$numberInt": "1"
    },
    "nominations": {
      "$numberInt": "0"
    },
    "text": "1 win."
  },
  "lastupdated": "2015-08-13 00:27:59.177000000",
  "year": {
    "$numberInt": "1903"
  },
  "imdb": {
    "rating": {
      "$numberDouble": "7.4"
    },
    "votes": {
      "$numberInt": "9847"
    },
    "id": {
      "$numberInt": "439"
    }
  },
  "countries": [
    "USA"
  ],
  "type": "movie",
  "tomatoes": {
    "viewer": {
      "rating": {
        "$numberDouble": "3.7"
      },
      "numReviews": {
        "$numberInt": "2559"
      },
      "meter": {
        "$numberInt": "75"
      }
    },
    "fresh": {
      "$numberInt": "6"
    },
    "critic": {
      "rating": {
        "$numberDouble": "7.6"
      },
      "numReviews": {
        "$numberInt": "6"
      },
      "meter": {
        "$numberInt": "100"
      }
    },
    "rotten": {
      "$numberInt": "0"
    },
    "lastUpdated": {
      "$date": {
        "$numberLong": "1439061370000"
      }
    }
  },
  "num_mflix_comments": {
    "$numberInt": "0"
  }
}

This is a typical way data is structured in a NoSQL database. The data is in a JSON-like format and contains nested data. Now, let's look at what is loaded in BigQuery. Below you can see the same data in BigQuery.

BigQuery Data Overview

The ddl (data definition language) for the movies table in BigQuery can be seen below:

CREATE TABLE `dlthub-analytics.mongo_database.movies`
(
  _id STRING NOT NULL,
  plot STRING,
  runtime INT64,
  poster STRING,
  title STRING,
  fullplot STRING,
  released TIMESTAMP,
  rated STRING,
  awards__wins INT64,
  awards__nominations INT64,
  awards__text STRING,
  lastupdated TIMESTAMP,
  year INT64,
  imdb__rating FLOAT64,
  imdb__votes INT64,
  imdb__id INT64,
  type STRING,
  tomatoes__viewer__rating FLOAT64,
  tomatoes__viewer__num_reviews INT64,
  tomatoes__viewer__meter INT64,
  tomatoes__fresh INT64,
  tomatoes__critic__rating FLOAT64,
  tomatoes__critic__num_reviews INT64,
  tomatoes__critic__meter INT64,
  tomatoes__rotten INT64,
  tomatoes__last_updated TIMESTAMP,
  num_mflix_comments INT64,
  _dlt_load_id STRING NOT NULL,
  _dlt_id STRING NOT NULL,
  tomatoes__dvd TIMESTAMP,
  tomatoes__website STRING,
  tomatoes__production STRING,
  tomatoes__consensus STRING,
  metacritic INT64,
  tomatoes__box_office STRING,
  imdb__rating__v_text STRING,
  imdb__votes__v_text STRING,
  year__v_text STRING
);

If you compare the DDL against the sample document in MongoDB, you will notice that nested arrays such as cast are missing from the DDL in BigQuery. This is because of how dlt handles nested arrays. If we look at our dataset in BigQuery, we can see that cast is loaded as a separate table.

BigQuery Table Overview

dlt normalizes nested data by loading it into separate tables and creating relationships between them, so they can be combined using normal SQL joins. All of this is taken care of by dlt, and we need not worry about how the transformations are handled. In short, the transformation steps we discussed in Why is dlt useful when you want to ingest data from a production database such as MongoDB? are taken care of by dlt, making the data analyst's life easier.

To better understand how dlt does this transformation, refer to the docs.
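
To get an intuition for the flattening, here is a minimal sketch (not dlt's actual implementation) of the idea: nested dicts become flattened column names like imdb__rating, while list fields are split out to become child tables.

```python
# A minimal sketch (not dlt's actual implementation) of how a nested
# document is normalized: nested dicts flatten into "parent__child"
# columns, while lists become separate child tables.
def flatten(doc, prefix=""):
    flat, children = {}, {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            nested_flat, nested_children = flatten(value, f"{name}__")
            flat.update(nested_flat)
            children.update(nested_children)
        elif isinstance(value, list):
            children[name] = value  # loaded as a separate child table
        else:
            flat[name] = value
    return flat, children

movie = {
    "title": "The Great Train Robbery",
    "imdb": {"rating": 7.4, "votes": 9847},
    "cast": ["A.C. Abadie", "George Barnes"],
}
row, child_tables = flatten(movie)
print(row)           # includes "imdb__rating" and "imdb__votes" columns
print(child_tables)  # the "cast" list becomes its own table
```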

3. Self-service analytics for MongoDB with Holistics.

After dlt ingests the data into your data warehouse, you can connect Holistics to the data warehouse and model, govern, and set up your self-service analytics platform for end-user consumption.

By combining dlt with Holistics we get the best of both worlds. The flexibility of an open source library for data ingestion that we can customize based on changing data needs, and a self-service BI tool in Holistics that can not only be used for analytics but also introduces a data modeling layer where metrics and data logic can be defined. Holistics also has support for Git version control to track code changes and can integrate with dbt for streamlining data transformations.

We took care of the data ingestion step in the previous section. We can now connect to our SQL data warehouse, and start transforming the data using the modeling layer in Holistics. We will be using the newest version of Holistics, Holistics 4.0 for this purpose.

To add a new data source in Holistics, click on the plus sign (+) in the top menu and select Connect Data Sources. Select New Data Sources and, for the database type, select Google BigQuery. We need to provide the service account credentials that were generated above when we connected dlt to BigQuery. For more detailed instructions on connecting BigQuery to Holistics refer to this guide.

Once the BigQuery source is added we are ready to import the schemas from BigQuery into Holistics. The schema (dataset_name) under which dlt loaded the MongoDB data is defined in the load_entire_database function when we create the MongoDB pipeline.

# Create a pipeline
pipeline = dlt.pipeline(
    pipeline_name="local_mongo",
    destination='bigquery',
    dataset_name="mongo_database",  # Schema Name
)

4. Modeling the Data and Relationships with Holistics.

To use the data, we will define a data model and the join paths that Holistics can use to build the semantic datasets.

A data model is an abstract view on top of a physical database table that you may manipulate without directly affecting the underlying data. It allows you to store additional metadata that may enrich the underlying data in the data table.

In Holistics, go to the Modelling 4.0 section from the top bar. We will be greeted with the Start page, as we have created no models or datasets yet. Turn on the development mode from the top left corner. The development mode allows you to experiment with the data without affecting the production datasets and reporting. To keep things organized, let’s create two folders called Models and Datasets.

Adding Holistics Data Model(s):

Under the Models folder, let's add the MongoDB data from BigQuery as Table Models. Hover over the Models folder and click on the (+) sign then select Add Table Model. In the Data Sources select the BigQuery Source we created before and then select the relevant table models to import into Holistics. In this case, we are importing the movies, movies_cast and movies_directors tables.

Holistics Add Model

Adding Holistics Dataset(s) and Relationships:

After the Data Models have been added, we can create a Dataset with these models and use them for reporting.

info

A dataset is a "container" holding several data models together so they can be explored together, dictating which join paths are used in a particular analytics use case.

Datasets work like data marts, except that they exist only in the semantic layer. You publish these datasets to your business users to let them build dashboards or explore existing data.

Hover over the Datasets folder, click on the (+) sign, and then select Add Datasets. Select the previously created Table Models under this dataset, and Create Dataset.

Holistics Create Dataset

We will then be asked to create relationships between the models. We create a Many-to-one (n - 1) relationship between the cast and the movies models.

Add Relationship between Models

The resulting relationship can be seen as code using the Holistics 4.0 Analytics as Code feature. To activate this feature, click on the newly created dataset and select the View as Code option from the top right. For more detailed instructions on setting up relationships between models, refer to the model relationship guide.

Previously, we created the relationship between the cast and the movies tables using GUI, now let’s add the relationship between the directors and movies tables using the Analytics as Code feature. In the dataset.aml file append the relationships block with the following line of code:

relationship(model__mongo_database_movies_directors.dlt_parent_id > model__mongo_database_movies.dlt_id, true)

After the change, the dataset.aml file should look like this:

import '../Models/mongo_database_movies.model.aml' {
  mongo_database_movies as model__mongo_database_movies
}
import '../Models/mongo_database_movies_cast.model.aml' {
  mongo_database_movies_cast as model__mongo_database_movies_cast
}
import '../Models/mongo_database_movies_directors.model.aml' {
  mongo_database_movies_directors as model__mongo_database_movies_directors
}

Dataset movies {
  label: 'Movies'
  description: ''
  data_source_name: 'bigquery_mongo'
  models: [
    model__mongo_database_movies,
    model__mongo_database_movies_cast,
    model__mongo_database_movies_directors
  ]
  relationships: [
    relationship(model__mongo_database_movies_cast.dlt_parent_id > model__mongo_database_movies.dlt_id, true),
    relationship(model__mongo_database_movies_directors.dlt_parent_id > model__mongo_database_movies.dlt_id, true)
  ]
  owner: 'zaeem@dlthub.com'
}

The corresponding view for the dataset.aml file in the GUI looks like this:

Add Relationship GUI

Once the relationships between the tables have been defined, we are all set to create some visualizations. We can select the Preview option next to the View as Code toggle to create visualizations in development mode. This comes in handy if we have connected an external Git repository to track our changes: we can test the dataset in preview mode before committing and pushing changes and deploying the dataset to production.

In the current scenario, we will just directly deploy the dataset to production as we have not integrated a Git Repository. For more information on connecting a Git Repository refer to the Holistics docs.

The Movies dataset should now be available in the Reporting section. We will create a simple visualization that shows the workload of the cast and directors. In simple words: how many movies did an actor or director work on in a single year?

Visualization and Self-Service Analytics with Holistics:

The visualization part is pretty self-explanatory and is mostly drag and drop, since we took the time to define the relationships between the tables. Below we create a simple table in Holistics that shows the actors that have appeared in the most movies since the year 2000.

Holistics Create Visualization

Similarly, we can add other reports and combine them into a dashboard. The resulting dashboard can be seen below:

Holistics Dashboard

Conclusion

In this blog, we have introduced a modern data stack that uses dlt and Holistics to address the MongoDB data accessibility issue.

We leverage dlt to extract, normalize, create schemas, and load data into BigQuery, making it more structured and accessible. Additionally, Holistics provides the means to transform and model this data, adding relationships between various datasets and ultimately enabling self-service analytics for a broader range of stakeholders in the organization.

This modern data stack offers an efficient and effective way to bridge the gap between MongoDB's unstructured data storage capabilities and the diverse needs of business, operations, and data science professionals, thereby unlocking the full potential of the data within MongoDB for the entire Company.

Additional Resources:

· 10 min read
Anton Burnashev

tl;dr: In this blog post, we'll build a RAG chatbot for Zendesk Support data using Verba and dlt.

As businesses scale and the volume of internal knowledge grows, it becomes increasingly difficult for everyone in the company to find the right information at the right time.

With the latest advancements in large language models (LLMs) and vector databases, it's now possible to build a new class of tools that can help get insights from this data. One approach to do so is Retrieval-Augmented Generation (RAG). The idea behind RAG is to retrieve relevant information from your database and use LLMs to generate a customised response to a question. Leveraging RAG enables the LLM to tailor its responses based on your proprietary data.
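To make the RAG idea concrete, here is a toy, self-contained sketch of the retrieval step: documents are scored by term overlap with the question and the best match is stitched into an augmented prompt. Everything here (the tickets, the scoring) is illustrative; a real setup like the one we build below uses embeddings and a vector database such as Weaviate instead of word overlap.

```python
# Toy sketch of the retrieval step in RAG: score documents by term
# overlap with the question, then build an augmented prompt for an LLM.
# The document store and scoring are deliberately simplistic.

def tokenize(text):
    return set(text.lower().split())

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most terms with the question."""
    q_terms = tokenize(question)
    scored = sorted(documents, key=lambda d: len(q_terms & tokenize(d)), reverse=True)
    return scored[:k]

def build_prompt(question, documents):
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Ticket 101: user cannot reset their password",
    "Ticket 102: invoice PDF download fails with error 500",
]
prompt = build_prompt("Why does the invoice download fail?", docs)
print(prompt)  # this augmented prompt would then be sent to the LLM
```

The LLM call itself is left out; in the RAG flow it would receive this prompt and generate an answer grounded in the retrieved context.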

Diagram illustrating the process of internal business knowledge retrieval and augmented generation (RAG), involving components like Salesforce, Zendesk, Asana, Jira, Notion, Slack and HubSpot, to answer user queries and generate responses.

One such source of internal knowledge is help desk software. It contains a wealth of information about the company's customers and their interactions with the support team.

In this blog post, we'll guide you through the process of building a RAG application for Zendesk Support data, a popular help desk software. We’re going to use dlt, Weaviate, Verba and OpenAI.

dlt is an open-source Python library that simplifies the process of loading data from various sources. It does not require extensive setup or maintenance, and it is particularly useful for CRM data, which is highly tailored to the needs of the business and changes frequently.

Weaviate is an open-source, AI-native vector database that is redefining the foundation of AI-powered applications. With capabilities for vector, hybrid, and generative search over billions of data objects, Weaviate serves as the critical infrastructure for organizations building sophisticated AI solutions and exceptional user experiences.

Verba is an open-source chatbot powered by Weaviate. It's built on top of Weaviate's state-of-the-art Generative Search technology. Verba includes a web interface and a query engine that uses the Weaviate database.

Prerequisites

  1. A URL and an API key of a Weaviate instance. We're using the hosted version of Weaviate to store our data. Head over to the Weaviate Cloud Services and create a new cluster. You can create a free sandbox, but keep in mind your cluster will expire and your data will be deleted after 14 days. In the "Details" of your cluster you'll find the Cluster URL and the API key.
  2. An OpenAI account and API key. Verba utilizes OpenAI's models to generate answers to user's questions and Weaviate uses them to vectorize text before storing it in the database. You can sign up for an account on OpenAI's website.
  3. A Zendesk account and API credentials.

Let’s get started

Step 1. Set up Verba

Create a new folder for your project and install Verba:

mkdir verba-dlt-zendesk
cd verba-dlt-zendesk
python -m venv venv
source venv/bin/activate
pip install goldenverba

To configure Verba, we need to set the following environment variables:

VERBA_URL=https://your-cluster.weaviate.network # your Weaviate instance URL
VERBA_API_KEY=F8...i4WK # the API key of your Weaviate instance
OPENAI_API_KEY=sk-...R # your OpenAI API key

You can put them in a .env file in the root of your project or export them in your shell.

Let's test that Verba is installed correctly:

verba start

You should see the following output:

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
ℹ Setting up client
✔ Client connected to Weaviate Cluster
INFO: Started server process [50252]
INFO: Waiting for application startup.
INFO: Application startup complete.

Now, open your browser and navigate to http://localhost:8000.

A user interface screenshot showing Verba, retrieval and augmented generation chatbot, powered by Weaviate

Great! Verba is up and running.

If you try to ask a question now, you'll get an error in return. That's because we haven't imported any data yet. We'll do that in the next steps.

Step 2. Install dlt with Zendesk source

We get our data from Zendesk using dlt. Let's install it along with the Weaviate extra:

pip install "dlt[weaviate]"

This also installs a handy CLI tool called dlt. It will help us initialize the Zendesk verified data source—a connector to the Zendesk Support API.

Let's initialize the verified source:

dlt init zendesk weaviate

dlt init pulls the latest version of the connector from the verified source repository and creates a credentials file for it. The credentials file is called secrets.toml and it's located in the .dlt directory.

To make things easier, we'll use the email address and password authentication method for Zendesk API. Let's add our credentials to secrets.toml:

[sources.zendesk.credentials]
password = "your-password"
subdomain = "your-subdomain"
email = "your-email@example.com"

We also need to specify the URL and the API key of our Weaviate instance. Copy the credentials for the Weaviate instance you created earlier and add them to secrets.toml:

[destination.weaviate.credentials]
url = "https://your-cluster.weaviate.network"
api_key = "F8.....i4WK"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "sk-....."

All the components are now in place and configured. Let's set up a pipeline to import data from Zendesk.

Step 3. Set up a dlt pipeline

Open your favorite text editor and create a file called zendesk_verba.py. Add the following code to it:

import itertools

import dlt
from weaviate.util import generate_uuid5
from dlt.destinations.adapters import weaviate_adapter

from zendesk import zendesk_support

def to_verba_document(ticket):
    # The document id is the ticket id.
    # dlt will use this to generate a UUID for the document in Weaviate.
    return {
        "doc_id": ticket["id"],
        "doc_name": ticket["subject"],
        "doc_type": "Zendesk ticket",
        "doc_link": ticket["url"],
        "text": ticket["description"],
    }

def to_verba_chunk(ticket):
    # We link the chunk to the document by using the document (ticket) id.
    return {
        "chunk_id": 0,
        "doc_id": ticket["id"],
        "doc_name": ticket["subject"],
        "doc_type": "Zendesk ticket",
        "doc_uuid": generate_uuid5(ticket["id"], "Document"),
        "text": ticket["description"],
    }

def main():
    pipeline = dlt.pipeline(
        pipeline_name="zendesk_verba",
        destination="weaviate",
    )

    # Zendesk Support has data tickets, users, groups, etc.
    zendesk_source = zendesk_support(load_all=False)

    # Here we get a dlt resource containing only the tickets
    tickets = zendesk_source.tickets

    # Split the tickets into two streams
    tickets1, tickets2 = itertools.tee(tickets, 2)

    @dlt.resource(primary_key="doc_id", write_disposition="merge")
    def document():
        # Map over the tickets and convert them to Verba documents
        # primary_key is the field that will be used to generate
        # a UUID for the document in Weaviate.
        yield from weaviate_adapter(
            map(to_verba_document, tickets1),
            vectorize="text",
        )

    @dlt.resource(primary_key="doc_id", write_disposition="merge")
    def chunk():
        yield from weaviate_adapter(
            map(to_verba_chunk, tickets2),
            vectorize="text",
        )

    info = pipeline.run([document, chunk])

    return info

if __name__ == "__main__":
    load_info = main()
    print(load_info)

There's a lot going on here, so let's break it down.

First, in main() we create a dlt pipeline and add a Weaviate destination to it. We'll use it to store our data.

Next, we create a Zendesk Support source. It will fetch data from Zendesk Support API.

To match the data model of Zendesk Support to the internal data model of Verba, we need to convert Zendesk tickets to Verba documents and chunks. We do that by defining two functions: to_verba_document and to_verba_chunk. We also create two streams of tickets. We'll use them to create two dlt resources: document and chunk. These will populate the Document and Chunk classes in Verba. In both resources we instruct dlt which fields to vectorize using the weaviate_adapter() function.

We specify primary_key and write_disposition for both resources. primary_key is the field that will be used to generate a UUID for the document in Weaviate. write_disposition tells dlt how to handle duplicate documents. In our case, we want to merge them: if a document already exists in Weaviate, we want to update it with the new data.
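Conceptually, the merge disposition behaves like an upsert keyed on the primary key. The sketch below shows that behaviour in plain Python; it is only an illustration, not dlt's actual implementation, which performs the merge at the destination.

```python
# Conceptual sketch of the "merge" write disposition: rows with a known
# primary key replace the stored row, new keys are inserted.

def merge_rows(existing, incoming, primary_key):
    table = {row[primary_key]: row for row in existing}
    for row in incoming:
        table[row[primary_key]] = row  # update existing key or insert new one
    return list(table.values())

stored = [{"doc_id": 1, "text": "old description"}]
new = [
    {"doc_id": 1, "text": "updated description"},  # same key -> update
    {"doc_id": 2, "text": "a brand new ticket"},   # new key  -> insert
]
result = merge_rows(stored, new, "doc_id")
print(result)  # two rows; doc_id 1 now holds the updated text
```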

Finally, we run the pipeline and print the load info.

Step 4. Load data into Verba

Let's run the pipeline:

python zendesk_verba.py

You should see the following output:

Pipeline zendesk_verba completed in 8.27 seconds
1 load package(s) were loaded to destination weaviate and into dataset None
The weaviate destination used <https://your-cluster.weaviate.network> location to store data
Load package 1695726495.383148 is LOADED and contains no failed jobs

Verba is now populated with data from Zendesk Support. However, there are a couple of classes that still need to be created in Verba: Cache and Suggestion. We can do that using the Verba CLI init command. When it runs, it will ask us if we want to create Verba classes. Make sure to answer "n" to the question about the Document class — we don't want to overwrite it.

Run the following command:

verba init

You should see the following output:

===================== Creating Document and Chunk class =====================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
Document class already exists, do you want to overwrite it? (y/n): n
⚠ Skipped deleting Document and Chunk schema, nothing changed
ℹ Done

============================ Creating Cache class ============================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
'Cache' schema created
ℹ Done

========================= Creating Suggestion class =========================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
'Suggestion' schema created
ℹ Done

We're almost there! Let's start Verba:

verba start

Step 5. Ask Verba a question

Head back to http://localhost:8000 and ask Verba a question. For example, "What are common issues our users report?".

A user interface screenshot of Verba showing Zendesk tickets with different issues like API problems and update failures, with responses powered by Weaviate

As you can see, Verba is able to retrieve relevant information from Zendesk Support and generate an answer to our question. It also displays the list of relevant documents for the question. You can click on them to see the full text.

Conclusion

In this blog post, we've built a RAG application for Zendesk Support using Verba and dlt. We've learned:

  • How easy it is to get started with Verba.
  • How to build a dlt pipeline with a ready-to-use data source.
  • How to customize the pipeline so it matches the data model of Verba.

Where to go next?

  • Ensure your data is up-to-date. With dlt deploy you can deploy your pipeline to Google's Cloud Composer or GitHub Actions and run it on a schedule.
  • Build a Verba RAG for other data sources. Interested in building a RAG that queries other internal business knowledge than Zendesk? With dlt you can easily switch your data source. Other dlt verified sources of internal business knowledge include Asana, Hubspot, Jira, Notion, Slack and Salesforce. However, dlt isn’t just about ready-to-use data sources; many of our users choose to implement their own custom data sources.
  • Learn more about how Weaviate works. Check out Zero to MVP course to learn more about Weaviate database and how to use it to build your own applications.
  • Request more features. A careful reader might have noticed that we used both Document and Chunk classes in Verba for the same type of data. For simplicity's sake, we assumed that the ticket data is small enough to fit into a single chunk. However, if you're dealing with larger documents, you might consider splitting them into chunks. Should we add chunking support to dlt? Or perhaps you have other feature suggestions? If so, please consider opening a feature request in the dlt repo to discuss your ideas!

Let's stay in touch

If you have any questions or feedback, please reach out to us on the dltHub Slack.

· 6 min read
Adrian Brudaru

I recently reviewed someone’s anonymized CV on Reddit as they complained they applied to dozens of jobs but never heard back. Shortly after, my inbox blew up with review requests.

My experience comes from 10+ years in the industry, and having reviewed thousands of CVs myself, having hired for multiple companies. I also asked the opinion of other hiring managers who have hired data engineers.

Why European and US companies? Because, in my experience, other parts of the world work more in enterprises than in startups or SMEs, and as a consequence they work with different tool sets. They often do "outsourcing" work, often on technologies used 5-20 years ago rather than modern ones. Even when the tech used is new, the role is usually very limited in an enterprise and generally lacks the end-to-end capabilities requirement. For example, it might mean doing drag-and-drop tools + SQL stored procedures, or it might mean doing only data loading without architecture and generally without an orchestrator, etc.

So, let’s cut to the chase - what is a hiring manager looking for in an application?

Cover letter

Let’s start with the shortest bit - cover letter. Definitely write one. If you apply by email, put it in the email body. If not, attach it to the application.

What should it contain? A couple of paragraphs about

  • why this company you are applying for is interesting to you
  • your motivation for applying and what you hope to achieve here
  • why you would be a good fit for the role.

Keep it honest. Write about the role and your intent, do not make it about you.

A definite don’t is a generic letter that talks about you, but doesn’t mention the role or company. This shows a lack of investment in the application and gives wrong signals. So if you cannot review the company and say why you like it, then the reviewer might feel the same way about you.

The CV

Rules of thumb:

  • Space is your friend. It drives focus to what matters. Don’t fill it with junk.
  • “Soft skills” are important. Show the human behind the role, and include a bio section with info about you.
  • Focus shows consideration for the reviewer. Tell me what’s relevant.
  • If you are junior, demonstrate potential. List your initiatives, learning, etc.
  • If you are senior, demonstrate results. Focus less on techs and more on outcomes.
  • Use a two-pager if you can. First page: everything else. Second page: your experience. If you do not have experience, stick to a one-pager. If you are in the US, you might be required to use a one-pager, in which case make your job descriptions shorter and more focused, dropping the irrelevant.

The parts

  1. Bio: name, where you come from, your values.

    • Image. You are a human talking to a human, don’t make it less, it doesn’t help. Be confident, friendly, smile.
    • “Values”. Your values are a significant predictor of your “direction”, behavior and interactions. List the top 3, not more. No BS.
    • “Motivation”. It indicates what fuels your fire, so the company knows whether they can provide that for you. One sentence.
    • “Mission” or what you are looking to do, indicates if your goals may align with role goals. One or two sentences max.
  2. Skills: The easiest thing to filter on, so list the relevant skills for the role you are applying for. Do not list everything you have ever used or tried. That only creates noise and the impression that you don’t have any practiced skills. Mention the relevant skills first, and consider skipping the rest. For example: “Python, SQL, orchestrator, GitHub, others.” instead of “Java, Javascript, Ruby, Go, R, C++, SQL, Python, Tableau, Excel, Looker, BigQuery, Bash, Basic, Assembly, MsWord, Google Sheets, PowerBI”. Make sure those skills are clearly displayed and optionally also mentioned in the listed jobs so people can judge how much you used them.

    You can personalize the list based on the job you apply for. If skills are listed outside the job description, look at who else works or worked there, they might list the tools or skills on their LinkedIn.

    If you are starting and need to gain the skills, make them happen. Take a course, learn them, and list them.

  3. Job experience. Format:

    What are we presenting? The context (company, team), the achievements (X, Y, Z), the takeaways (A, B, C, X, Y, Z experience)

    What are we not presenting? Things that are irrelevant to the role we are applying for or to our values. Less is more, as it shifts focus to the important. Also, no vanity achievements or internal company awards, which are often offered in consultancies. These are just tools to motivate you, and nobody cares about them externally.

  4. Education, studies, self-studies: Some groups care about the school. It is less critical, as school does not teach real-life data engineering, but it helps the reviewer understand where you come from.

  5. Projects, courses, volunteering: I care more about the demonstrated learning experience. I want to see that you keep learning. If not on the job, then I want to see courses or projects here. Volunteering also paints a picture of a “doer”.

The skills you need

This depends on the role, but most DE roles building standard platforms need the following:

  • Python
  • orchestrator (Airflow usually)
  • dimensional modeling and other data architectures
  • SQL
  • software best practices - versioning, CI/CD, testing, etc.

And the magic sauce…

… goes into the food, keep it out of the CV :) There’s no magic here, just clear communication.

Your three levers are

  • better communication
  • better skills
  • more applications

The rest is a matter of opportunities, depending on your location and the job market. If the opportunities are not available, consider entering the field via an adjacent role like an analytics engineer.

Screening CVs is a rapid process, so you need to ensure you check off the requirements while removing (not adding) any doubts. Any hire is a risk, so the screener wants to minimize that - help them do it by showing a good understanding of your role and what is expected of you. Don’t add info that isn’t needed, as that might add doubts.

I wish you good luck on your application process and encourage you to ask for feedback from the community if you get stuck.

· 7 min read
Adrian Brudaru
info

💡 TIL: BSON stands for binary JSON, not for BS Object Notation /s

What will you learn in this article?

The scope of this article is to look at and discuss the process of extracting data from MongoDB and making it available in a SQL store. We will focus on the difficulties around ingesting the data into a SQL database, while we will look at MongoDB only from a source perspective.

The focus is the data in its different states, not the underlying technologies.

Why the harsh title?

The title may sound harsh, but it accurately reflects the challenges often encountered when dealing with MongoDB.

Also referred to as /dev/null, Mongo's flexible schema and document-based storage can lead to data inconsistencies and complexities, resembling a metaphorical "dumpster" where data might be scattered and difficult to organise.

Analytical consumers of MongoDB will be forced to invest extra effort in data modelling, schema design, and querying optimisation to ensure data quality and retrieval efficiency.

Is this a common problem?

It's inevitable. An analytical system has multiple requirements which force us to move data from places such as MongoDB to a SQL store.

Let’s look at those requirements:

  • Business User access: Most data is used by business users to improve operations. Business users access data via dashboards, or pivot-table like interfaces that enable them to do custom aggregations. The tooling that exists to do this is created for SQL, as a place where relational queries can be executed.
  • Ecosystem of integrations and tools: In analytics, having a diverse ecosystem of integrations and tools is crucial. SQL databases seamlessly fit into this ecosystem, offering compatibility with a wide array of data warehousing, data integration, and data governance tools. This comprehensive ecosystem enhances the analytics infrastructure, ensuring that data can be efficiently managed, transformed, and accessed by various stakeholders.
  • Standardization for consistency: Maintaining data consistency is paramount in analytics. SQL's widespread adoption and standardized query language enable analysts and data professionals to work with data consistently across different systems and platforms. This standardization ensures that data is interpreted and manipulated uniformly, reducing the risk of errors and discrepancies in analytical processes.
  • Data transformation & modelling capabilities: Effective data transformation and modelling are prerequisites for meaningful analytics. SQL provides a robust toolkit for these tasks, enabling data professionals to perform complex operations such as joins, filtering, aggregation, and intricate calculations. These capabilities are essential for preparing raw data into structured formats that can be readily used for in-depth analysis, reporting, and decision-making in the analytics domain.

So, after looking at what is needed for analytics, it becomes clear that going off the beaten path will lead to some pretty gnarly limitations and outcomes.

Mongo in particular: BSON vs JSON

How is Mongo different from semi-structured data like JSON, and is MongoDB particularly hard to ingest from?

BSON is for performance, JSON is for transmission.

The differences stem from the fact that MongoDB uses BSON under the hood, as opposed to JSON. BSON is a binary object notation optimised for performance, while JSON is a standard interchange format.

Similarly, Mongo also supports custom and more complex data types, such as geospatial, dates, regex, etc., that JSON does not. Additionally, BSON supports character encodings. All these benefits enable MongoDB to be a faster and better database, but the cost is additional hurdles that must be crossed before we can use this data elsewhere.

So how do you solve these issues? Well, hopefully your development team didn't go overboard, and you can just simply convert the BSON to JSON. If you are unlucky, you will need to create your own mappers that follow whatever your team did.
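As a rough illustration of that conversion step, here is a stdlib-only sketch that maps BSON-specific types to JSON-friendly values via a json.dumps hook. The FakeObjectId class is a made-up stand-in for Mongo's ObjectId so the example is self-contained; in practice you would use pymongo's bson helpers instead.

```python
import json
from datetime import datetime, timezone

# Sketch of mapping BSON-specific types to plain JSON.
# FakeObjectId simulates an ObjectId purely for illustration.

class FakeObjectId:
    def __init__(self, hex_str):
        self.hex = hex_str

def bson_default(value):
    """json.dumps hook for types JSON cannot encode natively."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, FakeObjectId):
        return value.hex
    raise TypeError(f"cannot encode {type(value).__name__}")

doc = {
    "_id": FakeObjectId("650c1f7e9b1e8a3f2c4d5e6f"),
    "created_at": datetime(2023, 9, 21, 12, 0, tzinfo=timezone.utc),
    "subject": "Cannot log in",
}
as_json = json.dumps(doc, default=bson_default)
print(as_json)
```

If your team used heavily customised types, this `default` hook is exactly where your own mappers would go.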

From JSON to DB

Once you have converted your Mongo BSON into JSON, you can take advantage of JSON's wide support to have it ingested.

JSON enjoys widespread support across various data processing tools and systems, making it a versatile choice for data ingestion. With your data in JSON, you can seamlessly integrate it into your database, leveraging its compatibility to efficiently manage and analyze your information.

Cleaning and typing

Data typing is essential in ensuring data integrity. It involves assigning appropriate data types to JSON fields, like converting numerical values into integers or floats, representing dates as datetime types, and labeling text data as string data types. This step guarantees that the database accurately stores and processes information.
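A minimal sketch of such a typing pass might look like the following; the field names and coercions are purely illustrative:

```python
from datetime import datetime

# Coerce string values coming out of JSON into proper Python types
# before loading: ints, floats, datetimes, strings.

def type_row(row):
    return {
        "id": int(row["id"]),
        "amount": float(row["amount"]),
        "created_at": datetime.fromisoformat(row["created_at"]),
        "note": str(row["note"]),
    }

raw = {"id": "7", "amount": "19.99", "created_at": "2023-09-21T12:00:00", "note": 42}
typed = type_row(raw)
print(typed)
```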

Do we unpack?

The choice between unpacking nested JSON into tables or keeping it as JSON depends on your specific needs. Unpacking enhances query performance, indexing, and data manipulation within relational databases. However, native JSON support in some databases can suffice for simpler scenarios, preserving the original hierarchical structure. Your decision should align with data analysis, retrieval requirements, and your chosen database's capabilities.

Simply put, if you plan to use the data, you should probably unpack it to benefit from what relational dbs have to offer. But if you simply need to store and retrieve the json, do not convert it.

Querying unpacked data is cheaper and more robust than maintaining wet json_extract() code

Unpacking nested JSON into separate tables within a relational database is essential for robustness and query efficiency. Relational databases are optimized for tabular data and typed columns, making it challenging and error-prone to handle complex nested structures directly.

By breaking down nested JSON into separate tables and establishing relationships through foreign keys, the data becomes more structured, ensuring robust data management and enhancing query efficiency. This simplification streamlines data retrieval and manipulation, aligning it with standard SQL operations for efficient and effective use.
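The unpacking described above can be sketched in a few lines of plain Python: nested lists are split off into a child table and linked back to the parent via a generated key. This is only a conceptual illustration of what a normalisation engine, like the one dlt ships with and which is demonstrated in the next section, does automatically.

```python
import hashlib

# Sketch of unpacking a nested document into a parent and a child table,
# linked via a generated key.

def row_id(row):
    # deterministic short key derived from the row's content
    return hashlib.sha1(repr(sorted(row.items())).encode()).hexdigest()[:14]

def unpack(doc, list_field):
    parent = {k: v for k, v in doc.items() if k != list_field}
    parent["_id"] = row_id(parent)
    children = [
        {**child, "_parent_id": parent["_id"], "_list_idx": i}
        for i, child in enumerate(doc[list_field])
    ]
    return parent, children

doc = {"id": 1, "name": "Alice", "children": [{"name": "Eve"}, {"name": "Wendy"}]}
parent, children = unpack(doc, "children")
print(parent)
print(children)  # each child row carries the parent's generated key
```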

Start using dlt to load Mongo to SQL today

To help with the challenges of loading Mongo data, we created a dlt source that reads your mongo collections and throws flat sql tables on the other side.

The benefit of using dlt is that you get flat tables in your SQL database that adapt to match the Mongo schema.

Here's a code explanation of how it works under the hood:

  1. It grabs data from Mongo and turns it into JSON.

  2. From JSON, dlt leverages schema inference and evolution to make sense of the data. Here is an example of how this nested data could look:

    data = {
        'id': 1,
        'name': 'Alice',
        'job': {
            "company": "ScaleVector",
            "title": "Data Scientist",
        },
        'children': [
            {'id': 1, 'name': 'Eve'},
            {'id': 2, 'name': 'Wendy'},
        ],
    }
  3. We can load the data to a supported destination declaratively:

    import dlt

    pipeline = dlt.pipeline(
        pipeline_name='from_json',
        destination='duckdb',
        dataset_name='mydata',
        full_refresh=True,
    )
    # dlt works with lists of dicts, so wrap data to the list
    load_info = pipeline.run([data], table_name="json_data")
    print(load_info)
  4. Now we can use the data, these are two tables:

    json_data

    index  id  name   job__company  job__title      _dlt_load_id       _dlt_id
    0      1   Alice  ScaleVector   Data Scientist  1693922245.602667  0ZbCzK7Ra2tWMQ

    json_data__children

    index  id  name   _dlt_parent_id  _dlt_list_idx  _dlt_id
    0      1   Eve    0ZbCzK7Ra2tWMQ  0              TjzpGZ+dwrrQhg
    1      2   Wendy  0ZbCzK7Ra2tWMQ  1              RdqpN1luoKxQTA

    Note that the original json got unpacked into tables that are now joinable via generated keys child._dlt_parent_id = parent._dlt_id.

Read more about it here: Mongo verified source.

What are you waiting for?

· 6 min read
Adrian Brudaru
info

PSSSST! You do ELT, right? not ETL? asking for a friend...

ETL vs ELT? A vendor driven story.

One of the earliest tooling for "ETL" data was Pentaho Kettle. Kettle stands for "Kettle Extraction Transformation Transport Load Environment" and signifies that it transforms the data before loading it. It was usually used to load data which was later transformed in SQL via "SQL scripts", while still in the tool, or via database triggers or views outside of the tool.

Indeed, the tool creators imagined some folks would write java to transform before loading, but the vast majority of data users just wanted to use SQL.

Sound familiar? This is not so different from today's "ELT", is it?

Why did we call it ELT?

The people

Well, first of all, SQL is much more accessible and very powerful for transforming tables, columns and rows, whereas programming handles single values. So before purpose-built tooling existed, data people were already doing the transform in SQL - it just made sense.

The "EL" vendors

In the decade following Pentaho, SaaS solutions started offering pipelines that load data into your database, removing the option to tinker with it before loading. For this reason, they called it "ELT".

The db vendors

The concept also resonated with MPP (massively parallel processing) databases such as Snowflake, Redshift and BigQuery, which were more than happy to encourage doing all the compute on their side.

The "T in ELT" vendors

Another puzzle piece was dbt, a tool purpose-built for SQL transforms. So if there's a question of ETL or ELT, dbt can only answer ELT. In dbt's world view, data starts dirty in your warehouse, where you "rename, cast, join, enrich" - a true ELT. To make the drudgery of data cleaning in SQL easier, dbt offers some Python support to enable generating some of the typing and renaming SQL. They also offer a little bit of Python support for scalar operations in some db vendor systems.

What do we really do?

Most of us do a little bit of both - we extract with python, and the next steps are loading, cleaning and curation. In some cases, cleaning and curation are optional. For example, when we load a report from another platform we will probably not need to clean or curate anything.

Where do we clean data?

Data cleaning usually refers to normalising the data into correct types, usable names, etc. Doing this in SQL means writing a lot of manual code that needs to be maintained. On the other hand, structuring data in Python isn't easy either; it's just less technically difficult, but when metadata is missing it becomes guesswork.

So, technically the easier place to clean data is in Python, but the majority will likely do it in SQL, as they are more practiced in SQL.

Where do we transform data?

When it comes to working with tables, SQL is still the better place to be. Joins and aggregations are the core operations that will happen here and they would be much harder to handle scalably in python.
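To illustrate why joins and aggregations belong in SQL, here is a tiny self-contained example using Python's built-in sqlite3 module; the tables are made up for illustration, and the same kind of query scales to warehouse engines.

```python
import sqlite3

# The "big T" in a nutshell: a join plus an aggregation expressed in SQL,
# run against an in-memory SQLite database with illustrative tables.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")
rows = con.execute("""
    SELECT u.name, SUM(o.amount) AS total
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Alice', 25.0), ('Bob', 7.5)]
```

Expressing the same join and grouping over plain Python dicts would take far more code and would not scale the same way.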

dlt puts the small t back in EtLT, let's see how.

So, python is still superior at a few operations

  • Typing, renaming, normalising, unpacking
  • complex scalar operations

While we will leave the aggregations and joins to the big T, SQL.

Normalisation, typing, unpacking

dlt does this well out of the box. Automatic typing, renaming, flattening, and ddl deployment are all handled by the schema inference and evolution engine. This engine is configurable in both how it works and what it does; you can read more here: Normaliser, schema settings

Here is a usage example (it's built into the pipeline):


import dlt

# Json, dataframes, iterables, all good
# the data will be auto typed and normalised
data = [{'id': 1, 'name': 'John'}]

# open connection
pipe = dlt.pipeline(destination='bigquery',
                    dataset_name='raw_data')

# self-explanatory declarative interface
job_status = pipe.run(data,
                      write_disposition="merge",
                      primary_key="id",
                      table_name="users")

# optionally load schema and metadata
pipe.run([job_status],
         write_disposition="append",
         table_name="loading_status")

Scalar operations

Sometimes we need to edit a column's value in some very specific way for which SQL doesn't quite cut it. Sometimes, we have data we need to pseudonymise before loading for regulatory reasons.

Because dlt is a library, you can easily change how the data stream is produced or ingested. Besides your own customisations, dlt also supports injecting your transform code inside the event stream; see an example here

Here is a code example of pseudonymisation, a common case where data needs to be transformed before loading:

import dlt
import hashlib

@dlt.source
def dummy_source(prefix: str = None):
    @dlt.resource
    def dummy_data():
        for _ in range(3):
            yield {'id': _, 'name': f'Jane Washington {_}'}
    return dummy_data(),

def pseudonymize_name(doc):
    '''
    Pseudonymisation is a deterministic type of PII-obscuring.
    Its role is to allow identifying users by their hash,
    without revealing the underlying info.
    '''
    # add a constant salt to generate a deterministic hash
    salt = 'WI@N57%zZrmk#88c'
    salted_string = doc['name'] + salt
    sh = hashlib.sha256()
    sh.update(salted_string.encode())
    hashed_string = sh.digest().hex()
    doc['name'] = hashed_string
    return doc


# 1. Create an instance of the source so you can edit it.
data_source = dummy_source()
# 2. Modify this source instance's resource
data_source = data_source.dummy_data().add_map(pseudonymize_name)
# 3. Inspect your result
for row in data_source:
    print(row)
# {'id': 0, 'name': '96259edb2b28b48bebce8278c550e99fbdc4a3fac8189e6b90f183ecff01c442'}
# {'id': 1, 'name': '92d3972b625cbd21f28782fb5c89552ce1aa09281892a2ab32aee8feeb3544a1'}
# {'id': 2, 'name': '443679926a7cff506a3b5d5d094dc7734861352b9e0791af5d39db5a7356d11a'}

pipeline = dlt.pipeline(pipeline_name='example', destination='bigquery', dataset_name='normalized_data')
load_info = pipeline.run(data_source)

The big T

Finally, once you have clean data loaded, you will probably prefer to use SQL and one of the standard tools. dlt offers a dbt runner to get you started easily with your transformation package.

pipeline = dlt.pipeline(
    pipeline_name='pipedrive',
    destination='bigquery',
    dataset_name='pipedrive_dbt'
)

# make or restore venv for dbt, using latest dbt version
venv = dlt.dbt.get_venv(pipeline)

# get runner, optionally pass the venv
dbt = dlt.dbt.package(
    pipeline,
    "pipedrive/dbt_pipedrive/pipedrive",  # or use public git "https://github.com/dbt-labs/jaffle_shop.git"
    venv=venv
)

# run the models and collect run info
# if running fails, the error will be raised with a full stack trace
models = dbt.run_all()

# optionally log dbt run status
pipeline.run([models],
             write_disposition="append",
             table_name="_models_log")

In conclusion

ETL vs ELT was never really a debate. With some exceptions, almost everyone transforms the data in SQL - but what they call this process depends on who's telling the story.

While it's easier to do most of the transformation in SQL, the tedious parts are completely automatable in Python, and the dirty data doesn't need manual normalisation. With dlt, you can do ETL or ELT, or even better, both, as EtLT.

Or, if you're feeling funny, you can add duckdb in the middle and go full EtLTLT, where you have an additional T step in the middle for the kinds of operations that could be done locally. Afterwards, you could load to operational systems to add one more L to the name :)

Fundamentally, we all agree it's all ETL, with the flavors simply designating specific sub-types.

Start using dlt today

What are you waiting for?

This demo works on Codespaces. Codespaces is a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through the further steps.
The demo uses the Continue VSCode extension.

Off to codespaces!
