Youprep: Why it's easier than SQL

Written by Toucan Toco | Feb 16, 2022 11:09:51 AM

Toucan Toco is a Guided Business Analytics solution. It helps large audience of business users to take better decision with clear and simple Data Apps, in various industries around the world.
In addition to dashboarding, Toucan provides simplified data preparation features to enable business oriented data transformation for non technical users.
As most no-code platforms, Toucan actually began with a “low-code” approach. The journey to “no-code” was full of interesting steps we’d like to share with you today.

An interesting observation

Back in the day, when Toucan wasn’t providing “no-code” data preparation, coding skills were required, for transforming and querying data. More specifically, Python Scripts & Mongo Queries had to be written to build any application or report. Twist was, those were not only written by formally trained developers but also by Customer Success Managers that learned this skill on the job.
Even if the coding tasks could be unappealing to them, Toucan’s product specialists managed to get the job done on non-trivial use cases where they could have struggled a lot more if they had to do it another way, with large, nested SQL queries for example. The successful adoption of this paradigm in application building was a no-brainer for us: there was something more to do with it as it enabled more with Python Scripts & Mongo Queries that was not available in good old SQL queries.

A battle-tested data preparation paradigm

In a nutshell, the Toucan’s way of doing data preparation was always based on pipelines, a series of simple transformation steps. Back in time, those transformation steps were either Python functions or Mongo Aggregation Pipeline Stages.

A transformation pipeline looked like this:

PIPELINE: [
  {
    input_domains: ["teams"]
    function_name: "prepare_team_requester"
    output_domain: "teamrequester"
  },
  {
    input_domains: ["github_base_dataset_raw"]
    function_name: "prepare_base_dataset"
    output_domain: "github_base_dataset"
  }
]

This snippet is an extract of the data preparation configuration file orchestrating the app’s data preparation

And it made use of steps define as Python functions like this:

def prepare_team_requester(df):
    try:
        df['teams'] = df['teams'].fillna("[]").apply(eval)
    except TypeError:
        pass
    return df


def prepare_base_dataset(dataset):
    try:
        dataset['PR Type'] = dataset['PR Type'].fillna("[]").apply(eval)
    except TypeError:
        pass
    try:
        dataset['teams'] = dataset['teams'].fillna("[]").apply(eval)
    except TypeError:
        pass'
    return dataset

This snippet is an extract of an “augment” file containing Python functions used in pipeline

These configuration files, defined the “upstream” data preparation. But data visualizations require specific Data Model in input, thus a “post-process” transformation pipeline similar to the “pre-process” transformation pipeline was also available to enable application builders to do the last kilometer of their data preparation.

After few years doing application building leveraging this “hand-crafted” approach based on code, the process proved to be very flexible. However, often, when the app building falls apart, because of mistakes in the pipeline or changes in the data source, app builders were left alone with their coding skills only. Moreover, training non developers to build pipeline was ok at first but proved to be less and less managable as we scaled up.

YouPrep: pipelines for everyone

When data prep becomes playful ! We @Toucan loved the trial & error approach so it’s seemed to be the perfect user experience to offer when handling data operations. For us, the key aspects to make it playful are:

Always display the data when editing the pipeline → No need to navigate between screens to see what is the data model.
Give visual feedback each time a transformation step is added/modified on a subset of data → No need to ask for a preview.
Better error recovery: if a step breaks the pipeline (either from human mistake or system failure), removing it is pretty straight forward unlike reformatting a deeply nested sql query.
Enable columns suggestion & auto-completion in steps configuration → No need to scroll on the data preview screen to see the exact syntax of a column.

All of this on top of:

Providing the most common data transformation steps as well as more specific ones that are essential to some Toucan’s charts data model. For example, we have a cool waterfall chart that delight investors relations managers or financial controllers. Its data model is quite complicated at first but guess what ? We have a dedicated waterfall step to make their life way simpler!
Providing types indications on columns → Helps the application builder to understand that a conversion is needed for further operations. For example, date operations.
Copy/Paste: the application builder can easily replicate an existing step to greatly speed up the pipeline’s design
What if the application builder doesn’t find the step he’s looking for ? We style provide a “custom” step allowing to code the step in the pipeline’s language flavor (Mongo or SQL for now).

Below is a 30sec demonstration of a pipeline’s design, the step by step preview and the edition of step:

Implementing Youprep, allowed to completely get rid of python scripts to build data transformation pipelines. For us it was a huge progress in our relations with application builders and onboarding managers.

Educational UX is better for everyone

At first, YouPrep may seem unimpressive to BI practitioner that are familiar with SQL or other database query language. Until, they get used to the step by step preview allowing you to see the shape & content of the dataset your working with at any point of the pipeline. By comparison, getting the same workflow in any usual SQL editor requires painfully commenting out part of your query.

It may seems like a gimmick at first when creating simple data pipelines. However, when you start creating and, most importantly, maintaining larger ones, either created by your “past self” or harder: someone else, it instantaneously became a must have!

To back this statement, we have the testimony of François D., a Software Engineer as well as one of our loyal customers and expert user of YouPrep. In his Toucan Toco application, he uses a varied mix of graphs with filters, fed by many different queries.

Almost all of them have been built via YouPrep. For example, one of these queries is built using the steps:

Convert text column to date
Add missing dates
Set missing values to 0
Extract the first day of the week of a date
Rename date column to "daily" and week column to “weekly”
Group by on a column whose name is set via a field selected by the user (via filters)
finally, rename the requester’s column to a fixed value usable in the construction of the graph

François tried to do this data preparation directly with an SQL query but was quickly confronted with problems that were not insoluble but slowed down greatly the creation of the pipeline. For example, what is the syntax of the date extraction function? Which query structure to adopt, nested queries or Common Table Expression? How to implement the missing dates function?
Assuming that these problems have been solved, how can we train other colleagues potentially less experienced with SQL on pipeline maintenance?

He quickly felt that using YouPrep was a better answer to solve these issues. According to his words building the pipeline with YouPrep was done quite easily and without having to look at any public documentation.

François concluded his testimony by few words that made us proud of what we built:

Building data pipelines with YouPrep is like switching from a command line interface to graphical user interface. YouPrep is the graphical interface for SQL - François D. (Software Engineer)

Wrapping up

Thanks to a large sample of technical & non-technical customers and partners testimonies, we’re confident enough to share with the world that YouPrep, the result of our journey to provide a no-code data preparation tool, is a success. It came a long way since it’s first release in October 2019, with a lot of improvement over the years and we’re still actively working on making it the easiest no-code data preparation tool out there. YouPrep has a lot of cool other features we didn’t mention in this article such as “real” date management, handling of huge data volumes (a billion or rows or more), etc…

Cherry on the cake, YouPrep’s backbone is an open source project since day one, it’s called “Weaverbird (Toucan Toco's data pipelines toolkit)”. In an upcoming article, we’d like to share more about this choice and have an “under the hood” tech overview of this awesome project.

View full post