Avocado Datalake

Avocado Datalake simplifies data management for your organization.

We are a data lake consultancy specializing in transforming raw data into actionable insights. Our end-to-end solutions encompass data ingestion, storage, management, and discovery. We seamlessly integrate data from diverse sources, including MySQL, Amazon Aurora, Cloud SQL, Spanner, Apache Kafka, and MongoDB, into a centralized repository (data lake) built on cloud storage such as AWS S3 or Google Cloud Storage. We leverage Apache Hudi, Delta Lake, or Apache Iceberg as the open table format, enabling change data capture (CDC) and reliable read and write operations.
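As a concrete illustration of the ingestion step, here is a minimal PySpark sketch that reads a source table over JDBC and upserts it into a Hudi table on S3. It assumes the Hudi Spark bundle is on the classpath; the host, bucket, table, and field names are placeholders, not an excerpt from a delivered codebase.

```python
# Minimal ingestion sketch: JDBC source -> Hudi table on S3.
# Assumes the Hudi Spark bundle is on the classpath; all names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("avocado-ingest")
    # Hudi recommends Kryo serialization and its SQL session extension.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

# Pull a batch of source rows; a JDBC read from MySQL is shown here.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-host:3306/shop")  # hypothetical source
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Upsert into the lake; record key and precombine field are placeholders for
# whatever uniquely identifies and orders your rows.
(
    orders.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3a://avocado-lake/orders")  # hypothetical bucket
)
```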

To maximize the value of your data lake, we implement advanced metadata management using AWS Glue Data Catalog, Unity Catalog, or GCP Data Catalog. This enables seamless data discovery and analysis through tools like Amazon Athena, Presto, Apache Airflow, Looker, and Looker Studio. Our expertise extends to data governance and security, providing best practices for table access and permissions using AWS Lake Formation.
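As one illustration of this discovery flow, the following sketch runs an ad hoc query through Amazon Athena against a table registered in the Glue Data Catalog, using boto3. The database, table, region, and results bucket are assumptions for the example.

```python
# Discovery sketch: query a Glue-cataloged table through Amazon Athena.
# Database, table, region, and results bucket below are assumptions.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

run = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "avocado_lake"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://avocado-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```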

Furthermore, we can connect your data lake storage to enterprise data warehouses such as Amazon Redshift and BigQuery.
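On the GCP side, this warehouse hookup can be as light as an external table definition. The sketch below exposes Parquet files in GCS to BigQuery using the google-cloud-bigquery client; the project, dataset, and bucket names are hypothetical, and Redshift Spectrum plays the analogous role on AWS.

```python
# Warehouse hookup sketch: expose lake Parquet files to BigQuery as an
# external table. Project, dataset, and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="avocado-analytics")

# Point an external table definition at the Parquet files in the lake.
external = bigquery.ExternalConfig("PARQUET")
external.source_uris = ["gs://avocado-lake/orders/*.parquet"]

table = bigquery.Table("avocado-analytics.lake.orders")
table.external_data_configuration = external
client.create_table(table, exists_ok=True)

# The warehouse can now query (and join) lake data like any native table.
for row in client.query("SELECT COUNT(*) AS n FROM lake.orders").result():
    print(row.n)
```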

We partner with organizations to unlock the full potential of their data and drive data-driven decision making.

Avocado Datalake Architecture

A high-level architecture of our proposed solution for managing all of your organization's data sources in a unified data lake.

Sources

MySQL
AWS Aurora
GCP Cloud SQL
GCP Cloud Spanner
Parquet files
Apache Kafka (see the streaming sketch below)
JSON files on S3 or GCS
CSV files on S3 or GCS
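To show how the streaming entry in the list above fits in, here is a minimal PySpark Structured Streaming sketch that lands a Kafka topic in the lake; the broker, topic, and paths are illustrative.

```python
# Streaming ingestion sketch: Kafka topic -> raw lake files.
# Assumes the Spark Kafka connector; broker, topic, and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("avocado-kafka-ingest").getOrCreate()

# Kafka delivers key/value as binary columns; keep the value as a string payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "orders-cdc")                 # hypothetical topic
    .option("startingOffsets", "earliest")
    .load()
    .select(col("value").cast("string").alias("payload"))
)

# Land the raw stream in the lake; the same sink could be Hudi, Delta, or Iceberg.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://avocado-lake/raw/orders-cdc")
    .option("checkpointLocation", "s3a://avocado-lake/_checkpoints/orders-cdc")
    .start()
)
query.awaitTermination()  # block until the stream is stopped
```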

Table Formats

Apache Hudi
Delta Lake (Databricks)
Apache Iceberg
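Because all three formats are Spark-native, switching between them is largely a matter of configuration. The sketch below shows the shape of such a switch; the function, table, and field names are illustrative, not an excerpt from our actual codebase.

```python
# Format-switch sketch: one write path, three table formats.
# The function and all table/field names are illustrative.
from pyspark.sql import DataFrame


def write_to_lake(df: DataFrame, table_format: str, path: str) -> None:
    """Write a DataFrame as Hudi, Delta, or Iceberg with one switch."""
    if table_format == "hudi":
        (
            df.write.format("hudi")
            .option("hoodie.table.name", "orders")
            .option("hoodie.datasource.write.recordkey.field", "order_id")
            .option("hoodie.datasource.write.precombine.field", "updated_at")
            .mode("append")
            .save(path)
        )
    elif table_format == "delta":
        df.write.format("delta").mode("append").save(path)
    elif table_format == "iceberg":
        # Iceberg writes address a catalog table rather than a bare path.
        df.writeTo("lake.orders").append()
    else:
        raise ValueError(f"unknown table format: {table_format}")
```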

Data Discovery

Amazon Athena
Apache Airflow
Delta Lake (Databricks)
Any source can be easily accommodated for ingestion.
Any table format can be switched easily in our pre-built codebase (see the sketch in Table Formats above).
We can attach any data discovery tool to our data lake through Hive-style metadata cataloging, as sketched below.
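As an illustration of that last point, the sketch below registers a lake location as a Hive-style external table in the Glue Data Catalog via boto3, which is what makes it visible to Athena and other catalog-aware tools; the database, table, schema, and location are assumptions.

```python
# Cataloging sketch: register a lake path as a Hive-style external table in
# the Glue Data Catalog. Database, table, schema, and location are assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_table(
    DatabaseName="avocado_lake",
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "status", "Type": "string"},
                {"Name": "updated_at", "Type": "timestamp"},
            ],
            "Location": "s3://avocado-lake/orders/",
            # Standard Hive Parquet input/output formats and SerDe.
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```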