In the latest blog post in our data infrastructure series, we’ll take a closer look at two important aspects of a cloud-based data infrastructure system: data pipelines and orchestration.
Data pipelines automate the flow of data, enabling teams to derive valuable insights and make data-driven decisions. A pipeline consists of a series of processes that move data from a source system to a target system. The most common data pipeline model involves extracting data from various sources, transforming it into standardized formats, and loading it into the target storage system. This extract, transform, load (ETL) process makes it practical to analyze data from disparate systems at scale, significantly improving efficiency and data quality.
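To make the ETL flow concrete, here is a minimal sketch in Python using pandas (with a Parquet writer such as pyarrow available). The file paths, column names, and output location are hypothetical placeholders; a production pipeline would add validation, error handling, and incremental loads.

```python
# Minimal ETL sketch: extract raw CSV logs, transform to a standard schema, load as Parquet.
# Paths and column names are hypothetical.
import pandas as pd

def extract(source_paths):
    # Extract: read raw records from each source file.
    return [pd.read_csv(path) for path in source_paths]

def transform(frames):
    # Transform: merge sources and normalize column names and timestamps into one schema.
    combined = pd.concat(frames, ignore_index=True)
    combined.columns = [c.strip().lower() for c in combined.columns]
    combined["timestamp"] = pd.to_datetime(combined["timestamp"], utc=True)
    return combined

def load(df, target_path):
    # Load: write the standardized table to the target store in Parquet format.
    df.to_parquet(target_path, index=False)

if __name__ == "__main__":
    frames = extract(["vehicle_a_log.csv", "vehicle_b_log.csv"])
    load(transform(frames), "standardized/logs.parquet")
```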
Integrating data into a common format can be processing intensive and challenging, depending on the heterogeneity of the source data. It is also important that the target format is optimized appropriately for the intended workloads. Primary considerations include the complexity of your source data, the tools you will use to access that data, and the ways in which you plan to aggregate and query your data. You also need to consider the tradeoffs between the resources needed to transform data and the overall performance of the pipelines.
For analytics use cases, common data formats are Parquet, Avro, and Arrow. Parquet, for example, is a columnar format optimized for analytics, while Avro is a row-based format that performs better when all fields need to be accessed and also provides robust support for schema evolution. Careful selection of data formats can improve the performance and efficiency of your pipelines and your data access tools.
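As a small illustration of why format choice matters, the sketch below writes a tiny telemetry table to Parquet with pyarrow and then reads back only a single column for analysis; a row-based format would have to scan every full record to answer the same query. The file name and column names are hypothetical.

```python
# Columnar formats let analytics jobs read only the columns they need.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small telemetry table once (hypothetical fields).
table = pa.table({
    "timestamp": [1.0, 2.0, 3.0],
    "speed_mps": [4.2, 4.8, 5.1],
    "battery_pct": [98, 97, 96],
})
pq.write_table(table, "telemetry.parquet")

# A later analytics query scans just one column instead of the whole row set.
speeds = pq.read_table("telemetry.parquet", columns=["speed_mps"])
print(speeds.to_pandas().mean())
```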
You may also need to store data in multiple formats at the same time to optimize for different workloads. While this may sound excessive, storage is often cheaper than the compute and time required to run data pipelines on demand each time the data is needed. For example, if you store your data only as .bag files, accessing topic data in aggregate for analysis across multiple logs requires a significant amount of processing and time, and you pay that cost every time you run an aggregated analysis. If instead you run an appropriate data pipeline once and store the data in a format optimized for aggregated analysis, you pay for a single pipeline run and the associated storage, and the data remains available for ongoing analysis without additional processing costs.
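A one-time conversion pipeline along these lines might look like the following sketch, which reads one topic out of a ROS 1 .bag file with the rosbag Python package and writes it to Parquet for later aggregated analysis. The bag name, topic, and message fields are hypothetical, and the sketch assumes the rosbag and pyarrow packages are installed.

```python
# One-time extraction sketch: pull a single topic from a .bag log into a Parquet table
# so later aggregated queries don't have to reprocess the raw log.
import rosbag
import pyarrow as pa
import pyarrow.parquet as pq

rows = {"stamp": [], "linear_x": []}
with rosbag.Bag("drive_2023_08_01.bag") as bag:
    # Topic and message fields are hypothetical (here, an odometry-style message).
    for _topic, msg, t in bag.read_messages(topics=["/odom"]):
        rows["stamp"].append(t.to_sec())
        rows["linear_x"].append(msg.twist.twist.linear.x)

# Pay the extraction cost once; reuse the columnar output for every later query.
pq.write_table(pa.table(rows), "drive_2023_08_01_odom.parquet")
```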
Managing the data lifecycle for all of the various data representations can be a tricky but necessary optimization for long-term infrastructure efficiency. We will cover this topic in a future blog post in this series.
Orchestration of data pipelines presents a separate challenge. Configuring an orchestration tool can be a complex task requiring dependency management and integration with your infrastructure. Several workflow tools on the market seek to solve this problem for organizations. There are well-known, open-source tools such as Airflow and Luigi, and managed solutions such as Cadence, Dagster, and Astronomer. The tradeoffs, however, are between cost, ease of configuration, management, and scalability – some tools are more cost-effective (or even free) but require a heavy lift from your team to configure, support, and integrate with your infrastructure in a scalable, reliable way. Others provide simple configurations for infrastructure integration but are more expensive to license.
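For a sense of what that configuration involves, here is a minimal Airflow DAG sketch. The task functions, DAG name, and schedule are hypothetical placeholders for your own pipeline code; the point is that the orchestrator handles ordering, scheduling, and retries once the dependencies are declared.

```python
# Minimal Airflow DAG sketch: declare the ETL steps and their dependencies,
# and let the orchestrator handle scheduling, ordering, and retries.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_logs():
    ...  # pull new robot logs from the source system (placeholder)

def transform_logs():
    ...  # normalize formats, e.g. convert topic data to Parquet (placeholder)

def load_logs():
    ...  # load results into the analytics store (placeholder)

with DAG(
    dag_id="robot_log_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_logs)
    transform = PythonOperator(task_id="transform", python_callable=transform_logs)
    load = PythonOperator(task_id="load", python_callable=load_logs)

    # Declare the dependency chain: extract, then transform, then load.
    extract >> transform >> load
```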
Regardless of the solution you select, your team still needs to do the data engineering work to build the pipelines: determining the appropriate data formats, storage, and access tools for the required workloads, and writing the code to perform any required translation.
At Model-Prime, we provide a turn-key solution for robotics teams that includes built-in data pipelines that are purpose-built for robotics development. Our platform provides analytics capabilities but removes the burden of configuration, data engineering, and scaling from your team. Find more details about how you can use our platform at https://docs.model-prime.com, or contact us for a demo.