The overarching aim of DataflowOpt is to capitalize on significant preliminary results by the investigators in the area of dataflow optimization and to reshape the way dataflow optimization is approached, addressing the significant limitations that state-of-the-art solutions suffer from.

The broader technical goal is to make cost-based automated dataflow optimization available to designers, in order to facilitate and strengthen the adoption of advanced analytics by a much broader community in practice, from SMEs to large scientific laboratories.

An additional goal of this proposal is to avoid re-inventing the wheel: to provide concrete proof-of-concept technical results on the confluence of Business Process Management (BPM) and big data management methods for process re-engineering, through the transfer of techniques from dataflow optimization.

The expected key results of the project include:

  • Extensions to current optimization algorithms, so that they can work efficiently in combination.
  • Development of cost models that better reflect execution time, resource consumption, throughput, monetary cost and latency in both dataflows and business processes, both to describe flow execution and to drive optimizations.
  • Development of efficient semi-automated techniques for acquiring the necessary metadata, with a focus on task cost per invocation, selectivity and dependency constraints.
  • Concrete proof-of-concept regarding the value in blending together dataflow and business process optimization in a way that avoids re-inventing the wheel and broadens the scope of flow optimization proposals.
  • Successful application of the DataflowOpt techniques to all envisaged case studies, yielding performance improvements of several factors and satisfying the KPI goals.
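To illustrate the metadata-acquisition result above, the required per-task statistics (cost per invocation and selectivity) can in principle be estimated from execution logs. The following is a minimal sketch under assumed inputs; the function name and the log-record fields (`task`, `runtime_sec`, `rows_in`, `rows_out`) are hypothetical and not part of the project's codebase:

```python
from statistics import mean

def estimate_task_metadata(log_records):
    """Estimate per-task cost and selectivity from execution log records.

    Each record is a dict with hypothetical fields:
    task, runtime_sec, rows_in, rows_out.
    """
    per_task = {}
    for rec in log_records:
        per_task.setdefault(rec["task"], []).append(rec)

    metadata = {}
    for task, recs in per_task.items():
        metadata[task] = {
            # average cost per invocation, in seconds
            "cost": mean(r["runtime_sec"] for r in recs),
            # selectivity: fraction of input rows that survive the task
            "selectivity": sum(r["rows_out"] for r in recs)
                           / max(1, sum(r["rows_in"] for r in recs)),
        }
    return metadata

logs = [
    {"task": "filter_age", "runtime_sec": 2.0, "rows_in": 1000, "rows_out": 400},
    {"task": "filter_age", "runtime_sec": 2.2, "rows_in": 1000, "rows_out": 420},
    {"task": "join_geo",   "runtime_sec": 5.0, "rows_in": 820,  "rows_out": 820},
]
meta = estimate_task_metadata(logs)
print(meta["filter_age"])
```

In practice the project targets semi-automated acquisition, so estimates like these would be refined with designer-provided dependency constraints rather than used blindly.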

Target Outcomes

  1. Novel algorithms: the first main result category consists of algorithms, which are further divided into two classes. (A) DAG optimization techniques, which will cover the combinations of at least 3 KPIs and 2 optimization types, 1 higher-level (task re-ordering) and 1 lower-level (exact implementation and location decisions among several alternatives). The 3 KPIs will refer to resource consumption, running time in a parallel environment, and a combination of monetary cost and running time. All algorithmic solutions will be of low polynomial complexity in the number of DAG vertices and the number of alternatives (the brute-force solutions are intractable). (B) Metadata acquisition, which aims to provide a systematic way to extract the statistical and dependency information required by the algorithms in (A). A key characteristic of the proposed solutions is that they are transferred from advanced data management to business process optimization.
  2. Complete system prototype: we aim to incorporate the solutions into a real system, examining the alternatives of Spark and Pentaho Data Integration (PDI). PDI also supports executing parts of the flow on Spark and is thus our first option. Both systems to be extended are open-source, and the prototype will be open-source as well.
  3. Thorough evaluation: the effectiveness of the solutions will be verified through extensive experimentation, using established benchmarks such as TPC-DI, TPC-DS and BigBench, as well as business process flows.
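To make the higher-level (task re-ordering) optimization of outcome 1 concrete, the sketch below enumerates the orderings of a small task set that respect given dependency constraints and selects the one minimizing a standard selectivity-based cost model, in which each task's cost is scaled by the fraction of tuples surviving the tasks before it. This brute-force enumeration is for illustration only (the project targets low-polynomial algorithms), and all task names and statistics are hypothetical:

```python
from itertools import permutations

# Hypothetical per-task statistics: cost per unit of input and selectivity.
TASKS = {
    "A": {"cost": 1.0, "selectivity": 0.5},
    "B": {"cost": 4.0, "selectivity": 0.9},
    "C": {"cost": 2.0, "selectivity": 0.2},
}
# Dependency (precedence) constraints: A must run before B.
PRECEDES = {("A", "B")}

def respects_constraints(order):
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in PRECEDES)

def plan_cost(order):
    # Each task processes only the tuples surviving all earlier tasks.
    total, surviving = 0.0, 1.0
    for t in order:
        total += surviving * TASKS[t]["cost"]
        surviving *= TASKS[t]["selectivity"]
    return total

def best_ordering():
    valid = (p for p in permutations(TASKS) if respects_constraints(p))
    return min(valid, key=plan_cost)

order = best_ordering()
print(order, plan_cost(order))
```

Note how the highly selective task C is pulled as early as the constraints allow, shrinking the input of the expensive task B; this is the intuition that the project's polynomial-time re-ordering algorithms exploit at scale.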


1. Kougka, G., Gounaris, A. & Simitsis, A.: The many faces of data-centric workflow optimization: a survey. Int J Data Sci Anal 6, 81–107 (2018). (The proposal was partially based on this publication.)


Anastasios Gounaris, Assistant Professor
Kostas Tsichlas, Assistant Professor
Georgia Kougka, Post-Doc Researcher


Project Coordinator
Anastasios Gounaris, Assistant Professor


Title: DataflowOpt

Project No: HFRI-FM17-1052

Duration: -

Funded under: HFRI