DataflowOpt

The overarching aim of DataflowOpt is to capitalize on the investigators' significant preliminary results in dataflow optimization and to reshape the way dataflow optimization is approached, addressing the significant limitations from which state-of-the-art solutions suffer.

The broader technical goal is to make cost-based automated dataflow optimization functionality available to designers with a view to facilitating and strengthening the adoption of advanced analytics by a much broader community in practice, from SMEs to big scientific laboratories.

An additional goal of this proposal is to avoid re-inventing the wheel: by transferring techniques from dataflow optimization, it aims to provide concrete proof-of-concept technical results on the confluence of Business Process Management (BPM) and big data management methods for process re-engineering.

The expected key results of the project include:

  • Extensions to current optimization algorithms, so that they can work efficiently in combination.
  • Development of cost models that better reflect execution time, resource consumption, throughput, monetary cost and latency in both dataflows and business processes, so that they can describe flow execution as well as drive optimizations.
  • Development of efficient semi-automated techniques for acquiring the necessary metadata, with a focus on task cost per invocation, selectivity and dependency constraints.
  • Concrete proof-of-concept regarding the value in blending together dataflow and business process optimization in a way that avoids re-inventing the wheel and broadens the scope of flow optimization proposals.
  • Successful application of the DataflowOpt techniques to all case studies, which is envisaged to yield performance improvements of several factors and to satisfy the KPI goals.
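
To illustrate how the metadata named above (task cost per invocation and selectivity) can feed a cost model, here is a minimal sketch. It assumes a linear flow and a simple sum-cost metric in which each task's cost is weighted by the fraction of source records that reach it. The `Task` class, the `flow_cost` function, the task names and the numbers are all hypothetical and chosen only for illustration; they are not the project's cost models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cost: float         # assumed cost per input record (arbitrary time unit)
    selectivity: float  # assumed ratio of output records to input records


def flow_cost(tasks):
    """Return the sum-cost of a linear flow per source record.

    Each task is charged its per-record cost multiplied by the product
    of the selectivities of all tasks that run before it.
    """
    total = 0.0
    fraction = 1.0  # fraction of source records reaching the current task
    for task in tasks:
        total += fraction * task.cost
        fraction *= task.selectivity
    return total


# Hypothetical three-task flow: an expensive enrichment and a cheap filter.
flow = [
    Task("parse", cost=1.0, selectivity=1.0),
    Task("enrich", cost=5.0, selectivity=1.0),
    Task("filter", cost=0.5, selectivity=0.1),
]

print(flow_cost(flow))                         # 6.5
print(flow_cost([flow[0], flow[2], flow[1]]))  # 2.0
```

In this toy example, moving the selective filter ahead of the expensive enrichment reduces the estimated cost from 6.5 to 2.0 units per source record, which is the kind of benefit that cost-based task re-ordering targets.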

Target Outcomes

  1. Novel algorithms: the first main result category consists of algorithms, which are divided into two classes. (A) Optimization techniques for flows modeled as directed acyclic graphs (DAGs), which will cover combinations of at least three KPIs and two optimization types: one higher-level (task re-ordering) and one lower-level (choice of the exact implementation and execution location among several alternatives). The three KPIs are resource consumption, running time in a parallel environment, and a combination of monetary cost and running time. All algorithmic solutions will be of low polynomial complexity in the number of DAG vertices and the number of alternatives, since brute-force solutions are intractable. (B) Metadata acquisition, which aims to provide a systematic way to extract the statistical and dependency information required by the algorithms in (A). A key characteristic of the proposed solutions is that they are transferred from advanced data management to business process optimization.
  2. Complete system prototype: we aim to incorporate the solutions into a real system, with Spark and Pentaho Data Integration (PDI) as the alternatives under examination. PDI also supports flows in which parts run on Spark, and is thus our first option. Both candidate systems are open source, and the prototype will be open source as well.
  3. Thorough evaluation: the effectiveness of the solutions will be verified through extensive experimentation, using established benchmarks such as TPC-DI, TPC-DS and BigBench, as well as business process flows.
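
The higher-level optimization type mentioned in item 1 (task re-ordering under dependency constraints) can be sketched as follows. This is a simple greedy heuristic written for illustration, not one of the project's algorithms: at each step it picks, among the tasks whose prerequisites have already been placed, the one with the highest rank (1 - selectivity) / cost. The function name `greedy_reorder`, the input layout and the sample values are assumptions, and the heuristic is not guaranteed to find the optimal order in general.

```python
def greedy_reorder(tasks, must_precede):
    """Order tasks greedily while respecting dependency constraints.

    tasks: dict mapping task name -> (cost, selectivity), with cost > 0.
    must_precede: set of (a, b) pairs meaning task a must run before task b.
    Runs in time polynomial in the number of tasks and constraints.
    """
    def rank(name):
        cost, selectivity = tasks[name]
        return (1.0 - selectivity) / cost

    remaining = set(tasks)
    order = []
    while remaining:
        # A task is ready when none of its prerequisites is still unplaced.
        ready = [
            t for t in remaining
            if all(a not in remaining for (a, b) in must_precede if b == t)
        ]
        if not ready:
            raise ValueError("dependency constraints contain a cycle")
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(ready), key=rank)
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical metadata: (cost per record, selectivity) for each task.
tasks = {
    "parse": (1.0, 1.0),
    "enrich": (5.0, 1.0),
    "filter": (0.5, 0.1),
}
# Both downstream tasks need parsed records.
must_precede = {("parse", "enrich"), ("parse", "filter")}

print(greedy_reorder(tasks, must_precede))  # ['parse', 'filter', 'enrich']
```

The sketch needs exactly the metadata that class (B) is meant to supply: per-task cost, selectivity and the dependency constraints between tasks.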

Dissemination material regarding the project is listed at: https://www.elidek.gr/ereynitika-erga-melon-dep-ereyniton-trion/meli-dep/e-p-3-mathimatika-kai-epistimi-tis-pliroforias/.

Publications

1. Georgia Kougka, Anastasios Gounaris, Alkis Simitsis: The many faces of data-centric workflow optimization: a survey. Int. J. Data Sci. Anal. 6(2): 81-107 (2018)
(The proposal was partially based on this publication.)

2. Georgia Kougka, Anastasios Gounaris: Optimization of data flow execution in a parallel environment. Distributed Parallel Databases 37(3): 385-410 (2019)
(The proposal was partially based on this publication.)

3. Georgia Kougka, Konstantinos Varvoutas, Anastasios Gounaris, George Tsakalidis, Kostas Vergidis: On Knowledge Transfer from Cost-Based Optimization of Data-Centric Workflows to Business Process Redesign. Trans. Large Scale Data Knowl. Centered Syst. 43: 62-85 (2020)
(Produced in the period between the proposal preparation and the project kick-off.)

4. Anna-Valentini Michailidou, Anastasios Gounaris: Bi-objective Traffic Optimization in Geo-distributed Data Flows. Big Data Res. 16: 36-48 (2019)
(Produced in the period between the proposal preparation and the project kick-off.)

5. Anna-Valentini Michailidou, Anastasios Gounaris: A fast solution for bi-objective traffic minimization in geo-distributed data flows. IDEAS 2019: 27:1-27:10
(Produced in the period between the proposal preparation and the project kick-off.)

6. Ioannis Mavroudopoulos, Anastasios Gounaris: Detecting Temporal Anomalies in Business Processes Using Distance-Based Methods. DS 2020: 615-629

7. Konstantinos Varvoutas, Anastasios Gounaris: Evaluation of Heuristics for Product Data Models. Business Process Management Workshops 2020: 355-366

8. Ioannis Mavroudopoulos, Theodoros Toliopoulos, Christos Bellas, Andreas Kosmatopoulos, Anastasios Gounaris: Sequence detection in event log files. EDBT 2021: 85-96

9. Konstantinos Varvoutas, Anastasios Gounaris, Georgia Kougka: Mapping DMN to PDM to enable optimizations. BICOD 2021: 1-9

10. Anna-Valentini Michailidou, Anastasios Gounaris, Moysis Symeonides, Demetris Trihinas: Equality: Quality-aware intensive analytics on the edge. Information Systems 105: 101953 (2022)

11. Ioannis Mavroudopoulos, Anastasios Gounaris: A comparison of proximity-based methods for detecting temporal anomalies in business processes. Machine Learning (2022)

12. Konstantinos Varvoutas, Georgia Kougka, Anastasios Gounaris: Optimizing business processes through parallel task execution. MEDES 2022

13. Ioannis Mavroudopoulos, Anastasios Gounaris: SIESTA: A Scalable InfrastructurE of Sequential paTtern Analysis. IEEE Trans. on Big Data (2023, to appear)

14. Anastasios Gounaris, Anna-Valentini Michailidou, Schahram Dustdar: Toward building edge learning pipelines. IEEE Internet Computing (2023, to appear)

Repositories

The project team provides open-source implementations in the following repositories:

Team

Anastasios Gounaris, Assistant Professor
Kostas Tsichlas, Assistant Professor
Georgia Kougka, Post-Doc Researcher
Anna-Valentini Michailidou, PhD Student
Ioannis Mavroudopoulos, PhD Student
Konstantinos Varvoutas, PhD Student

Contact

Project Coordinator
Anastasios Gounaris, Assistant Professor
gounaria@csd.auth.gr

Title: DataflowOpt

Project No: HFRI-FM17-1052

Duration: -

Funded under: HFRI