Events Calendar

09 Apr
CS Colloquium: Automating Distributed Tiered Storage Management in Cluster Computing--Herodotos Herodotou
Event Type

Lectures, Symposia, Etc.

Topic

Research, Technology

Target Audience

Undergraduate Students, Staff, Faculty, Graduate Students

Website

https://pitt.co1.qualtrics.com/jfe/fo...

University Unit
Department of Computer Science
Hashtag

#cs



Abstract:
Data-intensive platforms such as Hadoop and Spark are routinely used to process massive amounts of data residing on distributed file systems like HDFS. Increasing memory sizes and new hardware technologies (e.g., NVRAM, SSDs) have recently led to the introduction of storage tiering in such settings. However, users are now burdened with the additional complexity of managing the multiple storage tiers and the data residing on them, while trying to optimize their workloads. In this talk, I will present OctopusFS, a novel distributed file system that is aware of heterogeneous storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across the storage tiers and cluster nodes. Smart placement and retrieval policies employ multi-objective optimization techniques to make data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. In addition, redistribution policies employ machine learning to track and predict file access patterns, which are used to decide which data to move up or down the storage tiers, and when, in order to increase system performance. The approach uses incremental learning to dynamically refine the models with new file accesses, allowing them to adapt to workload changes over time. Our extensive evaluation using realistic workloads derived from Facebook and CMU traces compares our approach with several other policies and showcases significant benefits in terms of both workload performance and cluster efficiency.

Bio:
Herodotos Herodotou is an Assistant Professor in the Department of Electrical Engineering, Computer Engineering and Informatics at the Cyprus University of Technology (CUT), where he leads the Data Intensive Computing Research Lab. He received his Ph.D. in Computer Science from Duke University. His Ph.D. dissertation work on building a self-tuning system for big data analytics received the ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention as well as the Outstanding Ph.D. Dissertation Award in Computer Science at Duke. Before joining CUT, he held research positions at Microsoft Research, Yahoo! Labs, and Aster Data. His research interests are in large-scale data processing systems, database systems, and cloud computing. In particular, his work focuses on ease of use, manageability, and automated tuning of both centralized and distributed data-intensive computing systems. In addition, he is interested in applying database techniques in other areas such as maritime informatics, scientific computing, bioinformatics, and social computing. His research work to date has been published in several top scientific conferences and journals, two books, and two book chapters, and he actively participates in multiple European and nationally funded projects.

Host: Constantinos Costa

RSVP: https://pitt.co1.qualtrics.com/jfe/form/SV_3BIKMKsx5Gx7nGC

Friday, April 9 at 2:00 p.m. to 3:00 p.m.

Virtual Event
