Research

Our Research Focus

1. Vector Database with Streaming Capabilities

We are pioneering a novel vector database system designed to handle both high-throughput streaming data and complex similarity searches. This system supports continuous vector data ingestion and real-time processing, addressing the scalability challenges inherent in dynamic and evolving datasets. Our research focuses on optimizing retrieval efficiency for applications demanding rapid vector updates and queries in a streaming context.

[arXiv’24] CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion
[NeurIPS’24] LibAMM: Empirical Insights into Approximate Computing for Accelerating Matrix Multiplication

2. Retrieval-Augmented Generation (RAG)

We are building a next-generation RAG system that leverages our cutting-edge vector database. The integration with our vector database provides a high-performance foundation for large-scale retrieval tasks, facilitating dynamic data ingestion and updating while improving the retrieval quality in RAG applications.

[arXiv’24] Online Continual Knowledge Learning for Language Models
[arXiv’24] StreamPrompt: Learnable Prompt-guided Data Selection for Efficient Stream Learning

3. Stream Processing System

Our extensive work on stream processing systems has resulted in significant technological advancements. We are open to collaborations for real-world deployments. Our innovations in stream processing techniques include concurrency control, adaptive scheduling, and fine-grained optimizations for handling high-velocity (out-of-order) data streams.

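Other ancillary topics: Stream Learning, Stream Data Mining, Transactional Stream Processing, Stream Compression, Stream Window Join, Stream Processing Systems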

Project Highlights

Scalable Stream Processing Systems

Project Description

This project concerns the design of novel stream processing systems on modern hardware. One example is MorphStream, which decomposes scheduling strategies into three dimensions and makes the right decision along each dimension by analyzing the decision trade-offs under varying workload characteristics. Compared to the state of the art, MorphStream achieves up to 3.4 times higher throughput and 69.1% lower processing latency on real-world use cases with complex and dynamically changing workload dependencies.

Publications

[ICDE 2024] MorphStream: Scalable Processing of Transactions over Streams (Demo) - Siqi Xiang, Zhonghao Yang, Shuhao Zhang, Jianjun Zhao, Yancan Mao (2024)
[ICDE 2024] Fast Parallel Recovery for Transactional Stream Processing on Multicores - Jianjun Zhao, Haikun Liu, Shuhao Zhang, Zhuohui Duan, Xiaofei Liao, Hai Jin, Yu Zhang (2024)
[VLDBJ 2023] A survey on transactional stream processing - Zhang, S., Soto, J. & Markl, V. (2023)
[SIGMOD 2023] MorphStream: Adaptive Scheduling for Scalable Transactional Stream Processing on Multicores - Yancan Mao, Jianjun Zhao, Shuhao Zhang, Haikun Liu, Volker Markl (2023)
[ICDE 2020] Towards Concurrent Stateful Stream Processing on Multicore Processors - Shuhao Zhang, Yingjun Wu, Feng Zhang, Bingsheng He (2020)
[USENIX ATC 2020] FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures - Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du (2020)
[SIGMOD 2019] BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures - Shuhao Zhang, Jiong He, Amelie Chi Zhou, Bingsheng He (2019)

Project News

MorphStream is demonstrated at ICDE 2024. (https://icde2024.github.io/demos.html)

Parallel Stream Window Join

Project Description

The intra-window join (IaWJ), i.e., joining two input streams over a single window, is a core operation in modern stream processing applications. Our SIGMOD 2021 paper presents the first comprehensive study on parallelizing the IaWJ on modern multicore architectures. Follow-up work has been published at ICDE 2024, SIGMOD 2024, and SIGMOD 2025.
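To make the IaWJ definition concrete, the following is a minimal single-threaded sketch of a hash-based equi-join over one window. It illustrates the operation only, not the parallel algorithms evaluated in the publications below. The names `Event` and `intra_window_join`, the equality join predicate on `key`, and the half-open window `[start, start + length)` are all assumptions made for this example.

```python
from collections import defaultdict
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical stream tuple: join key, payload, and event timestamp."""
    key: str
    value: int
    ts: int  # event time in an arbitrary unit


def intra_window_join(
    r: List[Event], s: List[Event], start: int, length: int
) -> List[Tuple[Event, Event]]:
    """Join R and S tuples that share a key and fall in [start, start + length)."""
    end = start + length

    # Build phase: index the R tuples that belong to the window by key.
    table = defaultdict(list)
    for e in r:
        if start <= e.ts < end:
            table[e.key].append(e)

    # Probe phase: look up each in-window S tuple in the hash table.
    matches = []
    for e in s:
        if start <= e.ts < end:
            for m in table.get(e.key, []):
                matches.append((m, e))
    return matches


# Illustrative input data (made up for this sketch).
r_stream = [Event("a", 1, 0), Event("b", 2, 3), Event("a", 3, 12)]
s_stream = [Event("a", 10, 5), Event("b", 20, 11), Event("c", 30, 2)]
matches = intra_window_join(r_stream, s_stream, start=0, length=10)
```

With the sample data, only the pair with key "a" at timestamps 0 and 5 falls inside the window on both sides; the other tuples either lie outside the window or have no partner with the same key.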

Publications

[SIGMOD 2025] Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality - Xilin Tang, Feng Zhang, Shuhao Zhang, Yani Liu, Bingsheng He, Xiaoyong Du (2025)
[SIGMOD 2024] PECJ: Stream Window Join on Disorder Data Streams with Proactive Error Compensation - Zeng X, Zhang S, Zhong H, Zhang H, Lu M, Zheng Z, Chen Y (2024)
[ICDE 2023] Scalable Online Interval Join on Modern Multicore Processors in OpenMLDB - H. Zhang, X. Zeng, S. Zhang, X. Liu, M. Lu, Z. Zheng (2023)
[SIGMOD 2021] Parallelizing Intra-Window Join on Multicores: An Experimental Study - Shuhao Zhang, Yancan Mao, Jiong He, Philipp M. Grulich, Steffen Zeuch, Bingsheng He, Richard T. B. Ma, Volker Markl (2021)

Project News

Joint research by 4Paradigm and Nanyang Technological University accepted at the top international conference SIGMOD 2024 (https://www.csdn.net/article/2023-12-18/135066223)

Data Stream Clustering

Project Description

Data Stream Clustering (DSC) plays an important role in mining continuous and unlabeled data streams in real-world applications. Over the last decades, numerous DSC algorithms have been proposed with promising clustering accuracy and efficiency. Our study conducts a thorough empirical evaluation of these algorithms. Our follow-up work concerns the design of better DSC algorithms.

Publications

[arXiv 2024] MOStream: A Modular and Self-Optimizing Data Stream Clustering Algorithm - Zhengru Wang, Xin Wang, Shuhao Zhang (2024)
[SIGMOD 2023] Data Stream Clustering: An In-depth Empirical Study - Xin Wang, Zhengru Wang, Zhenyu Wu, Shuhao Zhang, Xuanhua Shi, Li Lu (2023)

Project News

Our Sesame Python API package has been released on PyPI. Sesame is a scalable stream mining library for modern hardware, written in C++. It currently contains several representative real-world stream clustering algorithms and synthetic algorithms. (https://pypi.org/project/pysame)

Online Continual Learning

Project Description

Stream Learning (SL) requires models to rapidly adapt to continuous data streams, setting it apart from traditional Continual Learning (CL). Recent SL methods emphasize efficiency by selecting data subsets for training, but they often struggle because they rely on static, rule-based selection algorithms that cannot effectively adapt to the changing importance of data. In this project, we have conducted a series of studies on SL, covering topics such as online sentiment analysis and LLM updates.

Publications

[arXiv 2024] StreamPrompt: Learnable Prompt-guided Data Selection for Efficient Stream Learning - Tongjun Shi, Shuhao Zhang (2024)
[EMNLP 2023] SentiStream: A Co-Training Framework for Adaptive Online Sentiment Analysis in Evolving Data Streams - Yuhao Wu, Karthick Sharma, Chun Seah, Shuhao Zhang (2023)
[arXiv 2023] Online Continual Knowledge Learning for Language Models - Yuhao Wu, Tongjun Shi, Karthick Sharma, Chun Wei Seah, Shuhao Zhang (2023)

Project News

Data Stream Compression

Project Description

Data stream compression has attracted much attention recently due to the rise of IoT applications. Thanks to their balance of computational power and energy consumption, asymmetric multicores are widely used in IoT devices. Our work introduces CStream, a novel framework for parallelizing stream compression on asymmetric multicores to minimize energy consumption without violating the user-specified compression latency constraint. Existing works cannot effectively utilize asymmetric multicores for stream compression, primarily due to the non-trivial asymmetric computation and asymmetric communication effects. To this end, CStream is developed with the following two novel designs: 1) fine-grained decomposition, which decomposes a stream compression procedure into multiple fine-grained tasks to better expose the task-core affinities under the asymmetric computation effects; and 2) asymmetry-aware task scheduling, which schedules the decomposed tasks based on a novel cost model to exploit the exposed task-core affinities while considering asymmetric communication effects. To validate our proposal, we evaluate CStream against five competing mechanisms for parallelizing stream compression algorithms on a recent asymmetric multicore processor. Our extensive experiments, based on a benchmark consisting of three algorithms and four datasets, show that CStream reduces energy consumption by up to 53% compared with alternative approaches without violating the compression latency constraint.

Publications

[TKDE 2024] CStream: Parallel Data Stream Compression on Multicore Edge Devices - Xianzhi Zeng, Shuhao Zhang (2024)
[ICDE 2023] Parallelizing Stream Compression for IoT Applications on Asymmetric Multicores - Xianzhi Zeng, Shuhao Zhang (2023)
[DEBS 2023] A Hardware-Conscious Stateful Stream Compression Framework for IoT Applications (Vision) - Xianzhi Zeng, Shuhao Zhang (2023)
[ICDE 2023] CompressStreamDB: Fine-Grained Adaptive Stream Processing without Decompression - Yu Zhang, Feng Zhang, Hourun Li, Shuhao Zhang, Xiaoyong Du (2023)

Project News

Research Grants¹

2023 - 2026 (PI. $100,000) “Parallelized Stateful Coreset Selection in Continuous Data Streams for Enhanced Stream Learning”. Funding from T1 Seed Grant.
2023 - 2026 (PI. $150,000) “Parallel Data Management in Retrieval-based Language Models”. Funding from NTU Start-Up Grants.
2023 - 2026 (Co-PI. $1,300,000) “Real-Time Federated Learning on Data Streams”. Funding from DTC.
2023 - 2026 (PI. $500,000+) “IntelliStream: Towards Highly-Optimized, Ultra-Scalable, Self-adaptive Data Streaming Analytics in the Heterogeneous Multicore IoT Systems”. Funding from Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2.
2022 - 2025 (PI - Transferred. ~$500,000) “A Stream Processing based NFV Platform for 5G on Modern Multicore Processors”. Funding from National Research Foundation, Singapore and Infocomm Media Development Authority under its Future Communications Research & Development Programme.
2022 - 2025 (PI - Transferred. ~$500,000) “Energy-efficient, Scalable, and Reliable Distributed Green Streaming Machine Learning for Edges”. Funding from National Research Foundation, Singapore and Infocomm Media Development Authority under its Future Communications Research & Development Programme.
2023 - 2023 (PI - Transferred. $100,000) “Towards Online Continual Pre-Trained Language Model Maintenance”. Funding from TL@SUTD.
2022 - 2022 (PI - Completed. $67,000) “Online Sentiment Learning of Massive Data Streams”. Funding from TL@SUTD.
2022 - 2025 (PI - Terminated. $80,000) “Revisiting the Algorithms for Clustering Evolving Trajectory Streams”. Funding from SUTD-ZJU (VP).
2021 - 2024 (PI - Terminated. $100,000) “Efficient Intra-Window Join on the Multicore IoT systems”. Funding from SUTD STARTUP RESEARCH GRANT (SRG).

¹ Some grants were transferred or terminated due to the PI’s move from SUTD to NTU in 2023.
