We are pioneering a novel vector database system designed to handle both high-throughput streaming data and complex similarity searches. This system supports continuous vector data ingestion and real-time processing, addressing the scalability challenges inherent in dynamic and evolving datasets. Our research focuses on optimizing retrieval efficiency for applications demanding rapid vector updates and queries in a streaming context.
We are building a next-generation RAG system that leverages our cutting-edge vector database. The integration with our vector database provides a high-performance foundation for large-scale retrieval tasks, facilitating dynamic data ingestion and updating while improving the retrieval quality in RAG applications.
Our extensive work on stream processing systems has resulted in significant technological advancements. We are open to collaborations for real-world deployments. Our innovations in stream processing techniques include concurrency control, adaptive scheduling, and fine-grained optimizations for handling high-velocity (out-of-order) data streams.
This project concerns the design of novel stream processing systems on modern hardware. One example is MorphStream, which decomposes scheduling strategies into three dimensions and strives to make the right decision along each dimension by analyzing the decision trade-offs under varying workload characteristics. Compared to the state of the art, MorphStream achieves up to 3.4 times higher throughput and 69.1% lower processing latency when handling real-world use cases with complex and dynamically changing workload dependencies.
MorphStream was demonstrated at ICDE 2024. (https://icde2024.github.io/demos.html)
The intra-window join (IaWJ), i.e., joining two input streams over a single window, is a core operation in modern stream processing applications. This paper presents the first comprehensive study on parallelizing the IaWJ on modern multicore architectures. Our follow-up works have been published in ICDE 2024, SIGMOD 2024, and SIGMOD 2025.
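As a rough illustration of the operation itself (not of the parallel algorithms studied in the paper), the following single-threaded sketch joins two streams over one window using a hash table. The tuple layout `(timestamp, key, value)`, the half-open window `[window_start, window_end)`, and the function name `intra_window_join` are assumptions made for this example only.

```python
from collections import defaultdict


def intra_window_join(stream_r, stream_s, window_start, window_end):
    """Equi-join two streams of (timestamp, key, value) tuples over one window.

    Only tuples with window_start <= timestamp < window_end take part.
    Hypothetical helper for illustration; not taken from any published code.
    """
    # Build phase: index the in-window tuples of stream R by join key.
    table = defaultdict(list)
    for ts, key, value in stream_r:
        if window_start <= ts < window_end:
            table[key].append(value)

    # Probe phase: match every in-window tuple of stream S against the index.
    results = []
    for ts, key, value in stream_s:
        if window_start <= ts < window_end:
            for r_value in table.get(key, ()):
                results.append((key, r_value, value))
    return results


r = [(1, "a", "r1"), (2, "b", "r2"), (12, "a", "r3")]
s = [(3, "a", "s1"), (4, "c", "s2"), (5, "b", "s3")]
joined = intra_window_join(r, s, 0, 10)
print(joined)
```

The tuple `(12, "a", "r3")` falls outside the window and is ignored, so only the two in-window key matches are produced. A parallel version would have to decide how to partition the build and probe work across cores, which is the design space the study examines.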
Joint research by 4Paradigm and Nanyang Technological University selected for the top international conference SIGMOD 2024 (https://www.csdn.net/article/2023-12-18/135066223)
Data Stream Clustering (DSC) plays an important role in mining continuous and unlabeled data streams in real-world applications. Over the last decades, numerous DSC algorithms have been proposed with promising clustering accuracy and efficiency. Our study conducts a thorough empirical evaluation of these algorithms. Our follow-up work focuses on designing better DSC algorithms.
The Python API package for Sesame has been released on PyPI at https://pypi.org/project/pysame. Sesame is a scalable stream mining library for modern hardware, written in C++. It currently contains several representative real-world stream clustering algorithms and synthetic algorithms.
Stream Learning (SL) requires models to rapidly adapt to continuous data streams, setting it apart from traditional Continual Learning (CL). Recent SL methods emphasize efficiency by selecting data subsets for training, but they often struggle because they rely on static, rule-based selection algorithms that cannot effectively adapt to the changing importance of data. In this project, we have conducted a series of works on SL, covering topics such as online sentiment analysis and LLM updates.
Data stream compression has attracted much attention recently due to the rise of IoT applications. Thanks to their balance of computational power and energy consumption, asymmetric multicores are widely used in IoT devices. This paper introduces CStream, a novel framework for parallelizing stream compression on asymmetric multicores to minimize energy consumption without violating the user-specified compression latency constraint. Existing works cannot effectively utilize asymmetric multicores for stream compression, primarily due to the non-trivial asymmetric computation and asymmetric communication effects. To this end, CStream is developed with the following two novel designs: 1) fine-grained decomposition, which decomposes a stream compression procedure into multiple fine-grained tasks to better expose the task-core affinities under the asymmetric computation effects; and 2) asymmetry-aware task scheduling, which schedules the decomposed tasks based on a novel cost model to exploit the exposed task-core affinities while considering asymmetric communication effects. To validate our proposal, we evaluate CStream against five competing mechanisms for parallelizing stream compression algorithms on a recent asymmetric multicore processor. Our extensive experiments, based on a benchmark consisting of three algorithms and four datasets, show that CStream reduces energy consumption by up to 53% compared with alternative approaches, without violating the compression latency constraint.
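To make the scheduling problem concrete, here is a toy sketch of the general idea: choose a core type for each fine-grained task so that total energy is minimized while the pipeline latency stays within a limit. It is not CStream's cost model or scheduler. The per-task latency and energy numbers, the flat `comm_cost` penalty charged when adjacent tasks run on different core types, and the exhaustive search are all assumptions made purely for illustration.

```python
from itertools import product

CORE_TYPES = ("big", "little")  # assumed two-class asymmetric multicore


def schedule(tasks, comm_cost, latency_limit):
    """Pick a core type per task, minimizing energy under a latency limit.

    tasks: list of dicts mapping core type -> (latency, energy); hypothetical
    numbers, not measurements. Returns (energy, latency, assignment), or None
    when no assignment meets the limit.
    """
    best = None
    # Exhaustive search is fine for a handful of tasks; it is exponential in
    # the number of tasks and only meant to illustrate the trade-off.
    for assignment in product(CORE_TYPES, repeat=len(tasks)):
        latency = 0.0
        energy = 0.0
        for i, core in enumerate(assignment):
            task_latency, task_energy = tasks[i][core]
            latency += task_latency
            energy += task_energy
            # Assumed communication effect: crossing core types costs time.
            if i > 0 and assignment[i - 1] != core:
                latency += comm_cost
        if latency <= latency_limit and (best is None or energy < best[0]):
            best = (energy, latency, assignment)
    return best


tasks = [
    {"big": (1.0, 4.0), "little": (3.0, 1.5)},
    {"big": (2.0, 8.0), "little": (7.0, 3.0)},
    {"big": (1.0, 4.0), "little": (2.0, 1.0)},
]
best = schedule(tasks, comm_cost=0.5, latency_limit=9.0)
print(best)
```

In this made-up instance, the heavy middle task is placed on a big core to meet the latency limit, while the lighter tasks run on little cores to save energy, even though each switch between core types adds communication delay.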
Some grants were transferred or terminated due to the PI’s move from SUTD to NTU in 2023.
© IntelliStream Research Group.
We are part of the Cluster and Grid Computing Lab at the Huazhong University of Science and Technology (华中科技大学).