Architect a High-performance SQL Query Engine in Rust

Database-related areas are traditional testbeds for systems languages, but Rust has not yet shown its unique power there. This talk introduces a high-performance, open-source SQL query engine written from scratch in Rust and its friend C. The architecture and engineering are presented in the context of Rust, and a preliminary state-of-the-art result is demonstrated: the new query engine performs the same sum aggregation roughly six times faster than ClickHouse, a popular open-source OLAP database written in C++.

Presented by

  • Jin Mingjian

    High-performance expert with long-term, systematic thinking about languages and big-data infrastructure; creator of TensorBase, a modern engineering effort in Rust to build a high-performance, cost-effective big-data warehouse.

  • Resources

    Recordings

    Transcript

    Architect a High-performance SQL Query Engine in Rust

    Bard:
    Jin Mingjian uses Rust to enhance
    some database apps' performance
    as he breaks apart
    the state of the art
    to make hashtables and b-trees dance

    Jin:
    I hope to leave time for questions, but if not you can reach out. First, some related work. In Rust there is DataFusion, which uses Apache Arrow as its in-memory data format; the problem is that its architecture is still traditional. Another work, the paper "How to Architect a Query Compiler, Revisited", is paper-only but inspired tensorbase.io. A third project, Weld, uses a dedicated IR for data, but it has abstraction overhead and a deep binding to LLVM. Here is the architecture of TensorBase: a graph of layered language systems, with Rust at the core. I will talk about this today.

    Because this is RustFest I will talk a little more about Rust. Rust is great, and TensorBase benefits from many parts of Rust's ecosystem. One thing I want to say is that TensorBase is a one-person project over several months. In TensorBase we focus on performance in the core and keep the codebase modular; we hope to keep a good signal-to-noise ratio so it stays highly hackable. Cargo is the future of tooling in Rust, and everybody should get familiar with it; it makes for quick tooling. Procedural macros are great for learning Rust, but they have their own problems, as we will see. What I suggest is to use nightly when possible: nightly gives you features like proc_macro_diagnostic for debugging, which is important, and you can see this in the TensorBase source. C interoperability is used heavily in TensorBase thanks to its zero overhead, but it has its own problems, for example resource management and error handling; for the time limit I will skip the details.
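
    A minimal sketch of that zero-overhead C interoperability and the resource-management concern just mentioned, using the real C standard library malloc/free (illustrative only, not TensorBase's actual FFI code): the unsafe calls stay at the boundary and cleanup is tied to Rust's Drop.

    ```rust
    use std::os::raw::c_void;

    extern "C" {
        // Real C standard library functions, linked by default on most platforms.
        fn malloc(size: usize) -> *mut c_void;
        fn free(ptr: *mut c_void);
    }

    // RAII wrapper: the C resource's lifetime is tied to a Rust value.
    struct CBuffer {
        ptr: *mut c_void,
        len: usize,
    }

    impl CBuffer {
        fn new(len: usize) -> Option<CBuffer> {
            // unsafe stays at the FFI boundary; callers remain safe.
            let ptr = unsafe { malloc(len) };
            if ptr.is_null() { None } else { Some(CBuffer { ptr, len }) }
        }
    }

    impl Drop for CBuffer {
        fn drop(&mut self) {
            // Drop guarantees the C-side resource is released exactly once.
            unsafe { free(self.ptr) }
        }
    }

    fn main() {
        let buf = CBuffer::new(4096).expect("malloc failed");
        println!("allocated {} bytes from C", buf.len);
    } // buf dropped here; free() runs automatically
    ```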

    For concurrency, Rust is also great. Fearless concurrency in Rust, right? It is nice for share-nothing thread safety. But it can be a little awkward when memory sharing is needed, and here we list some reasons. The main reason is that Rust lacks a memory model like Java's. Here is an example; let's take a quick look. If we want to implement a singleton like in Java or C++, we may need a lazy lock, which in fact could be avoided if we had a memory model to establish a happens-before relationship between the write and the read. Async/await is just another feature, just a little tweak; the style is orthogonal to performance, so you have to use it correctly, and if you don't, you may harm performance. Lifetimes are an engineering excellence of Rust, but they may make code complex. What I always recommend is to dance with them rather than to evade them, because for a high-performance system you need to think carefully about resource management. If you want a second way, use an arena allocator. We are going a little quick.
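
    A minimal sketch of the lazy-locked singleton just described, using std::sync::OnceLock from modern std (the talk likely used lazy_static or once_cell, which predate it): the first access pays the synchronization cost that establishes the happens-before edge between the initializing write and later reads.

    ```rust
    use std::sync::OnceLock;

    struct Config {
        threads: usize,
    }

    // The lazily initialized global: the first caller pays a synchronization
    // cost that establishes the happens-before edge between the initializing
    // write and every later read; subsequent accesses are cheap.
    static CONFIG: OnceLock<Config> = OnceLock::new();

    fn config() -> &'static Config {
        CONFIG.get_or_init(|| Config {
            threads: std::thread::available_parallelism()
                .map(|n| n.get())
                .unwrap_or(1),
        })
    }

    fn main() {
        println!("threads = {}", config().threads);
    }
    ```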

    We are back to this graph, but I want to point out that the core of TensorBase is a set of well-organized components that interact with the whole Rust ecosystem. OK. Input is just plain SQL, parsed into a parse tree. It is then transformed into a layered IR; the main reason for the layering is that we want to reuse modern low-level compilation. The HIR is made for data-related optimizations which cannot be handled by low-level compilers, and we do relational algebra there as well. I want to point out that some relational algebra can be optimized by the compiler at the HIR level. One interesting choice is that we have unified RA operators: in traditional textbooks you may see many relational algebra operators, but here we unify them into just four, map, union, join, and sort (see the sketch below). Here is a pretty-printed HIR; you can see how the HIR is transformed from the original SQL query. OK.
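
    A toy sketch of what a four-operator unified relational algebra could look like; the type names and shapes here are invented for illustration and are not TensorBase's actual HIR definitions (a leaf scan node is added so plans can bottom out).

    ```rust
    // Expressions are kept deliberately tiny here.
    enum Expr {
        Column(String),
        Literal(i64),
        Add(Box<Expr>, Box<Expr>),
    }

    // The four unified operators from the talk, plus a leaf scan. The comments
    // note which textbook operators each one absorbs.
    enum RaOp {
        // Projection, filtering, and scalar computation folded into one operator.
        Map { input: Box<RaOp>, exprs: Vec<Expr> },
        // Set union / concatenation of compatible inputs.
        Union { inputs: Vec<RaOp> },
        // All join flavors, driven by the join predicate.
        Join { left: Box<RaOp>, right: Box<RaOp>, on: Expr },
        // Ordering (and, with a limit, top-k style operators).
        Sort { input: Box<RaOp>, keys: Vec<Expr> },
        // Leaf: a base table scan.
        Scan { table: String },
    }

    fn main() {
        // Roughly: SELECT a + 1 FROM t ORDER BY a
        let plan = RaOp::Sort {
            input: Box::new(RaOp::Map {
                input: Box::new(RaOp::Scan { table: "t".into() }),
                exprs: vec![Expr::Add(
                    Box::new(Expr::Column("a".into())),
                    Box::new(Expr::Literal(1)),
                )],
            }),
            keys: vec![Expr::Column("a".into())],
        };
        let _ = plan; // a real engine would now lower this HIR further
    }
    ```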

    The core idea here is what I call the sea of pipes, which unifies the data and control-flow dependencies in a graph of pipes. What is a pipe? It is just an operator-fused unit of computation over data. Here is a pipe. You may have heard of the textbook Volcano iterator model, which is slow and inefficient, so in TensorBase we don't use operator-level Volcano-style execution; we depend on the fused pipes instead. The low-level IR is just for platform-related optimizations, for example multi-core execution or codegen. We have a parallelization representation for multi-cores, where map-reduce and fork-join live, and we will talk more about this; and a linearization representation for codegen. We provide human-readable mechanics so you can write the IR in an elegant way. OK. This is the data structure of the late IR. Now we come to the kernel, which is decentralized, self-scheduling, and based on JIT compilation. Compared with the popular centralized scheduler, we use a decentralized one. There are two problems with a centralized scheduler. One, it is a single point: a single point can fail, and a single point limits scalability. Two, for a lightning-fast query there is no time budget to initialize the coordination load; we want top performance. For the compiler, you may compare it with popular JIT engines: it can run on almost everything, CPU or GPU, it is human-debuggable, and compilation is fast enough for OLAP.
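
    A sketch, with invented names, of the difference between the textbook Volcano iterator model and an operator-fused pipe: the former pays a virtual call per tuple, while the latter is one tight loop over a data chunk.

    ```rust
    // Volcano style: every operator pulls one tuple at a time through a call.
    trait TupleIter {
        fn next(&mut self) -> Option<i64>;
    }

    struct Scan<'a> {
        data: &'a [i64],
        pos: usize,
    }

    impl<'a> TupleIter for Scan<'a> {
        fn next(&mut self) -> Option<i64> {
            let v = self.data.get(self.pos).copied();
            self.pos += 1;
            v
        }
    }

    struct Filter<I: TupleIter> {
        input: I,
    }

    impl<I: TupleIter> TupleIter for Filter<I> {
        fn next(&mut self) -> Option<i64> {
            while let Some(v) = self.input.next() {
                if v > 0 {
                    return Some(v);
                }
            }
            None
        }
    }

    // Pipe style: filter and sum fused into one tight loop over the chunk,
    // with no per-tuple dispatch and a shape the compiler can vectorize.
    fn fused_filter_sum(chunk: &[i64]) -> i64 {
        chunk.iter().filter(|&&v| v > 0).sum()
    }

    fn main() {
        let data = [3, -1, 4, -1, 5];
        let mut volcano = Filter { input: Scan { data: &data, pos: 0 } };
        let mut sum = 0;
        while let Some(v) = volcano.next() {
            sum += v;
        }
        assert_eq!(sum, fused_filter_sum(&data));
        println!("sum = {sum}");
    }
    ```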

    Here is a TensorBase-generated kernel that you can compare with the kernel source from the paper. You will see a constant for the number of data partitions, which is just the number of hardware cores in the current socket. The advantage of JIT compilation is that we can embed runtime information in the code, which enables compiler optimizations that cannot be done with AOT compilation. Benchmark time. There is too much information here, so I will just give a simple summary; you can study it later. TensorBase can do the end-to-end query 6 to 10 times faster than C++ OLAP databases. Let me make a few points here. First, Rust is lightning fast even untuned; due to the time limit we have not tuned it yet. Second, C-based JIT compilation is lightning fast, much faster than C++ or Rust compilation, and quite fast enough for OLAP; we can do it in mere seconds. For point two, you can saturate the memory bandwidth of the cores: such a server runs with about 100 gigabytes per second of memory bandwidth, so the query is already memory-bound, and TensorBase needs only about 60 milliseconds to scan over the memory. When we scan over the memory we do the computation to get the result, and that speed means we can push the query close to the theoretical limit. For point three, partial compilation is a way to make compilation time correlate with the size of the hot kernel rather than the total size of the executed code.
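
    The bandwidth arithmetic is straightforward: at roughly 100 GB/s, a 60 ms scan covers about 6 GB of data. Below is a hypothetical Rust analogue of such a fork-join kernel, partitioning the data one chunk per hardware core; the real generated kernels are C, so this is only illustrative.

    ```rust
    use std::thread;

    // Fork-join parallel sum with one data partition per hardware core.
    // Structure and names are illustrative, not TensorBase's generated code.
    fn parallel_sum(data: &[u64]) -> u64 {
        let ncores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        let chunk = data.len().div_ceil(ncores).max(1);
        thread::scope(|s| {
            let handles: Vec<_> = data
                .chunks(chunk)
                .map(|part| s.spawn(move || part.iter().sum::<u64>())) // fork
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).sum() // join
        })
    }

    fn main() {
        let data: Vec<u64> = (0..10_000_000).collect();
        println!("sum = {}", parallel_sum(&data));
    }
    ```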

    One thing I want to mention here is that the overhead is still a little high. Future insights: I want to point you in some directions where we are moving. One is the storage layer. We will have our own storage layer, because the popular storage-compute separation is generally less efficient. Second, the optimizer. Our goal is to make even queries that cannot be optimized run fast. We know the popular approach is a cost-based optimizer, but what we want here is data-driven, low-entropy inference; it is a little new. Third, tiered C compilation. Maybe we want faster codegen: C compilation, or interpretation, can possibly be done in microseconds, so we are considering alternatives to the current JIT compilation choices. Cranelift is a little slow for our needs. OK.

    Scheduling. OK. We may not have enough time, so I just leave some pointers here for you to think about, and we can talk more later. We are nearing the end. Finally, in the next version of TensorBase we will have the main operators on a single table, plus storage layer v1. The biggest difference from the current version is that we will provide compatibility with ClickHouse, including compatibility with the ClickHouse native protocol and on-disk storage. In the next version we also want to continue to support complex aggregations, for example GROUP BY (see the sketch below). Early results show that, compared to ClickHouse with its MergeTree engine, we can get 6-8 times faster. OK. Finally, a recap. First, abstraction overhead is everywhere; we should carefully make trade-offs between performance and features, and sometimes you have to give up a little abstraction if you want more performance. Second, there is a high-performance programming paradigm in Rust: in fact, we do not need to reject unsafe, we just need to control unsafe in a disciplined way. Top-performance OLAP has been achieved for the first time with engineering in Rust, and everything shown can be picked up from the open-source tensorbase.io. OK. Thanks. OK. Any questions? Let's do it quickly, because it is late and the time is limited.
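
    A minimal single-threaded sketch of the hash-based GROUP BY sum aggregation mentioned above; this shows the semantics only and has none of the generated-kernel performance machinery.

    ```rust
    use std::collections::HashMap;

    // Semantics of: SELECT key, SUM(val) FROM t GROUP BY key
    fn group_by_sum(keys: &[u32], vals: &[u64]) -> HashMap<u32, u64> {
        let mut groups: HashMap<u32, u64> = HashMap::new();
        for (&k, &v) in keys.iter().zip(vals) {
            // One hash-table probe per row; the entry API avoids a double lookup.
            *groups.entry(k).or_insert(0) += v;
        }
        groups
    }

    fn main() {
        let keys = [1u32, 2, 1, 3, 2];
        let vals = [10u64, 20, 30, 40, 50];
        for (k, sum) in group_by_sum(&keys, &vals) {
            println!("key {k} -> sum {sum}");
        }
    }
    ```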

    Moderator:
    Jin, thanks for the presentation. We have a question from the chat: are there any Rust paradigms that have been getting in the way of this project, or has it been basically positive?

    Jin:
    I see the question. In fact, we use the Rust paradigm throughout TensorBase. For my so-called high-performance style we have some low-level requirements. Take the singleton: in principle it should not need a lock, but in Rust we need one, and that causes some problems; it is something we could improve in Rust. In fact, what I presented were problems we could improve in Rust. For example, lifetimes are a great idea and concept, but sometimes the compiler is too strict, especially before non-lexical lifetimes were introduced; we have seen the limits of lifetimes and have had to sort out many ways to work around the problems. It is getting better: the community is continuing to improve these problems and the wider ecosystem. Sometimes you need a workaround, but sometimes the workaround is even preferable. For example, sometimes we just do not let lifetimes limit us: we keep the objects alive as long as we want, and when the parse phase for the IR ends, we dispose of the objects and the allocator together (a sketch of this pattern follows below). So it is nice; we don't have to think about it much more. Basically, my experience with Rust is positive, between the engineering, the tooling, and the language semantics we mentioned here; you can find more in the open-source repo.
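
    A minimal sketch of the "dispose the objects and the allocator together" pattern: an index-based arena where IR nodes carry no per-node lifetimes, and everything is freed when the arena drops (illustrative only; TensorBase's actual arena may differ).

    ```rust
    // IR nodes live in the arena and reference each other by index, so there
    // are no per-node lifetimes to fight with; dropping the arena frees every
    // node and the allocator's storage in one shot.
    struct Node {
        op: &'static str,
        children: Vec<NodeId>,
    }

    #[derive(Clone, Copy)]
    struct NodeId(usize);

    #[derive(Default)]
    struct Arena {
        nodes: Vec<Node>,
    }

    impl Arena {
        fn alloc(&mut self, op: &'static str, children: Vec<NodeId>) -> NodeId {
            self.nodes.push(Node { op, children });
            NodeId(self.nodes.len() - 1)
        }

        fn get(&self, id: NodeId) -> &Node {
            &self.nodes[id.0]
        }
    }

    fn main() {
        let mut arena = Arena::default();
        let scan = arena.alloc("scan", vec![]);
        let map = arena.alloc("map", vec![scan]);
        println!(
            "root = {}, child = {}",
            arena.get(map).op,
            arena.get(arena.get(map).children[0]).op
        );
    } // arena dropped here: all nodes disposed together with the allocator
    ```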

    Moderator:
    OK. Thanks for answering. We are running out of time. So thanks again for the great presentation about TensorBase. Thank you.