- Learning about Kudu functionality
- Learning about Kudu architecture
- How Kudu helps in Hadoop
In order to scale out to large datasets and large clusters, Kudu splits tables into smaller units called tablets. This splitting can be configured on a per-table basis to be based on hashing, range partitioning, or a combination thereof. This allows the operator to easily trade off between parallelism for analytic workloads and high concurrency for more online ones.
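To make the hash-plus-range combination concrete, here is a minimal illustrative sketch (not Kudu's actual implementation) of how a row could be routed to a tablet: a hash of the key column picks a bucket, and a timestamp column picks a contiguous range. All names here (`NUM_BUCKETS`, `RANGE_SPLITS`, `route`) are hypothetical.

```python
# Illustrative sketch of combined hash + range partitioning.
# Hypothetical names; not Kudu's real routing code.
import hashlib

NUM_BUCKETS = 4            # hash partitions on the key column
RANGE_SPLITS = [100, 200]  # split points for range partitions on a timestamp

def route(key: str, ts: int) -> tuple[int, int]:
    """Return (hash_bucket, range_index) identifying the target tablet."""
    digest = hashlib.md5(key.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % NUM_BUCKETS
    # Count how many split points the timestamp has passed to find its range.
    range_idx = sum(1 for split in RANGE_SPLITS if ts >= split)
    return bucket, range_idx
```

Rows with the same key always hash to the same bucket (good for concurrent point lookups), while the range component keeps time-adjacent rows together (good for scans over a time window), which is the parallelism-versus-concurrency trade-off described above.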
In order to keep your data safe and available at all times, Kudu uses the Raft consensus algorithm to replicate all operations for a given tablet. Raft, like Paxos, ensures that every write is persisted by at least two nodes before responding to the client request, ensuring that no data is ever lost due to a machine failure. When machines do fail, replicas reconfigure themselves within a few seconds to maintain extremely high system availability.
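The key property Raft relies on here is majority (quorum) acknowledgement: a write succeeds only once more than half of the replicas have persisted it, so losing a minority of nodes loses no data. The sketch below illustrates just that counting rule; the `Replica` and `replicate` names are hypothetical, not Kudu's API.

```python
# Toy sketch of majority-quorum write acknowledgement.
# Hypothetical names; illustrates the rule, not Kudu's Raft implementation.

class Replica:
    def __init__(self, healthy: bool = True):
        self.healthy = healthy
        self.log: list[str] = []

    def append(self, op: str) -> bool:
        if not self.healthy:
            return False        # a failed node cannot acknowledge
        self.log.append(op)
        return True

def replicate(op: str, replicas: list[Replica]) -> bool:
    acks = sum(1 for r in replicas if r.append(op))
    return acks > len(replicas) // 2   # succeed only on a strict majority
```

With three replicas, one failed node still leaves a majority of two, so writes keep succeeding; two failures drop below quorum and writes are rejected rather than silently lost.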
The use of majority consensus provides very low tail latencies even when some nodes may be stressed by concurrent workloads such as Spark jobs or heavy Impala queries. But unlike eventually consistent systems, Raft consensus ensures that all replicas come to agreement on the state of the data, and by using a combination of logical and physical clocks, Kudu can offer strict snapshot consistency to clients that demand it.
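The "combination of logical and physical clocks" idea can be sketched as a hybrid clock: each timestamp pairs wall-clock time with a logical counter, so events that land in the same physical instant still receive strictly increasing timestamps, which is what makes consistent snapshot reads possible. This is a toy sketch in the spirit of that design, not Kudu's HybridTime implementation.

```python
# Toy hybrid physical/logical clock: strictly monotonic timestamps
# even when the physical clock does not advance between calls.
# Illustrative only; not Kudu's HybridTime code.
import time

class HybridClock:
    def __init__(self):
        self.last_physical = 0
        self.logical = 0

    def now(self) -> tuple[int, int]:
        physical = time.time_ns() // 1000   # physical component, microseconds
        if physical > self.last_physical:
            self.last_physical = physical
            self.logical = 0                # fresh microsecond: reset counter
        else:
            self.logical += 1               # same instant: bump logical part
        return (self.last_physical, self.logical)
```

Because the returned `(physical, logical)` pairs are totally ordered and never repeat, a reader can pick one timestamp and see a consistent snapshot: everything at or below it, nothing above it.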