Chapter 4

Best Practices

ChistaDATA ClickHouse Cloud Best Practices

Have access control at the Workspace level - To ensure secure access to ClickHouse, it is recommended to manage access at the workspace level. Assigning admin roles appropriately across the organization and workspace, and granting read-only access where that is all a user needs, helps maintain data integrity and prevent unauthorized access. By following these practices, access to the portal can be managed effectively and securely.
Try out the FREE tier plan - ChistaDATA provides a 90-day free trial of the ChistaDATA Cloud. This allows users to test the platform and its features before committing to a paid subscription. With this trial, users can explore the portal’s capabilities and determine whether it meets their needs for data management and analysis. It’s a great opportunity for those who want to get hands-on experience with ClickHouse before making a financial commitment.

For beginners, start with sample data - If you’re new to ClickHouse and want to experiment with it, you can enable the “load dummy data” option when creating a cluster in the portal. This provides enough data to play around with and test different functionalities. Sample queries are also available in the portal documentation to help you get started. This is a helpful way to gain experience and build your understanding of the platform.
Limit database access to authorized users - To ensure security in the ChistaDATA Cloud for ClickHouse, only authorized users should hold the minimum grants required to access databases and tables. This can be achieved by implementing a permission system in which user accounts are assigned specific roles and grants that restrict their access to sensitive data and functionality.
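
As a rough illustration using ClickHouse’s SQL-based access control (the database, table, role, and user names here are hypothetical):

    -- Create a role that carries only the privileges an analyst needs
    CREATE ROLE analyst_ro;
    GRANT SELECT ON analytics.events TO analyst_ro;

    -- Create a user and attach the role; no other grants are given
    CREATE USER jane IDENTIFIED WITH sha256_password BY 'S3curePassw0rd!';
    GRANT analyst_ro TO jane;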

Strong BC/DR - To ensure your data can be recovered in a disaster scenario, it is highly recommended to take regular backups of your ClickHouse databases. The ClickHouse portal provides both on-demand and scheduled backup functions, making it easy to back up your data. By implementing a regular backup strategy, you can protect your critical data and have confidence that it can be restored quickly and efficiently in the event of a disaster.

Implement multiple sources of Authentication - ChistaDATA provides robust and secure authentication methods to enhance the security of the ClickHouse portal. Users can use their GitHub authentication to log in, which can simplify the login process and eliminate the need to create and remember separate login credentials. Additionally, two-factor authentication can be enabled for the portal, which provides an additional layer of security by requiring a unique verification code in addition to the standard login credentials.

Explain Plan - It is recommended to analyze the query optimization plan using the EXPLAIN statement available on the ClickHouse portal. EXPLAIN provides a detailed breakdown of how ClickHouse processes a query and helps users identify potential performance bottlenecks, so queries can be rewritten to improve performance and reduce execution time. It is an invaluable tool for analyzing query performance and identifying opportunities for optimization in ClickHouse.
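
For example, a query plan and its execution pipeline can be inspected as follows (the table is a hypothetical example):

    -- High-level logical plan of the query
    EXPLAIN PLAN
    SELECT user_id, count() FROM analytics.events WHERE event_date = today() GROUP BY user_id;

    -- Lower-level execution pipeline, useful for spotting expensive steps
    EXPLAIN PIPELINE
    SELECT user_id, count() FROM analytics.events WHERE event_date = today() GROUP BY user_id;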

Wisely select Primary Key / Sorting Key - Selecting the primary key is critical when designing the data schema, as it impacts the overall performance of the database. An efficient and relevant primary key improves query performance and reduces execution time. Ensure that columns frequently used in the WHERE clause are part of the primary key / sorting key. Additionally, the primary key / sorting key should not be excessively long, to optimize performance.
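
A minimal sketch of a table whose sorting key matches the most common filters (the schema is an assumption for illustration):

    CREATE TABLE analytics.events
    (
        event_date  Date,
        user_id     UInt64,
        event_type  LowCardinality(String),
        payload     String
    )
    ENGINE = MergeTree
    -- Queries usually filter on event_date and user_id, so they lead the key;
    -- the key is kept short instead of listing every column
    ORDER BY (event_date, user_id);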

Choose the appropriate Table Engine - It is essential to evaluate your data needs, workload characteristics, and business requirements to choose the right table engine for your use case. The engine you choose has a direct impact on performance and on how well the system meets those requirements.
For example, if you need optimal performance for writes and reads, MergeTree is a good choice. On the other hand, ReplacingMergeTree is useful for the deduplication of data. To achieve durability, you can select ReplicatedMergeTree, which ensures data consistency and availability even in the event of node failures or network disruptions.
If scalability is your primary concern, the Distributed engine allows data to be distributed across multiple servers and nodes, providing horizontal scalability. By carefully selecting the appropriate table engine, you can optimize the performance, scalability, durability, and functionality of your system to meet your specific needs.
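
For instance, a deduplicating table could be sketched with ReplacingMergeTree, keeping only the latest row per key based on a version column (the schema is illustrative):

    CREATE TABLE analytics.user_profiles
    (
        user_id    UInt64,
        email      String,
        updated_at DateTime,
        version    UInt64
    )
    -- During merges, only the row with the highest `version` per user_id is kept
    ENGINE = ReplacingMergeTree(version)
    ORDER BY user_id;

Note that deduplication happens at merge time, so duplicate rows may remain visible until the relevant parts are merged.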

Choose the right Data Type - Select the appropriate data type for each column in your table to reduce storage and memory usage. Choosing the right data type is critical in ClickHouse as it can impact storage requirements, query performance, data integrity, compatibility with other systems, and future-proofing. By selecting the appropriate data type for your data, you can optimize performance and ensure that your data is accurate, efficient, and easily accessible.
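
A small illustration of right-sizing column types (the schema is a made-up example):

    CREATE TABLE shop.orders
    (
        order_id   UInt64,                  -- numeric ID instead of a String
        status     LowCardinality(String),  -- few distinct values, dictionary-encoded
        quantity   UInt16,                  -- small value range, no need for Int64
        price      Decimal(12, 2),          -- exact arithmetic for money
        created_at DateTime                 -- a real timestamp, not a String
    )
    ENGINE = MergeTree
    ORDER BY (created_at, order_id);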

Use Partitioning - Partitioning is a crucial feature in ClickHouse that can significantly improve query performance, scalability, and data management. By properly partitioning your data, you can take advantage of the full potential of ClickHouse and achieve fast and efficient analytics.
ClickHouse partitions data by an arbitrary expression in the PARTITION BY clause, which can be used to implement range-style, hash-based, or list-style partitioning. Each approach has its own benefits and use cases: range-style partitioning (for example, by month) is useful for time-series data, while hash-based partitioning helps spread data evenly across partitions.
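
A typical time-series layout partitions by month, as a sketch (table and columns are assumptions):

    CREATE TABLE metrics.cpu_usage
    (
        ts    DateTime,
        host  LowCardinality(String),
        value Float64
    )
    ENGINE = MergeTree
    -- One partition per month: date-filtered queries prune whole partitions,
    -- and old months can be removed cheaply with ALTER TABLE ... DROP PARTITION
    PARTITION BY toYYYYMM(ts)
    ORDER BY (host, ts);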

Take advantage of Compression - Compression is a powerful feature in ClickHouse that offers several benefits, including improved query performance, reduced storage requirements, and enhanced system performance. ClickHouse supports several compression algorithms, including LZ4, ZSTD, Zlib, and Brotli.
The compression ratio for each algorithm depends on various factors such as data type, distribution, and compression level. Higher compression levels yield higher compression ratios, but they may also increase CPU usage. For instance, LZ4 offers fast compression and decompression speeds with lower ratios, while ZSTD and Zlib provide higher ratios but require more CPU. It’s crucial to evaluate each algorithm’s compression ratio, CPU usage, and other factors before selecting the appropriate one for your use case.
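
As a rough sketch, compression codecs can also be set per column in ClickHouse; LZ4 remains the default when no codec is specified (the table is hypothetical):

    CREATE TABLE metrics.cpu_usage_zstd
    (
        ts    DateTime CODEC(Delta, ZSTD(3)),  -- delta-encode timestamps, then ZSTD
        host  LowCardinality(String),          -- default LZ4 compression
        value Float64 CODEC(ZSTD(1))           -- heavier compression for the bulkiest column
    )
    ENGINE = MergeTree
    ORDER BY (host, ts);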

Optimize Queries - To fully leverage the performance capabilities of ClickHouse, it’s important to optimize your queries by carefully selecting the sorting and partitioning key, using filtering and aggregation to reduce the amount of data scanned by queries, and choosing the right compression algorithm, table engine, and data types.
By optimizing your queries, you can significantly reduce query execution time, improve query throughput, and enhance the overall user experience. This can be achieved by following best practices such as proper indexing, avoiding subqueries, and tuning query execution settings.
With ClickHouse’s powerful query profiling tools, you can easily identify bottlenecks and optimize queries accordingly. This will not only improve query performance but also help to reduce hardware costs and improve overall system efficiency. You can also leverage ChistaDATA’s Anansi Profiler to analyze each query’s execution and resource utilization, which can be used to further optimize your system.
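
As a starting point, candidate queries for optimization can often be found in the built-in system.query_log table; a rough example:

    -- Slowest queries in the last hour, with rows read and memory used
    SELECT
        query_duration_ms,
        read_rows,
        formatReadableSize(memory_usage) AS mem,
        substring(query, 1, 80) AS query_head
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND event_time > now() - INTERVAL 1 HOUR
    ORDER BY query_duration_ms DESC
    LIMIT 10;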

Regular Performance Monitoring - Performance monitoring is essential in ClickHouse to ensure that the system is running efficiently and delivering optimal performance. It involves monitoring system metrics with tools like Grafana and Prometheus, profiling queries, keeping an eye on the distributed setup (including coordination services such as ZooKeeper or Consul), and using the built-in system tables and logs to obtain information about the system’s state and performance. These techniques help identify potential bottlenecks and optimize system performance, ensuring the system delivers the best possible results.
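
For example, a few of the built-in system tables can be queried directly (a minimal sketch):

    -- Current values of memory-related server metrics
    SELECT metric, value FROM system.metrics WHERE metric LIKE '%Memory%';

    -- Active data parts per table; a steadily growing count is an early warning sign
    SELECT database, table, count() AS active_parts
    FROM system.parts
    WHERE active
    GROUP BY database, table
    ORDER BY active_parts DESC
    LIMIT 10;
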
Avoid Nullable Columns - Handling NULL values in ClickHouse can have negative impacts on performance and storage. Since NULL values require extra processing time and memory, query performance can be slowed down, particularly with complex queries or large datasets. Additionally, each Nullable column stores a separate null mask alongside the data, which increases storage requirements. To avoid these issues, it’s best to use default values instead of NULL values whenever possible. This will help optimize query performance and reduce storage requirements in ClickHouse.
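
For example, explicit defaults can stand in for NULL (the schema is illustrative):

    CREATE TABLE app.signups
    (
        user_id     UInt64,
        -- Prefer an empty-string default over Nullable(String): no null mask to store and check
        referrer    String DEFAULT '',
        -- Instead of Nullable(UInt32), 0 can represent "unknown"
        campaign_id UInt32 DEFAULT 0
    )
    ENGINE = MergeTree
    ORDER BY user_id;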

Prefer Bulk Insert over single inserts - By default, when inserting data into ClickHouse, a new storage part is created for each insert, which includes the data and relevant metadata. To minimize writes, it’s best to send larger inserts with more data instead of multiple smaller inserts. Ideally, data should be inserted in batches of at least a few thousand rows to optimize write performance. Sending too many individual insert queries per second can overwhelm the background merging process and lead to issues, so it’s important to insert data in large batches to prevent this.
Additionally, bulk inserts are faster than single-row inserts because they reduce the overhead of query preparation and execution, as well as the number of network round trips required. This makes batching the preferred method for inserting data in ClickHouse.
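
A rough illustration of batching rows into a single INSERT instead of issuing one statement per row (table and values are hypothetical):

    -- One INSERT carrying many rows creates a single new part
    INSERT INTO analytics.events (event_date, user_id, event_type) VALUES
        ('2024-01-01', 1, 'click'),
        ('2024-01-01', 2, 'view'),
        ('2024-01-01', 3, 'click');
        -- ...ideally thousands of rows per batch

    -- By contrast, issuing each row as its own INSERT creates one part per row
    -- and quickly overwhelms the background merge process.

If the client cannot batch on its own, ClickHouse’s async_insert setting can buffer small inserts on the server side before a part is written.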

Avoid frequent DELETE/UPDATE - In ClickHouse, it is generally recommended to avoid DELETE and UPDATE operations due to their resource-intensive nature and their impact on query performance. Statements such as ALTER TABLE … DELETE and ALTER TABLE … UPDATE produce new mutated versions of the data parts, which means they trigger a rewrite of entire data parts for all the data inserted before the mutation, resulting in a large number of write requests.
If you still need these operations, it is important to understand how mutations and lightweight deletes behave. Alternatively, ClickHouse offers CollapsingMergeTree and ReplacingMergeTree for upserting and removing individual rows based on versioning. Users can also use TTLs and dropping partitions to remove old data regularly, which helps optimize performance and minimize resource usage.
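
A sketch of the alternatives mentioned above (table, column, and partition names are hypothetical):

    -- Heavy mutation: rewrites every data part that contains matching rows
    ALTER TABLE analytics.events DELETE WHERE user_id = 42;

    -- Lightweight delete: marks rows as deleted and is much cheaper up front
    DELETE FROM analytics.events WHERE user_id = 42;

    -- Age out old data automatically instead of deleting it by hand
    ALTER TABLE analytics.events MODIFY TTL event_date + INTERVAL 90 DAY;

    -- Or drop a whole partition, a cheap metadata-only operation
    -- (assumes the table has a monthly PARTITION BY key)
    ALTER TABLE analytics.events DROP PARTITION '202301';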

Avoid OPTIMIZE FINAL - In ClickHouse, the OPTIMIZE FINAL command is used to merge multiple data parts into a single part, which can improve query performance by reducing the number of parts that need to be scanned. However, this command should be used with caution as it can be resource-intensive and impact query performance.
The main reason to avoid OPTIMIZE FINAL is that, during this process, ClickHouse reads all the data parts, uncompresses them, merges them, compresses the result into a single part, and then rewrites it back to storage, causing heavy CPU and I/O consumption. Note that this optimization rewrites the part even if the data is already merged into a single part.
Instead of using OPTIMIZE FINAL, it is recommended to let ClickHouse run its periodic background merges, which happen by default without impacting query performance.
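
For example, background merge activity can simply be observed rather than forced:

    -- Merges currently running in the background
    SELECT database, table, elapsed, progress, num_parts
    FROM system.merges;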