Cassandra，从仅附加表格过滤最新的行(Cassandra, filter latest rows from an append only table)

编程入门行业动态更新时间:2024-10-26 18:21:55

目前我有一个简单的表格如下：

CREATE TABLE datatable (timestamp bigint, value bigint, PRIMARY KEY (timestamp))

这张桌子只是在增长，并且从未被修改过。关键是独特的时间戳。所有查询都是表单的范围查询：

SELECT * from datatable WHERE timestamp > 123456 ALLOW FILTERING

此外，查询只请求插入一小组最新的行。我现在面临的问题是这些查询的表现与表格大小负相关。随着表的增长，即使查询只返回几行，获取响应也需要很长的时间。

你能否建议我应该如何修改表模式以避免性能下降（例如，创建索引或集合集群）？谢谢！

Currently I have a simple table as follows:

CREATE TABLE datatable (timestamp bigint, value bigint, PRIMARY KEY (timestamp))

This table is only growing and never being modified. The key is unique timestamp. All queries are range queries of the form:

SELECT * from datatable WHERE timestamp > 123456 ALLOW FILTERING

Moreover, queries request only a small set of the latest rows inserted. The problem that I have right now is that performance of these queries negatively correlated with the table size. As table grows, it takes significantly longer to get response, even if query returns just a few rows.

Could you advise on how I should modify table schema to avoid performance degradation (e.g., create index or set clustering)? Thanks!

最满意答案

添加一些时间像

CREATE TABLE datatable ( bucket timestamp, time timestamp, value bigint, PRIMARY KEY ((bucket), time) ) WITH CLUSTERING ORDER BY (time DESC);

其中bucket是将日期截断为一天或一周或一个月（可以根据近似摄取率计算出多少，每个分区的体面目标大约为64mb，但这非常灵活），这样您将收集所有行的一段时间在一个分区内非常有效。

每个节点拥有数十亿个分区将显着减缓维修和压缩。分区顺序也是随机的（分区键顺序的murmur3哈希），所以你不能按照顺序执行你的查询之上的操作。

使用上面的代码，你可以从开始时间的桶到当前桶的迭代，而不需要ALLOW FILTERING （你永远不应该在玩具的数据量或测试环境之外使用），结果将按照时间戳。

Add some time bucketing like

CREATE TABLE datatable ( bucket timestamp, time timestamp, value bigint, PRIMARY KEY ((bucket), time) ) WITH CLUSTERING ORDER BY (time DESC);

where bucket is the date truncated to the day or week or month (can figure out how many based on approx ingestion rate, a decent goal is about 64mb per partition but thats very flexible), that way you will collect all the rows for a period within a single partition very efficiently.

Having billions of partitions per node will cause slow down repairs and compactions significantly. Also partitioning order is random (murmur3 hash of the partition key order) so you cannot do things like have your above your query in order.

With the above you can then iterate from the bucket of your start time to the current bucket without ALLOW FILTERING (which you should never ever use outside of toy amounts of data or test environment kinda things) and the results will be in the order of the timestamps.

更多推荐

本文发布于:2023-08-06 20:48:00，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1455430.html