Data Retrieval Throughput - ETS Lookup vs Inter-process Messaging

Updated: 2024-06-14 17:01:34


Suppose we have an Erlang application which involves thousands of processes. Suppose there is a single resource X, which may be a tuple, a list, or any Erlang term, which all these processes may need to read from, or pick something out of, at any moment in time. An example of such a situation is an API system in which client processes need to read and write on a remote machine, and it happens that you do not want a new connection to be created for each read/write request. So what you do is create a pool of connections, treating them as a pool of open pipes/sockets/channels.

Now, this pool of resources is to be shared by thousands of processes, such that for each read or write demand, you want the process to retrieve any available open channel/resource. The question is: what if I have a single process hold this information, whether in its process dictionary or in its receive loop? It would mean that all the processes would have to send a message to this process whenever they need a free resource. Because of the high demand for this single resource, that one process could have a huge mailbox at any time.

Alternatively, I could use an ETS table with only one row, say #resources{key=pool, value=List_of_openSockets_or_channels}. But this would mean that all our processes would attempt to read the same row from the ETS table at (with high probability) the same instant. How would the ETS table cope if 10,000 processes attempt to read the same row/record at the same time, or at almost the same time? And conversely, if I use a process and its mailbox, what happens when 10,000 processes send it a message at the same time for the same resource (and it needs to reply to each requester)? Remember that this action may occur very frequently.

Which option (disregarding availability issues such as the process going down) would provide higher throughput, so that processes get what they need faster?
Is there any other, better way of handling high-demand data structures in the Erlang VM, one that provides very fast access to millions of processes even if they all need that resource at the same time?
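The single-process design described in the question can be sketched like this (the module name pool_proc and the round-robin hand-out policy are illustrative assumptions, not part of the question):

```erlang
-module(pool_proc).
-export([start/1, get_channel/0, loop/1]).

%% Spawn one registered process that owns the whole pool.
start(Channels) ->
    register(?MODULE, spawn(?MODULE, loop, [Channels])).

%% Every client funnels through this single mailbox.
get_channel() ->
    ?MODULE ! {self(), get},
    receive
        {channel, C} -> C
    end.

%% Round-robin: hand out the head, rotate it to the back.
loop([C | Rest]) ->
    receive
        {From, get} ->
            From ! {channel, C},
            loop(Rest ++ [C])
    end.
```

With 10,000 concurrent callers, every request serializes through this one receive loop, which is exactly the mailbox bottleneck the question is worried about.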

Accepted Answer


Short answer: profile. Try different approaches and verify how your system behaves.

Firstly, I would look at ETS' {read_concurrency, true} option. From the documentation:

{read_concurrency,boolean()} Performance tuning. Default is false. When set to true, the table is optimized for concurrent read operations. When this option is enabled on a runtime system with SMP support, read operations become much cheaper; especially on systems with multiple physical processors. However, switching between read and write operations becomes more expensive. You typically want to enable this option when concurrent read operations are much more frequent than write operations, or when concurrent reads and writes comes in large read and write bursts (i.e., lots of reads not interrupted by writes, and lots of writes not interrupted by reads). You typically do not want to enable this option when the common access pattern is a few read operations interleaved with a few write operations repeatedly. In this case you will get a performance degradation by enabling this option. The read_concurrency option can be combined with the write_concurrency option. You typically want to combine these when large concurrent read bursts and large concurrent write bursts are common.
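As a concrete sketch, the one-row pool from the question could be created with that option. The table name resources, the key pool, and the sock1/sock2/sock3 placeholders are assumptions for illustration:

```erlang
%% One public named table, tuned for many concurrent readers;
%% writes (replacing the pool list) are assumed to be rare.
_Tab = ets:new(resources, [set, public, named_table,
                           {read_concurrency, true}]),
true = ets:insert(resources, {pool, [sock1, sock2, sock3]}),

%% Any of the 10,000 client processes can then read the row directly,
%% without going through a single process's mailbox:
[{pool, Channels}] = ets:lookup(resources, pool).
```

Because reads go straight to the table rather than through one process, they can proceed in parallel on all schedulers.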

Secondly, I would look at caching possibilities. Are the processes reading that information only once or multiple times? If they're accessing it multiple times, you could read it once and store it in your process state.
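A minimal sketch of that caching idea: the table name resources and key pool follow the record sketched in the question, while the module name cached_client is made up. The process reads the pool once and thereafter serves it from its own loop state, never touching ETS again:

```erlang
-module(cached_client).
-export([start/0, loop/1]).

start() ->
    spawn(?MODULE, loop, [undefined]).

%% First pass forces one ETS read; afterwards the pool lives in
%% the loop argument and ETS is never consulted again.
loop(undefined) ->
    [{pool, Channels}] = ets:lookup(resources, pool),
    loop(Channels);
loop(Channels) ->
    receive
        {From, get_cached} ->
            From ! {cached, Channels},
            loop(Channels)
    end.
```

The trade-off is staleness: if the pool changes, cached copies must be invalidated, for example by broadcasting an update message to the clients.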

Thirdly, you could try to replicate and distribute that piece of information across your system. Divide et impera.
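One hedged sketch of that idea within a single node: shard the hot row into N identical copies so that concurrent readers spread across different keys. The module name shard_pool, the shard count of 8, and hashing on the caller's pid are all assumptions, not an established pattern from the answer:

```erlang
-module(shard_pool).
-export([init_shards/1, get_pool/0]).

-define(SHARDS, 8).

%% Write the same pool under ?SHARDS distinct keys.
init_shards(AllChannels) ->
    _ = ets:new(resources, [set, public, named_table,
                            {read_concurrency, true}]),
    lists:foreach(
      fun(I) -> ets:insert(resources, {{pool, I}, AllChannels}) end,
      lists:seq(0, ?SHARDS - 1)).

%% Each process deterministically picks its own shard,
%% so 10,000 readers no longer all hit one key.
get_pool() ->
    Shard = erlang:phash2(self(), ?SHARDS),
    [{_, Channels}] = ets:lookup(resources, {pool, Shard}),
    Channels.
```

Across nodes, the same principle becomes replicating the pool onto every node so that lookups stay local instead of crossing the network.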

Published: 2023-04-20 18:40:00