How do I increase Hive concurrent mappers beyond 4?

    Summary

    When I run a simple select count(*) from table query in hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.

    Details

    I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running hdfs and Hive 1.2.1 (IBM-12).

    I have a table of several billion rows. When I perform a simple

    select count(*) from mytable;

    hive creates hundreds of map tasks, but only 4 are running simultaneously.

    This means that my cluster is mostly idle during the query, which seems wasteful. I have tried ssh'ing to the nodes in use and they are not utilizing CPU or memory fully. Our cluster is backed by Infiniband networking and Isilon file storage, neither of which seems very loaded at all.

    We are using mapreduce as the engine. I have tried removing any limits to resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).

    The memory settings are as follows:

    yarn.nodemanager.resource.memory-mb     188928 MB
    yarn.scheduler.minimum-allocation-mb     20992 MB
    yarn.scheduler.maximum-allocation-mb    188928 MB
    yarn.app.mapreduce.am.resource.mb        20992 MB
    mapreduce.map.memory.mb                  20992 MB
    mapreduce.reduce.memory.mb               20992 MB

    and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
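
    As a rough sanity check of that arithmetic (a sketch only, assuming memory is the sole constraint on container allocation and ignoring the ApplicationMaster container):

        # Back-of-the-envelope ceiling implied by the memory settings above,
        # assuming memory is the only constraint on container allocation.
        nodes = 41
        node_memory_mb = 188928        # yarn.nodemanager.resource.memory-mb
        container_memory_mb = 20992    # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb

        containers_per_node = node_memory_mb // container_memory_mb   # 9
        total_containers = nodes * containers_per_node                 # 369
        print(containers_per_node, total_containers)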

    Vcore settings:

    yarn.nodemanager.resource.cpu-vcores         24
    yarn.scheduler.minimum-allocation-vcores      1
    yarn.scheduler.maximum-allocation-vcores     24
    yarn.app.mapreduce.am.resource.cpu-vcores     1
    mapreduce.map.cpu.vcores                      1
    mapreduce.reduce.cpu.vcores                   1
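
    These vcore settings do not look like the limit: with 1 vcore per task and 24 vcores per NodeManager, CPU alone would allow far more concurrent containers per node than the memory figures above. A rough comparison of the two per-node ceilings (assuming the scheduler enforces both dimensions; by default YARN's capacity scheduler accounts for memory only):

        # Per-node container ceilings implied by each resource dimension.
        memory_ceiling = 188928 // 20992   # 9 containers per node by memory
        vcore_ceiling = 24 // 1            # 24 containers per node by vcores
        print(min(memory_ceiling, vcore_ceiling))   # memory is the tighter bound here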

    • Is there a way to get hive/mapreduce to use more of my cluster?
    • How would I go about figuring out the bottleneck?
    • Could it be that Yarn is not assigning tasks fast enough?

    I guess that using tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed at the moment).

    Solution

    Running parallel tasks depends on your memory settings in YARN. For example, if you have 4 data nodes and your YARN memory properties are defined as below:

    yarn.nodemanager.resource.memory-mb     1 GB
    yarn.scheduler.minimum-allocation-mb    1 GB
    yarn.scheduler.maximum-allocation-mb    1 GB
    yarn.app.mapreduce.am.resource.mb       1 GB
    mapreduce.map.memory.mb                 1 GB
    mapreduce.reduce.memory.mb              1 GB

    According to these settings, you have 4 data nodes, so the total yarn.nodemanager.resource.memory-mb available for launching containers is 4 GB. Since each container takes 1 GB of memory, at any given point in time you can launch 4 containers. One of them will be used by the application master, so you can run at most 3 mapper or reducer tasks at any given time, because the application master, each mapper and each reducer all use 1 GB of memory.
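
    A minimal sketch of that reasoning, using the hypothetical 4-node / 1 GB figures above:

        # Concurrent task ceiling for the hypothetical example: 4 nodes with 1 GB
        # per NodeManager, and every container (AM, mapper, reducer) requesting 1 GB.
        nodes = 4
        node_memory_gb = 1        # yarn.nodemanager.resource.memory-mb per node
        container_memory_gb = 1   # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb

        total_containers = nodes * (node_memory_gb // container_memory_gb)  # 4
        max_concurrent_tasks = total_containers - 1   # one container goes to the application master
        print(max_concurrent_tasks)                   # 3

    By the same formula, raising the per-node memory to, say, 4 GB while keeping 1 GB containers would allow 4 * 4 - 1 = 15 concurrent tasks, which is the point of the advice below.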

    So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks.

    P.S. - Here we are talking about the maximum number of tasks that can be launched; in practice it may be somewhat less than that.
