How do I increase Hive concurrent mappers beyond 4?

    Summary

    When I run a simple select count(*) from table query in hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.

    Details

    I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running hdfs and Hive 1.2.1 (IBM-12).

    I have a table of several billion rows. When I perform a simple

    select count(*) from mytable;

    hive creates hundreds of map tasks, but only 4 are running simultaneously.

    This means that my cluster is mostly idle during the query, which seems wasteful. I have tried ssh'ing to the nodes in use and they are not utilizing CPU or memory fully. Our cluster is backed by Infiniband networking and Isilon file storage, neither of which seems very loaded at all.

    We are using mapreduce as the engine. I have tried removing any limits to resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).

    The memory settings are as follows:

    yarn.nodemanager.resource.memory-mb     188928 MB
    yarn.scheduler.minimum-allocation-mb     20992 MB
    yarn.scheduler.maximum-allocation-mb    188928 MB
    yarn.app.mapreduce.am.resource.mb        20992 MB
    mapreduce.map.memory.mb                  20992 MB
    mapreduce.reduce.memory.mb               20992 MB

    and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
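
    As a rough sanity check of that arithmetic (a sketch only, assuming memory is the sole constraint on container allocation and ignoring the ApplicationMaster container):

        # Back-of-the-envelope ceiling implied by the memory settings above,
        # assuming memory is the only constraint on container allocation.
        nodes = 41
        node_memory_mb = 188928        # yarn.nodemanager.resource.memory-mb
        container_memory_mb = 20992    # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb

        containers_per_node = node_memory_mb // container_memory_mb   # 9
        total_containers = nodes * containers_per_node                 # 369
        print(containers_per_node, total_containers)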

    Vcore settings:

    yarn.nodemanager.resource.cpu-vcores         24
    yarn.scheduler.minimum-allocation-vcores      1
    yarn.scheduler.maximum-allocation-vcores     24
    yarn.app.mapreduce.am.resource.cpu-vcores     1
    mapreduce.map.cpu.vcores                      1
    mapreduce.reduce.cpu.vcores                   1
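
    These vcore settings do not look like the limit: with 1 vcore per task and 24 vcores per NodeManager, CPU alone would allow far more concurrent containers per node than the memory figures above. A rough comparison of the two per-node ceilings (assuming the scheduler enforces both dimensions; by default YARN's capacity scheduler accounts for memory only):

        # Per-node container ceilings implied by each resource dimension.
        memory_ceiling = 188928 // 20992   # 9 containers per node by memory
        vcore_ceiling = 24 // 1            # 24 containers per node by vcores
        print(min(memory_ceiling, vcore_ceiling))   # memory is the tighter bound here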

    • Is there a way to get hive/mapreduce to use more of my cluster?
    • How would I go about figuring out the bottleneck?
    • Could it be that Yarn is not assigning tasks fast enough?

    I guess that using tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed at the moment).

    Solution

    Running parallel tasks depends on your memory settings in YARN. For example, if you have 4 data nodes and your YARN memory properties are defined as below:

    yarn.nodemanager.resource.memory-mb     1 GB
    yarn.scheduler.minimum-allocation-mb    1 GB
    yarn.scheduler.maximum-allocation-mb    1 GB
    yarn.app.mapreduce.am.resource.mb       1 GB
    mapreduce.map.memory.mb                 1 GB
    mapreduce.reduce.memory.mb              1 GB

    According to these settings, you have 4 data nodes, so the total yarn.nodemanager.resource.memory-mb available for launching containers is 4 GB. Since each container takes 1 GB of memory, at any given point in time you can launch 4 containers. One of them will be used by the application master, so you can run at most 3 mapper or reducer tasks at any given time, because the application master, each mapper and each reducer all use 1 GB of memory.
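
    A minimal sketch of that reasoning, using the hypothetical 4-node / 1 GB figures above:

        # Concurrent task ceiling for the hypothetical example: 4 nodes with 1 GB
        # per NodeManager, and every container (AM, mapper, reducer) requesting 1 GB.
        nodes = 4
        node_memory_gb = 1        # yarn.nodemanager.resource.memory-mb per node
        container_memory_gb = 1   # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb

        total_containers = nodes * (node_memory_gb // container_memory_gb)  # 4
        max_concurrent_tasks = total_containers - 1   # one container goes to the application master
        print(max_concurrent_tasks)                   # 3

    By the same formula, raising the per-node memory to, say, 4 GB while keeping 1 GB containers would allow 4 * 4 - 1 = 15 concurrent tasks, which is the point of the advice below.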

    So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks.

    P.S. - Here we are talking about the maximum number of tasks that can be launched; in practice it may be somewhat less than that.
