How do I run an MRJob on a local Hadoop cluster using Hadoop Streaming?


Question

I'm currently taking a Big Data class, and one of my projects is to run my mapper/reducer on a Hadoop cluster that is set up locally.

I've been using Python along with the MRJob library for the class.

Here is my current Python code for the mapper/reducer:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import os

WORD_RE = re.compile(r"[\w']+")
choice = ""


class MRPrepositionsFinder(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words),
            MRStep(reducer=self.reducer_find_prep_word)
        ]

    def mapper_get_words(self, _, line):
        # set word_list to indicators, convert to lowercase, and strip whitespace
        word_list = set(line.lower().strip() for line in open("/hdfs/user/user/indicators.txt"))

        # set fileName to map_input_file
        fileName = os.environ['map_input_file']

        # iterate through each word in line
        for word in WORD_RE.findall(line):
            # if word is in indicators, yield choice as filename
            if word.lower() in word_list:
                choice = fileName.split('/')[5]
                yield (choice, 1)

    def reducer_find_prep_word(self, choice, counts):
        # counts holds the 1s emitted for this choice,
        # so this yields key=choice, value=total count
        yield (choice, sum(counts))


if __name__ == '__main__':
    MRPrepositionsFinder.run()
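The word-matching part of the mapper can be tried out locally, without a cluster. The following is a minimal sketch and not part of the original question: the helper name `count_indicator_hits` and the sample indicator set are hypothetical, and it only assumes the regular expression the job uses for splitting a line into words.

```python
import re

# Pattern the job uses: runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def count_indicator_hits(line, indicators):
    """Count the words in `line` that appear in `indicators` (hypothetical helper).

    `indicators` is expected to be a set of lowercase words.
    """
    return sum(1 for word in WORD_RE.findall(line) if word.lower() in indicators)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample_indicators = {"of", "in", "on"}
    print(count_indicator_hits("One of the files is ON the cluster", sample_indicators))
```

Checking this logic in isolation helps separate problems in the job itself from problems in the Hadoop setup.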

When I try to run the code on my Hadoop cluster, I use the following command:

python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output

Unfortunately, every time I run the command I get the following error:

No configs found; falling back on auto-configuration
STDERR: Error: JAVA_HOME is not set and could not be found.
Traceback (most recent call last):
  File "hrc_discover.py", line 37, in <module>
    MRPrepositionsFinder.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 432, in run
    mr_job.execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 453, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 161, in execute
    self.run_job()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 231, in run_job
    runner.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/runner.py", line 437, in run
    self._run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 346, in _run
    self._find_binaries_and_jars()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 361, in _find_binaries_and_jars
    self.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 198, in get_hadoop_version
    return self.fs.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 117, in get_hadoop_version
    stdout = self.invoke_hadoop(['version'], return_stdout=True)
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 172, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'version']' returned non-zero exit status 1

I looked around the internet and found out that I need to export my JAVA_HOME variable, but I don't want to set anything that might break my setup.

Any help with this would be much appreciated, thanks!

Recommended answer

It seems the issue was in the etc/hadoop/hadoop-env.sh script file.

The JAVA_HOME environment variable was configured as:

export JAVA_HOME=$(JAVA_HOME)

So I went ahead and changed it to the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
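Before hard-coding a path like this, it may be worth confirming that the directory really contains a Java executable. The following is a minimal sketch, not part of the original answer: the helper name `looks_like_java_home` is hypothetical, and it assumes the usual JDK layout in which the binary lives at `bin/java` under the JAVA_HOME directory.

```python
import os


def looks_like_java_home(path):
    """Return True if `path` contains an executable bin/java (hypothetical helper)."""
    java_bin = os.path.join(path, "bin", "java")
    return os.path.isfile(java_bin) and os.access(java_bin, os.X_OK)


if __name__ == "__main__":
    # Path taken from the answer above; other systems will use a different one.
    candidate = "/usr/lib/jvm/java-8-openjdk"
    print(candidate, "->", looks_like_java_home(candidate))
```

This only inspects the filesystem. It does not read hadoop-env.sh or verify the Java version.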

I ran the following command again, hoping that it would work:

python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output

Thankfully, MRJob picked up the JAVA_HOME setting and produced the following output:

No configs found; falling back on auto-configuration
Using Hadoop version 2.7.3
Looking for Hadoop streaming jar in /home/hadoop/contrib...
Looking for Hadoop streaming jar in /usr/lib/hadoop-mapreduce...
Hadoop streaming jar not found. Use --hadoop-streaming-jar
Creating temp directory /tmp/hrc_discover.user.20170306.022649.449218
Copying local files to hdfs:///user/user/tmp/mrjob/hrc_discover.user.20170306.022649.449218/files/...
..

To fix the issue with the Hadoop streaming jar, I added the following switch to the command:

--hadoop-streaming-jar /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
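The jar's location varies between installations. The following is a minimal sketch, not part of the original answer, of one way to look for candidate streaming jars under an install root. The helper name `find_streaming_jars` is hypothetical, and it assumes the jar follows the `hadoop-streaming-*.jar` naming seen in the path above.

```python
import glob
import os


def find_streaming_jars(root):
    """Return sorted paths of hadoop-streaming-*.jar files found under `root`."""
    pattern = os.path.join(root, "**", "hadoop-streaming-*.jar")
    return sorted(glob.glob(pattern, recursive=True))


if __name__ == "__main__":
    # /usr/lib/hadoop is the install root implied by the path in the answer;
    # adjust it for other installations.
    for jar in find_streaming_jars("/usr/lib/hadoop"):
        print(jar)
```

Any path it prints can then be passed to `--hadoop-streaming-jar`.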

The full command looked like this:

python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-streaming-jar /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output

This was the resulting output:

No configs found; falling back on auto-configuration
Using Hadoop version 2.7.3
Creating temp directory /tmp/hrc_discover.user.20170306.022649.449218
Copying local files to hdfs:///user/user/tmp/mrjob/hrc_discover.user.20170306.022649.449218/files/...