SingleColumnValueFilter not returning the proper number of rows

Problem description

In our HBase table, each row has a column called the crawl identifier. Using a MapReduce job, we only want to process rows from a given crawl at any one time. To run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our job was not processing the correct number of rows.

I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we ran the same job with the filter added to the scan object, the count dropped to ~3000. The table itself was not modified during or between these two jobs.

Since adding the scan filter caused the visible rows to change so dramatically, we suspect that we simply built the filter incorrectly.

Our MapReduce job features a single mapper:

public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {
    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}

The filter setup is like this:

String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}

// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        CompareOp.EQUAL,
        Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
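For context, here is a minimal sketch of how such a scan is typically handed to the job via TableMapReduceUtil; the table name "crawl_table" and the job bootstrap are assumptions, not part of the original setup:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

// Assumes 'configuration' and 'scan' are built as in the snippets above.
Job job = Job.getInstance(configuration, "crawl-row-count"); // Hadoop 2-style API
job.setJarByClass(RowCountMapper.class);
TableMapReduceUtil.initTableMapperJob(
        "crawl_table",                // hypothetical table name
        scan,                         // the scan carrying the SingleColumnValueFilter
        RowCountMapper.class,         // the mapper shown above
        ImmutableBytesWritable.class, // mapper output key class
        Put.class,                    // mapper output value class
        job);
job.waitForCompletion(true);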

Are we using the wrong filter, or have we configured it wrong?

EDIT: we're looking at manually adding all the column families as per issues.apache/jira/browse/HBASE-2198, but I'm pretty sure the Scan includes all the families by default.
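Regarding HBASE-2198: if you want to rule that out explicitly, adding the filtered column's family to the scan is a one-liner. A sketch against the filter setup above (harmless if the scan already includes all families by default):

// Explicitly request the family the filter inspects, so the
// crawl-identifier column is guaranteed to reach the filter.
scan.addFamily(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily());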

Solution

The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter is using Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the native character encoding in HBaseSchema, or when you write the record if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter is comparing like for like in the filtered scan.

Bytes.toBytes(crawlIdentifier)
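A quick way to check whether encoding is the culprit is to compare the bytes the filter uses against the bytes String.getBytes() produces under other charsets. A standalone sketch (the sample identifier is made up; for pure-ASCII identifiers all three arrays match, so a mismatch only appears with non-ASCII characters or an unusual platform default charset):

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.hbase.util.Bytes;

public class EncodingCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String crawlIdentifier = "crawl-2012-05-01"; // hypothetical value

        byte[] filterBytes = Bytes.toBytes(crawlIdentifier);        // always UTF-8
        byte[] defaultBytes = crawlIdentifier.getBytes();           // platform default charset
        byte[] latin1Bytes = crawlIdentifier.getBytes("ISO-8859-1");

        // If either of these prints false, the EQUAL comparison in
        // SingleColumnValueFilter would silently fail for that row.
        System.out.println(Bytes.equals(filterBytes, defaultBytes));
        System.out.println(Bytes.equals(filterBytes, latin1Bytes));
    }
}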

[1] hbase.apache/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] docs.oracle/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
