SingleColumnValueFilter not returning the proper number of rows

Problem description

In our HBase table, each row has a column called the crawl identifier. Using a MapReduce job, we only want to process rows from a given crawl at any one time. To run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our job was not processing the correct number of rows.

I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we ran the same job with the filter added to the scan object, the count dropped to ~3000. The table itself was not modified during or between these two jobs.

Since adding the scan filter caused the visible rows to change so dramatically, we suspect that we simply built the filter incorrectly.

Our MapReduce job features a single mapper:

public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {
    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}

The filter setup is like this:

String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}

// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        CompareOp.EQUAL,
        Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
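For context, here is a minimal sketch of how such a scan is typically handed to the job via TableMapReduceUtil; the table name "crawl_table" and the job bootstrap are assumptions, not part of the original setup:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

// Assumes 'configuration' and 'scan' are built as in the snippets above.
Job job = Job.getInstance(configuration, "crawl-row-count"); // Hadoop 2-style API
job.setJarByClass(RowCountMapper.class);
TableMapReduceUtil.initTableMapperJob(
        "crawl_table",                // hypothetical table name
        scan,                         // the scan carrying the SingleColumnValueFilter
        RowCountMapper.class,         // the mapper shown above
        ImmutableBytesWritable.class, // mapper output key class
        Put.class,                    // mapper output value class
        job);
job.waitForCompletion(true);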

Are we using the wrong filter, or have we configured it wrong?

EDIT: we're looking at manually adding all the column families as per issues.apache/jira/browse/HBASE-2198, but I'm pretty sure the Scan includes all the families by default.
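Regarding HBASE-2198: if you want to rule that out explicitly, adding the filtered column's family to the scan is a one-liner. A sketch against the filter setup above (harmless if the scan already includes all families by default):

// Explicitly request the family the filter inspects, so the
// crawl-identifier column is guaranteed to reach the filter.
scan.addFamily(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily());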

Solution

The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter is using Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the native character encoding in HBaseSchema, or when you write the record if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter is comparing like for like in the filtered scan.

Bytes.toBytes(crawlIdentifier)
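A quick way to check whether encoding is the culprit is to compare the bytes the filter uses against the bytes String.getBytes() produces under other charsets. A standalone sketch (the sample identifier is made up; for pure-ASCII identifiers all three arrays match, so a mismatch only appears with non-ASCII characters or an unusual platform default charset):

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.hbase.util.Bytes;

public class EncodingCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String crawlIdentifier = "crawl-2012-05-01"; // hypothetical value

        byte[] filterBytes = Bytes.toBytes(crawlIdentifier);        // always UTF-8
        byte[] defaultBytes = crawlIdentifier.getBytes();           // platform default charset
        byte[] latin1Bytes = crawlIdentifier.getBytes("ISO-8859-1");

        // If either of these prints false, the EQUAL comparison in
        // SingleColumnValueFilter would silently fail for that row.
        System.out.println(Bytes.equals(filterBytes, defaultBytes));
        System.out.println(Bytes.equals(filterBytes, latin1Bytes));
    }
}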

[1] hbase.apache/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] docs.oracle/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
