In our HBase table, each row has a column called crawl identifier. At any one time, our MapReduce job should process only the rows from a given crawl. To run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.

I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job and added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or between these two jobs.
Since adding the scan filter caused the visible rows to change so dramatically, we expect that we simply built the filter incorrectly.
Our MapReduce job features a single mapper:
public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {

    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}

The filter setup is like this:
String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}

// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        CompareOp.EQUAL,
        Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);

Are we using the wrong filter, or have we configured it wrong?
EDIT: we're looking at manually adding all the column families as per issues.apache.org/jira/browse/HBASE-2198, but I'm pretty sure the Scan includes all the families by default.
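If the HBASE-2198 workaround does turn out to be needed, a minimal sketch of it, assuming the scan and schema constants built above, would be to add the filtered column's family explicitly so the filter can see the column it tests:

```java
// Sketch only: ensure the family holding the crawl identifier is part of
// the scan's results, so SingleColumnValueFilter (with setFilterIfMissing)
// does not drop rows merely because the tested column was not fetched.
scan.addFamily(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily());
```

This fragment is not runnable on its own; it depends on the Scan and HBaseSchema objects from the question.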
Solution
The filter looks correct, but one scenario that could cause this relates to character encoding. Your filter is using Bytes.toBytes(String), which always uses UTF-8 [1], whereas you might be using the platform's native character encoding in HBaseSchema, or when the record was written, if you used String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter is comparing like for like in the filtered scan.
Bytes.toBytes(crawlIdentifier)

[1] hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
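The mismatch described above can be demonstrated in plain Java, with no HBase dependency. The identifier value below is hypothetical; the point is that the UTF-8 bytes (what Bytes.toBytes(String) produces) differ from the bytes under another charset (what String.getBytes() can produce on a non-UTF-8 platform), so a byte-wise EQUAL comparison fails:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        // Hypothetical crawl identifier containing a non-ASCII character.
        String crawlIdentifier = "crawl-\u00e9-2012";

        // What HBase's Bytes.toBytes(String) does internally: always UTF-8.
        byte[] utf8 = crawlIdentifier.getBytes(StandardCharsets.UTF_8);

        // What String.getBytes() can do: the platform default charset,
        // e.g. ISO-8859-1 on some systems.
        byte[] latin1 = crawlIdentifier.getBytes(StandardCharsets.ISO_8859_1);

        // U+00E9 is two bytes in UTF-8 but one byte in ISO-8859-1,
        // so the arrays differ and a byte-wise comparison fails.
        System.out.println(Arrays.equals(utf8, latin1)); // false
    }
}
```

For pure-ASCII identifiers the two encodings happen to coincide, which is why this kind of bug typically only drops the subset of rows whose identifiers contain non-ASCII characters.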