I am trying to do select * from db.abc in Hive; this Hive table was loaded using Spark. It does not work and shows the error:

Error: java.io.IOException: java.lang.IllegalArgumentException:
bucketId out of range: -1 (state=,code=0)

When I use the following properties, I am able to query the table from Hive:

set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false;
set hive.fetch.task.conversion=none;

Now, when I try to read the same Hive table db.abc using Spark, I am receiving the error below:

Clients can access this table only if they have the following
capabilities:
CONNECTORREAD,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE.
This table may be a Hive-managed ACID table, or require some other
capability that Spark currently does not implement; at
org.apache.spark.sql.catalyst.catalog.CatalogUtils$.throwIfNoAccess(ExternalCatalogUtils.scala:280)
at
org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:105)
at
org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:85)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at
org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:85)
at
org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:83)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84) at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392) at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:124)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:118)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:103)
at
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49
elided

Do I need to add any properties in spark-submit or the shell? Or what is the alternate way to read this Hive table using Spark?

Hive table sample format:

CREATE TABLE `db.abc`(
  `c_id` decimal(11,0),
  ...)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'path'='hdfs://gjuyada/bbts/scl/raw')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://gjuyada/bbts/scl/raw'
TBLPROPERTIES (
  'bucketing_version'='2',
  'spark.sql.create.version'='2.3.2.3.1.0.0-78',
  'spark.sql.sources.provider'='orc',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"Czz_ID","type":"decimal(11,0)","nullable":true,"metadata":{}},{"name":"DzzzC_CD","type":"string","nullable":true,"metadata":{}},{"name":"C0000_S_N","type":"decimal(11,0)","nullable":true,"metadata":{}},{"name":"P_ _NB","type":"decimal(11,0)","nullable":true,"metadata":{}},{"name":"C_YYYY","type":"string","nullable":true,"metadata":{}},...,{"name":"Cv_ID","type":"string","nullable":true,"metadata":{}},...]}',
  'transactional'='true',
  'transient_lastDdlTime'='1574817059')
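For context, the question does not show the exact Spark-side read, but a typical attempt that produces the error above is a plain Spark SQL query like the following sketch (standard SparkSession API only; the table name is the one from the question):

import org.apache.spark.sql.SparkSession

// Build a Hive-enabled SparkSession; on a cluster, spark-shell normally
// provides an equivalent `spark` object already.
val spark = SparkSession.builder()
  .appName("read-hive-table")
  .enableHiveSupport()
  .getOrCreate()

// Either of these goes through Spark's own Hive catalog and fails on an
// ACID table with "Clients can access this table only if they have the
// following capabilities: ...".
spark.sql("SELECT * FROM db.abc").show()
spark.read.table("db.abc").show()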
The issue is that you are trying to read a transactional table (transactional = true) into Spark. Spark does not yet officially support Hive-ACID tables. Take a
full dump or incremental dump of the ACID table into a regular Hive ORC/Parquet partitioned table, then read the data using Spark. There is an open Jira, SPARK-15348, to add support for reading Hive ACID tables.

If you run a major compaction on the ACID table (from Hive), Spark is able to read only the base_XXX directories, not the delta directories; SPARK-16996 addresses this.

There are some workarounds to read ACID tables using SPARK-LLAP, as mentioned in this link.

Starting from HDP-3.x, the HiveWarehouseConnector is able to read Hive ACID tables (see the sketch below).

You can also create a snapshot of the transactional table as a non-transactional table and then read the data from that table:

create table <non_trans> stored as orc as select * from <transactional_table>

UPDATE:

1. Create an external Hive table.
2. Then overwrite the above external table with the existing transactional table's data (both steps are sketched after the HiveWarehouseConnector example below).
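A minimal sketch of the HiveWarehouseConnector read on HDP 3.x. It assumes the hive-warehouse-connector assembly jar is on the classpath and spark.sql.hive.hiveserver2.jdbc.url points at HiveServer2 Interactive (LLAP); HiveWarehouseSession.session and executeQuery are the HWC API:

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession `spark`.
val hive = HiveWarehouseSession.session(spark).build()

// executeQuery runs the statement through LLAP, which understands ACID
// tables, so the transactional table becomes readable as a DataFrame.
val df = hive.executeQuery("SELECT * FROM db.abc")
df.show()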
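The snapshot approach, with a hypothetical target name (db.abc_snapshot is illustrative). The statement is ordinary HiveQL and can equally be run from beeline; here it is routed through the HWC session `hive` from the sketch above:

// Materialize the ACID table as a plain (non-transactional) ORC table.
hive.executeUpdate(
  "CREATE TABLE db.abc_snapshot STORED AS ORC AS SELECT * FROM db.abc")

// The snapshot is a regular table, so plain Spark can read it directly.
spark.read.table("db.abc_snapshot").show()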
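And a sketch of the two UPDATE steps. The external table name, column list, and location are hypothetical, since the original answer's exact DDL is not shown; it reuses `hive` and `spark` from the sketches above:

// 1. Create an external (non-ACID) Hive table with the same layout.
hive.executeUpdate(
  """CREATE EXTERNAL TABLE db.abc_ext (c_id DECIMAL(11,0))
    |STORED AS ORC
    |LOCATION 'hdfs://gjuyada/bbts/scl/ext'""".stripMargin)

// 2. Overwrite it with the transactional table's current data.
hive.executeUpdate("INSERT OVERWRITE TABLE db.abc_ext SELECT * FROM db.abc")

// Spark can then read the external table without any ACID capability.
spark.read.table("db.abc_ext").show()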