Spark throws an error when reading a Hive table

Problem description


I am trying to run select * from db.abc in Hive; this Hive table was loaded using Spark.

It does not work and shows an error:

Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)

When I set the following properties, I am able to query the table from Hive:

set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false;
set hive.fetch.task.conversion=none;

Now, when I try to read the same Hive table db.abc using Spark, I am receiving the error below:

Clients can access this table only if they have the following capabilities: CONNECTORREAD,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE. This table may be a Hive-managed ACID table, or require some other capability that Spark currently does not implement; at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.throwIfNoAccess(ExternalCatalogUtils.scala:280) at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:105) at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:85) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:85) at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:83) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) at org.apache.spark.sql.catalyst.analysis.Analyzer$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:124) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:118) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:103) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided

Do I need to add any properties in spark-submit or the shell? Or what is the alternative way to read this Hive table using Spark?

Hive table sample format:

CREATE TABLE `hive` (
  `c_id` decimal(11,0),
  etc.........)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  path= 'hdfs://gjuyada/bbts/scl/raw'
TBLPROPERTIES (
  'bucketing_version'='2',
  'spark.sql.create.version'='2.3.2.3.1.0.0-78',
  'spark.sql.sources.provider'='orc',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"Czz_ID","type":"decimal(11,0)","nullable":true,"metadata":{}},{"name":"DzzzC_CD","type":"string","nullable":true,"metadata":{}},{"name":"C0000_S_N","type":"decimal(11,0)","nullable":true,"metadata":{}},{"name":"P_ _NB","type":"decimal(11,0)","nullable":true,"metadata":{}},{"name":"C_YYYY","type":"string","nullable":true,"metadata":{}},"type":"string","nullable":true,"metadata":{}},{"name":"Cv_ID","type":"string","nullable":true,"metadata":{}},
  'transactional'='true',
  'transient_lastDdlTime'='1574817059')

Solution

The issue is that you are trying to read a transactional table (transactional = true) into Spark.

Spark does not yet officially support Hive ACID tables. Take a full or incremental dump of the ACID table into a regular Hive ORC/Parquet partitioned table, then read that data using Spark.

There is an open Jira, SPARK-15348, to add support for reading Hive ACID tables.

  • If you run a major compaction on the ACID table (from Hive), then Spark is able to read only the base_XXX directories, not the delta directories; SPARK-16996 addresses this.

  • There are some workarounds to read ACID tables using SPARK-LLAP, as mentioned in this link.

  • I think that starting from HDP 3.x, the HiveWarehouseConnector is able to read Hive ACID tables (see the sketch after this list).

  • You can create a snapshot of the transactional table as a non-transactional table, and then read the data from that table:

    CREATE TABLE <non_trans> STORED AS ORC AS SELECT * FROM <transactional_table>;
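For the HiveWarehouseConnector route mentioned in the list above, here is a minimal sketch of what the read could look like from spark-shell on HDP 3.x. It assumes the hive-warehouse-connector assembly jar is on the Spark classpath and that the usual HWC settings (in particular spark.sql.hive.hiveserver2.jdbc.url) are configured for your cluster; db.abc is the table from the question:

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession (`spark`
// is the session provided by spark-shell).
val hive = HiveWarehouseSession.session(spark).build()

// The query runs through HiveServer2/LLAP, which understands ACID
// tables, and comes back as an ordinary Spark DataFrame.
val df = hive.executeQuery("SELECT * FROM db.abc")
df.show(10)

The exact jar path and connection settings depend on your HDP installation, so treat this as a starting point rather than a drop-in snippet.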

UPDATE:

1. Create an external Hive table:

CREATE EXTERNAL TABLE `<ext_tab_name>` (<col_name> <data_type> ....etc) STORED AS ORC LOCATION '<path>';

2. Then overwrite the above external table with the existing transactional table's data:

INSERT OVERWRITE TABLE <ext_tab_name> SELECT * FROM <transactional_tab_name>;
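After the overwrite, the external table is a plain (non-transactional) ORC table, so vanilla Spark can read it without any ACID support. A small sketch, using db.abc_ext as a hypothetical stand-in for <ext_tab_name>:

// "db.abc_ext" is a hypothetical name standing in for <ext_tab_name>.
val df = spark.sql("SELECT * FROM db.abc_ext")
// equivalently: val df = spark.read.table("db.abc_ext")
df.printSchema()
df.show(10)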
