I have a Spark Dataset of type Dataset[(String, Map[String, String])].
I need to insert it into a Cassandra table.
Here, the key (the String) in the Dataset[(String, Map[String, String])] will become the primary key of the row in Cassandra.
The Map in the Dataset[(String, Map[String, String])] will go into a column named ColumnNameValueMap in the same row.
The Dataset can have millions of rows.
I also want to do this in an optimal way (e.g., using batch inserts).
My Cassandra table structure is:
CREATE TABLE SampleKeyspace.CassandraTable (
    RowKey text PRIMARY KEY,
    ColumnNameValueMap map<text,text>
);

Please advise.
Answer

All you need is the Spark Cassandra Connector (preferably version 2.5.0, which was just released). It provides read and write functions for Datasets, so in your case it will be just:
import org.apache.spark.sql.cassandra._

your_data.write
  .cassandraFormat("CassandraTable", "SampleKeyspace")
  .mode("append")
  .save()

If your table doesn't exist yet, you can create it based on the structure of the Dataset itself; there are two functions, createCassandraTable and createCassandraTableEx. It's better to use the second, as it provides more control over table creation.
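As a sketch of how the pieces might fit together, the snippet below assumes a reachable Cassandra cluster, connector 2.5.x on the classpath, and a hypothetical connection host of 127.0.0.1; the column names RowKey and ColumnNameValueMap come from the table definition above, and the exact createCassandraTableEx signature should be checked against the connector docs for your version:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

// Sketch only: assumes a running Cassandra cluster and the
// Spark Cassandra Connector 2.5.x on the classpath.
val spark = SparkSession.builder()
  .appName("WriteMapToCassandra")
  .config("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
  .getOrCreate()
import spark.implicits._

// A tuple Dataset gets the default column names _1 and _2, so rename
// them to match the Cassandra columns before writing.
val ds = Seq(("key1", Map("colA" -> "1", "colB" -> "2"))).toDS()
val df = ds.toDF("RowKey", "ColumnNameValueMap")

// Create the table from the DataFrame's schema if it doesn't exist yet;
// createCassandraTableEx lets you choose the partition key explicitly.
df.createCassandraTableEx(
  "SampleKeyspace",
  "CassandraTable",
  partitionKeyColumns = Some(Seq("RowKey")))

// The connector groups writes into batches internally (tunable via the
// spark.cassandra.output.* settings), so no manual batching is needed
// even for millions of rows.
df.write
  .cassandraFormat("CassandraTable", "SampleKeyspace")
  .mode("append")
  .save()
```

Note the rename step: without it the connector would look for columns named _1 and _2 in the table and the write would fail.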
P.S. You can find more about the 2.5.0 release in the announcement blog post.