Let's say I have a music video play stats table, mydataset.stats, for a given day (3B rows, 1M users, 6K artists). The simplified schema is: UserGUID String, ArtistGUID String
I need to pivot/transpose artists from rows to columns, so the schema will be: UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int, with each artist column holding the play count for the respective user.
An approach was suggested in "How to transpose rows to columns with large amount of the data in BigQuery/SQL?" and "How to create dummy variable columns for thousands of categories in Google BigQuery?", but it looks like it doesn't scale to the numbers in my example.
Can this approach be scaled for my example?
Solution
I tried the approach below for up to 6000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit for the number of columns in a table.
STEP 1 - Aggregate plays by user / artist
    SELECT userGUID AS uid, artistGUID AS aid, COUNT(1) AS plays
    FROM [mydataset.stats]
    GROUP BY 1, 2
STEP 2 - Normalize uid and aid so they are consecutive numbers 1, 2, 3, … . We need this for at least two reasons: a) to make the later dynamically created SQL as compact as possible, and b) to have more usable/friendly column names.
Combined with the first step, it will be:
    SELECT u.uid AS uid, a.aid AS aid, plays
    FROM (
      SELECT userGUID, artistGUID, COUNT(1) AS plays
      FROM [mydataset.stats]
      GROUP BY 1, 2
    ) AS s
    JOIN (
      SELECT userGUID, ROW_NUMBER() OVER() AS uid
      FROM [mydataset.stats]
      GROUP BY 1
    ) AS u ON u.userGUID = s.userGUID
    JOIN (
      SELECT artistGUID, ROW_NUMBER() OVER() AS aid
      FROM [mydataset.stats]
      GROUP BY 1
    ) AS a ON a.artistGUID = s.artistGUID
Let's write the output to table mydataset.aggs
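To make the intent of steps 1 and 2 concrete, here is a minimal pure-Python sketch of the same logic on a toy in-memory list. It is not BigQuery code, and the function name `aggregate_and_normalize` and the sample GUIDs are made up for illustration. It counts plays per (user, artist) pair and then replaces each GUID with a dense integer starting at 1.

```python
from collections import Counter


def aggregate_and_normalize(rows):
    """rows: iterable of (userGUID, artistGUID) play events.

    Returns (aggs, user_ids, artist_ids): aggs is a sorted list of
    (uid, aid, plays) tuples, and the two dicts map each GUID to a
    dense integer id starting at 1.
    """
    rows = list(rows)
    # STEP 1: the equivalent of COUNT(1) ... GROUP BY userGUID, artistGUID
    plays = Counter(rows)

    # STEP 2: dense ids. Sorting is only there to make this toy example
    # reproducible; ROW_NUMBER() OVER() in the SQL has no ORDER BY, so the
    # numbering it produces is arbitrary.
    user_ids = {g: i for i, g in enumerate(sorted({u for u, _ in rows}), start=1)}
    artist_ids = {g: i for i, g in enumerate(sorted({a for _, a in rows}), start=1)}

    aggs = sorted((user_ids[u], artist_ids[a], n) for (u, a), n in plays.items())
    return aggs, user_ids, artist_ids


# Hypothetical sample data: two users, two artists, four play events.
events = [("u-b", "art-x"), ("u-a", "art-y"), ("u-b", "art-x"), ("u-a", "art-x")]
aggs, user_ids, artist_ids = aggregate_and_normalize(events)
print(aggs)
```

The point of the dense ids is that column names such as a1, a2, … stay short, which keeps the generated SQL text small when thousands of columns are listed.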
STEP 3 - Use the approach already suggested (in the questions mentioned above) for N features (artists) at a time. In my particular example, by experimenting, I found that the basic approach works well for a number of features between 2000 and 3000. To be on the safe side, I decided to use 2000 features at a time.
The script below dynamically generates a query, which is then run to create the partitioned tables:
    SELECT 'SELECT uid,' +
      GROUP_CONCAT_UNQUOTED(
        'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
      )
      + ' FROM [mydataset.aggs] GROUP EACH BY uid'
    FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
The above query produces yet another query like the one below:
    SELECT uid,
      SUM(IF(aid=1,plays,NULL)) as a1,
      SUM(IF(aid=3,plays,NULL)) as a3,
      SUM(IF(aid=2,plays,NULL)) as a2,
      SUM(IF(aid=4,plays,NULL)) as a4
      . . .
    FROM [mydataset.aggs] GROUP EACH BY uid
This should be run and its result written to mydataset.pivot_1_2000
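If you prefer to build the query text on the client side instead of with GROUP_CONCAT_UNQUOTED, the same string can be assembled in a few lines of Python. This is only a sketch: `build_pivot_query` is a hypothetical helper, it assumes you already have the list of aid values, and it only produces the SQL text (running it and choosing the destination table is still up to you).

```python
def build_pivot_query(aids, lo, hi, source="[mydataset.aggs]"):
    """Return the legacy-SQL pivot query for feature ids lo..hi (inclusive)."""
    selected = sorted(a for a in set(aids) if lo <= a <= hi)
    if not selected:
        raise ValueError(f"no aids in range {lo}..{hi}")
    # One SUM(IF(...)) column per feature, named a<aid>.
    columns = ",".join(f"SUM(IF(aid={a},plays,NULL)) as a{a}" for a in selected)
    return f"SELECT uid,{columns} FROM {source} GROUP EACH BY uid"


# Toy call: only aids 1..3 are kept, 5 falls outside the requested range.
pivot_sql = build_pivot_query([3, 1, 2, 5], lo=1, hi=3)
print(pivot_sql)
```

Unlike the GROUP_CONCAT_UNQUOTED version, this sketch sorts the columns, so a1, a2, a3 come out in numeric order.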
Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN), we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000. As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
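The range bounds for each run are easy to get wrong by one, so here is a small Python sketch that lists them. The helper name `feature_partitions` is hypothetical; the chunk size of 2000 and the pivot_<lo>_<hi> table naming follow the steps above.

```python
def feature_partitions(max_aid, chunk=2000, dataset="mydataset"):
    """Yield (lo, hi, table_name) for consecutive inclusive aid ranges."""
    for lo in range(1, max_aid + 1, chunk):
        hi = min(lo + chunk - 1, max_aid)
        yield lo, hi, f"{dataset}.pivot_{lo}_{hi}"


partitions = list(feature_partitions(6000))
for lo, hi, table in partitions:
    # The HAVING clause uses strict inequalities, so its bounds sit just
    # outside the inclusive range lo..hi.
    print(f"HAVING aid > {lo - 1} and aid < {hi + 1}  ->  {table}")
```

If the number of features is not a multiple of the chunk size, the last range is simply shorter.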
STEP 4 - Merge all partitioned pivot tables into the final pivot table, with all features represented as columns in one table.
Same as in the steps above: first we need to generate the query and then run it. So, initially we will "stitch" mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, and then stitch the result with mydataset.pivot_4001_6000.
    SELECT 'SELECT x.uid uid,' +
      GROUP_CONCAT_UNQUOTED( 'a' + STRING(aid) )
      + ' FROM [mydataset.pivot_1_2000] AS x
    JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid '
    FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
The output string from the above should be run and the result written to mydataset.pivot_1_4000
Then we repeat STEP 4 as below:
    SELECT 'SELECT x.uid uid,' +
      GROUP_CONCAT_UNQUOTED( 'a' + STRING(aid) )
      + ' FROM [mydataset.pivot_1_4000] AS x
    JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid '
    FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
The result is to be written to mydataset.pivot_1_6000
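The repeated stitching can also be planned programmatically. The Python sketch below generates the chain of join queries and their destination tables. `build_stitch_plan` is a hypothetical helper, and it assumes the aid values are contiguous (which the ROW_NUMBER() normalization in step 2 should give you), so the column list is simply a1..aN. It only builds SQL text and does not run anything.

```python
def build_stitch_plan(partitions, dataset="mydataset"):
    """partitions: list of (lo, hi) inclusive aid ranges in ascending order.

    Returns a list of (query, destination_table) pairs. Each query joins the
    table accumulated so far with the next partition table on uid.
    """
    plan = []
    first_lo, acc_hi = partitions[0]
    for lo, hi in partitions[1:]:
        left = f"[{dataset}.pivot_{first_lo}_{acc_hi}]"
        right = f"[{dataset}.pivot_{lo}_{hi}]"
        # Assumes contiguous aids, so every column a<first_lo>..a<hi> exists.
        columns = ",".join(f"a{a}" for a in range(first_lo, hi + 1))
        query = (
            f"SELECT x.uid uid,{columns} FROM {left} AS x "
            f"JOIN EACH {right} AS y ON y.uid = x.uid"
        )
        acc_hi = hi
        plan.append((query, f"{dataset}.pivot_{first_lo}_{acc_hi}"))
    return plan


# Tiny ranges keep the printed queries readable; the real case would be
# [(1, 2000), (2001, 4000), (4001, 6000)].
plan = build_stitch_plan([(1, 2), (3, 4), (5, 6)])
for query, destination in plan:
    print(destination, "<-", query)
```

Each query in the plan must finish, and its result must be written to the listed destination table, before the next one is run, because the next query reads that table.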
The resulting table has the following schema:
    uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
NOTE:
a. I tried this approach only up to 6000 features and it worked as expected.
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables are relatively small (30-40MB), and so are the billed bytes. For "before 2016" projects everything is billed as tier 1, but after October 2016 this can be an issue. For more information, see Timing in High-Compute queries.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I can be wrong) that storing a materialized feature matrix is not the best idea.