SQL Server 2008填补了维度的空白(SQL Server 2008 filling gaps with dimension)

我有一个数据表如下

#data --------------- Account AccountType --------------- 1 2 2 0 3 5 4 2 5 1 6 5

AccountType 2是标题，5是总计。类型2的含义帐户必须检查下一个1或0以确定其Dim值是1还是0.类型5的总计必须查找最接近的1或0以确定其Dim值。类型1或0的帐户类型为Dim。

类型2的帐户显示为孤岛，因此它不足以仅检查RowNumber + 1，同样适用于类型5的帐户。

我使用CTE来到了下表。但是找不到快速的方法从这里到我的帐户，帐户类型，所有帐户的Dim的最终结果

T3 ------------------- StartRow EndRow AccountType Dim ------------------- 1 1 2 0 2 2 0 0 3 3 5 0 4 4 2 1 5 5 0 1 6 6 5 1

下面的代码是MS TSQL复制粘贴它并查看它运行。 CTE select语句的最终连接速度非常慢，即使是500行也需要30秒。我有100.000行我需要处理。我做了一个基于游标的解决方案，它可以在10-20秒内完成，可行且快速递归CTE解决方案在5秒内完成100.000行，但它依赖于#data表的碎片。我应该补充这是简化的真正的问题有很多需要考虑的维度。但对于这个简单的问题，它也会起作用。

无论如何，使用连接或其他基于集合的解决方案有一种快速的方法。

SET NOCOUNT ON IF OBJECT_ID('tempdb..#data') IS NOT NULL DROP TABLE #data CREATE TABLE #data ( Account INTEGER IDENTITY(1,1), AccountType INTEGER, ) BEGIN -- TEST DATA DECLARE @Counter INTEGER = 0 DECLARE @MaxDataRows INTEGER = 50 -- Change here to check performance DECLARE @Type INTEGER WHILE(@Counter < @MaxDataRows) BEGIN SET @Type = CASE WHEN @Counter % 10 < 3 THEN 2 WHEN @Counter % 10 >= 8 THEN 5 WHEN @Counter % 10 >= 3 THEN (CASE WHEN @Counter < @MaxDataRows / 2.0 THEN 0 ELSE 1 END ) ELSE 0 END INSERT INTO #data VALUES(@Type) SET @Counter = @Counter + 1 END END -- TEST DATA END ;WITH groupIds_cte AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountType ORDER BY Account) - Account AS GroupId FROM #data ), islandRanges_cte AS ( SELECT MIN(Account) AS StartRow, MAX(Account) AS EndRow, AccountType FROM groupIds_cte GROUP BY GroupId,AccountType ), T3 AS ( SELECT I.*, J.AccountType AS Dim FROM islandRanges_cte I INNER JOIN islandRanges_cte J ON (I.EndRow + 1 = J.StartRow AND I.AccountType = 2) UNION ALL SELECT I.*, J.AccountType AS Dim FROM islandRanges_cte I INNER JOIN islandRanges_cte J ON (I.StartRow - 1 = J.EndRow AND I.AccountType = 5) UNION ALL SELECT *, AccountType AS Dim FROM islandRanges_cte WHERE AccountType = 0 OR AccountType = 1 ), T4 AS ( SELECT Account, Dim FROM ( SELECT FlattenRow AS Account, StartRow, EndRow, Dim FROM T3 I CROSS APPLY (VALUES(StartRow),(EndRow)) newValues (FlattenRow) ) T ) --SELECT * FROM T3 ORDER BY StartRow --SELECT * FROM T4 ORDER BY Account -- Final correct result but very very slow SELECT D.Account, D.AccountType, I.Dim FROM T3 I INNER JOIN #data D ON D.Account BETWEEN I.StartRow AND I.EndRow ORDER BY Account

编辑一些时间测试

SET NOCOUNT ON IF OBJECT_ID('tempdb..#data') IS NULL CREATE TABLE #times ( RecId INTEGER IDENTITY(1,1), Batch INTEGER, Method NVARCHAR(255), MethodDescription NVARCHAR(255), RunTime INTEGER ) IF OBJECT_ID('tempdb..#batch') IS NULL CREATE TABLE #batch ( Batch INTEGER IDENTITY(1,1), Bit BIT ) INSERT INTO #batch VALUES(0) IF OBJECT_ID('tempdb..#data') IS NOT NULL DROP TABLE #data CREATE TABLE #data ( Account INTEGER ) CREATE NONCLUSTERED INDEX data_account_index ON #data (Account) IF OBJECT_ID('tempdb..#islands') IS NOT NULL DROP TABLE #islands CREATE TABLE #islands ( AccountFrom INTEGER , AccountTo INTEGER, Dim INTEGER, ) CREATE NONCLUSTERED INDEX islands_from_index ON #islands (AccountFrom, AccountTo, Dim) BEGIN -- TEST DATA INSERT INTO #data SELECT TOP 100000 ROW_NUMBER() OVER(ORDER BY t1.number) AS N FROM master..spt_values t1 CROSS JOIN master..spt_values t2 INSERT INTO #islands SELECT MIN(Account) AS Start, MAX(Account), Grp FROM (SELECT *, NTILE(10) OVER (ORDER BY Account) AS Grp FROM #data) T GROUP BY Grp ORDER BY Start END -- TEST DATA END --SELECT * FROM #data --SELECT * FROM #islands --PRINT CONVERT(varchar(20),DATEDIFF(MS,@RunDate,GETDATE()))+' ms Sub Query' DECLARE @RunDate datetime SET @RunDate=GETDATE() SELECT Account, (SELECT Dim From #islands WHERE Account BETWEEN AccountFrom AND AccountTo) AS Dim FROM #data INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'subquery','',DATEDIFF(MS,@RunDate,GETDATE())) SET @RunDate=GETDATE() SELECT D.Account, V.Dim FROM #data D CROSS APPLY ( SELECT Dim From #islands V WHERE D.Account BETWEEN V.AccountFrom AND V.AccountTo ) V INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'crossapply','',DATEDIFF(MS,@RunDate,GETDATE())) SET @RunDate=GETDATE() SELECT D.Account, I.Dim FROM #data D JOIN #islands I ON D.Account BETWEEN I.AccountFrom AND I.AccountTo INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'join','',DATEDIFF(MS,@RunDate,GETDATE())) SET @RunDate=GETDATE() ;WITH cte AS ( SELECT Account, AccountFrom, AccountTo, Dim, 1 AS Counting FROM #islands CROSS APPLY (VALUES(AccountFrom),(AccountTo)) V (Account) UNION ALL SELECT Account + 1 ,AccountFrom, AccountTo, Dim, Counting + 1 FROM cte WHERE (Account + 1) > AccountFrom AND (Account + 1) < AccountTo ) SELECT Account, Dim, Counting FROM cte OPTION(MAXRECURSION 32767) INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'recursivecte','',DATEDIFF(MS,@RunDate,GETDATE()))

您可以从#times表中选择以查看运行时间:)

I have a data table as below

#data --------------- Account AccountType --------------- 1 2 2 0 3 5 4 2 5 1 6 5

AccountType 2 is headers and 5 is totals. Meaning accounts of type 2 have to look after the next 1 or 0 to determin if its Dim value is 1 or 0. Totals of type 5 have to look up at nearest 1 or 0 to determin its Dim value. Accounts of type 1 or 0 have there type as Dim.

Accounts of type 2 appear as islands so its not enough to just check RowNumber + 1 and same goes for accounsts of type 5.

I have arrived at the following table using CTE's. But can't find a quick way to go from here to my final result of Account, AccountType, Dim for all accounts

T3 ------------------- StartRow EndRow AccountType Dim ------------------- 1 1 2 0 2 2 0 0 3 3 5 0 4 4 2 1 5 5 0 1 6 6 5 1

Below code is MS TSQL copy paste it all and see it run. The final join on the CTE select statement is extremly slow for even 500 rows it takes 30 sec. I have 100.000 rows i need to handle. I done a cursor based solution which do it in 10-20 sec thats workable and a fast recursive CTE solution that do it in 5 sec for 100.000 rows, but it dependent on the fragmentation of the #data table. I should add this is simplified the real problem have alot more dimension that need to be taking into account. But it will work the same for this simple problem.

Anyway is there a fast way to do this using joins or another set based solution.

SET NOCOUNT ON IF OBJECT_ID('tempdb..#data') IS NOT NULL DROP TABLE #data CREATE TABLE #data ( Account INTEGER IDENTITY(1,1), AccountType INTEGER, ) BEGIN -- TEST DATA DECLARE @Counter INTEGER = 0 DECLARE @MaxDataRows INTEGER = 50 -- Change here to check performance DECLARE @Type INTEGER WHILE(@Counter < @MaxDataRows) BEGIN SET @Type = CASE WHEN @Counter % 10 < 3 THEN 2 WHEN @Counter % 10 >= 8 THEN 5 WHEN @Counter % 10 >= 3 THEN (CASE WHEN @Counter < @MaxDataRows / 2.0 THEN 0 ELSE 1 END ) ELSE 0 END INSERT INTO #data VALUES(@Type) SET @Counter = @Counter + 1 END END -- TEST DATA END ;WITH groupIds_cte AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountType ORDER BY Account) - Account AS GroupId FROM #data ), islandRanges_cte AS ( SELECT MIN(Account) AS StartRow, MAX(Account) AS EndRow, AccountType FROM groupIds_cte GROUP BY GroupId,AccountType ), T3 AS ( SELECT I.*, J.AccountType AS Dim FROM islandRanges_cte I INNER JOIN islandRanges_cte J ON (I.EndRow + 1 = J.StartRow AND I.AccountType = 2) UNION ALL SELECT I.*, J.AccountType AS Dim FROM islandRanges_cte I INNER JOIN islandRanges_cte J ON (I.StartRow - 1 = J.EndRow AND I.AccountType = 5) UNION ALL SELECT *, AccountType AS Dim FROM islandRanges_cte WHERE AccountType = 0 OR AccountType = 1 ), T4 AS ( SELECT Account, Dim FROM ( SELECT FlattenRow AS Account, StartRow, EndRow, Dim FROM T3 I CROSS APPLY (VALUES(StartRow),(EndRow)) newValues (FlattenRow) ) T ) --SELECT * FROM T3 ORDER BY StartRow --SELECT * FROM T4 ORDER BY Account -- Final correct result but very very slow SELECT D.Account, D.AccountType, I.Dim FROM T3 I INNER JOIN #data D ON D.Account BETWEEN I.StartRow AND I.EndRow ORDER BY Account

EDIT with some time testing

SET NOCOUNT ON IF OBJECT_ID('tempdb..#data') IS NULL CREATE TABLE #times ( RecId INTEGER IDENTITY(1,1), Batch INTEGER, Method NVARCHAR(255), MethodDescription NVARCHAR(255), RunTime INTEGER ) IF OBJECT_ID('tempdb..#batch') IS NULL CREATE TABLE #batch ( Batch INTEGER IDENTITY(1,1), Bit BIT ) INSERT INTO #batch VALUES(0) IF OBJECT_ID('tempdb..#data') IS NOT NULL DROP TABLE #data CREATE TABLE #data ( Account INTEGER ) CREATE NONCLUSTERED INDEX data_account_index ON #data (Account) IF OBJECT_ID('tempdb..#islands') IS NOT NULL DROP TABLE #islands CREATE TABLE #islands ( AccountFrom INTEGER , AccountTo INTEGER, Dim INTEGER, ) CREATE NONCLUSTERED INDEX islands_from_index ON #islands (AccountFrom, AccountTo, Dim) BEGIN -- TEST DATA INSERT INTO #data SELECT TOP 100000 ROW_NUMBER() OVER(ORDER BY t1.number) AS N FROM master..spt_values t1 CROSS JOIN master..spt_values t2 INSERT INTO #islands SELECT MIN(Account) AS Start, MAX(Account), Grp FROM (SELECT *, NTILE(10) OVER (ORDER BY Account) AS Grp FROM #data) T GROUP BY Grp ORDER BY Start END -- TEST DATA END --SELECT * FROM #data --SELECT * FROM #islands --PRINT CONVERT(varchar(20),DATEDIFF(MS,@RunDate,GETDATE()))+' ms Sub Query' DECLARE @RunDate datetime SET @RunDate=GETDATE() SELECT Account, (SELECT Dim From #islands WHERE Account BETWEEN AccountFrom AND AccountTo) AS Dim FROM #data INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'subquery','',DATEDIFF(MS,@RunDate,GETDATE())) SET @RunDate=GETDATE() SELECT D.Account, V.Dim FROM #data D CROSS APPLY ( SELECT Dim From #islands V WHERE D.Account BETWEEN V.AccountFrom AND V.AccountTo ) V INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'crossapply','',DATEDIFF(MS,@RunDate,GETDATE())) SET @RunDate=GETDATE() SELECT D.Account, I.Dim FROM #data D JOIN #islands I ON D.Account BETWEEN I.AccountFrom AND I.AccountTo INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'join','',DATEDIFF(MS,@RunDate,GETDATE())) SET @RunDate=GETDATE() ;WITH cte AS ( SELECT Account, AccountFrom, AccountTo, Dim, 1 AS Counting FROM #islands CROSS APPLY (VALUES(AccountFrom),(AccountTo)) V (Account) UNION ALL SELECT Account + 1 ,AccountFrom, AccountTo, Dim, Counting + 1 FROM cte WHERE (Account + 1) > AccountFrom AND (Account + 1) < AccountTo ) SELECT Account, Dim, Counting FROM cte OPTION(MAXRECURSION 32767) INSERT INTO #times VALUES ((SELECT MAX(Batch) FROM #batch) ,'recursivecte','',DATEDIFF(MS,@RunDate,GETDATE()))

You can select from the #times table to see the run times :)

最满意答案

我想你想要一个联接，但是使用不平等而不是平等：

select tt.id, tt.dim1, it.dim2 from TallyTable tt join IslandsTable it on tt.id between it."from" and it."to"

这适用于您在问题中提供的数据。

这是另一个可能有用的想法。这是查询：

select d.*, (select top 1 AccountType from #data d2 where d2.Account > d.Account and d2.AccountType not in (2, 5) ) nextAccountType from #data d order by d.account;

我只是在50,000行上运行它，这个版本在我的系统上花了17秒。将表格更改为：

CREATE TABLE #data ( Account INTEGER IDENTITY(1,1) primary key, AccountType INTEGER, );

实际上已经将它减慢到大约1:33 - 令我惊讶的是。也许其中一个会帮助你。

I think you want a join, but using an inequality rather than an equality:

select tt.id, tt.dim1, it.dim2 from TallyTable tt join IslandsTable it on tt.id between it."from" and it."to"

This works for the data that you provide in the question.

Here is another idea that might work. Here is the query:

select d.*, (select top 1 AccountType from #data d2 where d2.Account > d.Account and d2.AccountType not in (2, 5) ) nextAccountType from #data d order by d.account;

I just ran this on 50,000 rows and this version took 17 seconds on my system. Changing the table to:

CREATE TABLE #data ( Account INTEGER IDENTITY(1,1) primary key, AccountType INTEGER, );

Has actually slowed it down to about 1:33 -- quite to my surprise. Perhaps one of these will help you.

更多推荐