How to improve the performance of a GeoIP query in BigQuery?

Updated: 2024-10-08 13:36:02
Problem description

I have loaded my application logs into BigQuery and need to determine the country for each log entry from its IP address.

I have written a join query between my table and a GeoIP mapping table that I downloaded from MaxMind.

The ideal query would be an OUTER JOIN with a range filter, but BQ supports only = in join conditions. So the query does an INNER JOIN on a constant key and handles missing values on each side of the JOIN.
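To make the workaround concrete, here is a minimal Python sketch (not BigQuery code) of what a constant-key join amounts to: every log row is paired with every range row, and the range check only happens afterwards. The function names, the sample ranges, and the country labels are made up for illustration; `ip_to_int` plays a role similar to `PARSE_IP` and assumes well-formed IPv4 strings.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string to its integer code; None stays None."""
    if ip is None:
        return None
    return int(ipaddress.IPv4Address(ip))


def cross_join_lookup(ips, ranges):
    """Pair every IP with every range (the constant-key join), then filter."""
    results = []
    comparisons = 0
    for ip in ips:
        code = ip_to_int(ip)
        for from_code, to_code, country in ranges:
            comparisons += 1  # one candidate pair produced by the join
            if code is not None and from_code <= code <= to_code:
                results.append((ip, country))
    return results, comparisons


# Hypothetical sample data, not taken from the real GeoIP table.
sample_ranges = [
    (ip_to_int("1.0.0.0"), ip_to_int("1.0.0.255"), "AU"),
    (ip_to_int("39.0.0.0"), ip_to_int("39.255.255.255"), "KR"),
]
matches, comparisons = cross_join_lookup(["39.1.2.3", "1.0.0.7"], sample_ranges)
print(matches, comparisons)
```

The number of candidate pairs is the product of the two table sizes, which is why this approach becomes slow as the log table grows.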

I have amended my original query so that it can run on the Wikipedia public data set.

Can someone please help me make this run faster?

    SELECT id, client_ip, client_ip_code, B.Country_Name AS Country_Name
    FROM (
      SELECT id, contributor_ip AS client_ip,
             INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
             1 AS One
      FROM [publicdata:samples.wikipedia]
      LIMIT 1000
    ) AS A1
    JOIN (
      SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
      FROM
        -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
        -- All ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name
         FROM [QA_DATASET.GeoIP]),
        -- Missing ranges below From_IP:
        (SELECT PriorRangeEndIP + 1 From_IP_Code,
                From_IP_Code - 1 AS To_IP_Code,
                'NA' AS Country_Name
         FROM
           -- Use the LAG function to find the prior valid range
           (SELECT From_IP_Code, To_IP_Code, Country_Name,
                   LAG(To_IP_Code, 1, INTEGER(0))
                     OVER (ORDER BY From_IP_Code ASC) PriorRangeEndIP
            FROM [QA_DATASET.GeoIP]) A
         -- If the gap from the prior valid range is > 1, it is a gap to fill
         WHERE From_IP_Code > PriorRangeEndIP + 1),
        -- Missing range above the maximum To_IP:
        (SELECT MAX(To_IP_Code) + 1 AS From_IP_Code,
                INTEGER(4311810304) AS To_IP_Code,
                'NA' AS Country_Name
         FROM [QA_DATASET.GeoIP])
    ) AS B
    ON A1.ONE = B.ONE  -- fake join condition, since only = is allowed in joins
    -- Join condition where a valid IP exists on the left
    WHERE A1.client_ip_code >= B.From_IP_Code
      AND A1.client_ip_code <= B.To_IP_Code
      -- Where there is no valid IP in contributor_ip on the left
      OR (A1.client_ip_code IS NULL AND B.From_IP_Code = 1)
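The gap-filling part of the query above (the three IP sets) can be easier to follow outside SQL. The following is a rough Python sketch of the same idea, assuming the input ranges do not overlap; `fill_gaps` and its parameters are hypothetical names, and the default upper bound simply mirrors the constant used in the query.

```python
def fill_gaps(ranges, max_code=4311810304, filler="NA"):
    """Return the ranges plus filler rows covering every uncovered span.

    ranges: list of (from_code, to_code, country), assumed non-overlapping.
    """
    filled = []
    prior_end = 0  # same default as LAG(To_IP_Code, 1, INTEGER(0))
    for from_code, to_code, country in sorted(ranges):
        # A gap exists when this range does not start right after the prior one.
        if from_code > prior_end + 1:
            filled.append((prior_end + 1, from_code - 1, filler))
        filled.append((from_code, to_code, country))
        prior_end = to_code
    # Final filler row from the last covered code up to the upper bound.
    filled.append((prior_end + 1, max_code, filler))
    return filled


# Small made-up ranges to show the three kinds of rows.
demo = fill_gaps([(10, 20, "A"), (30, 40, "B")], max_code=100)
print(demo)
```

When the first real range starts above 1, the first filler row starts at 1, which appears to be the row that the `B.From_IP_Code = 1` condition relies on for logs without a valid IP.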

Solution

A cleaned-up version of this answer is at: googlecloudplatform.blogspot/2014/03/geoip-geolocation-with-google-bigquery.html

Let me tidy the original query:

    SELECT id, client_ip, client_ip_code, B.Country_Name AS Country_Name
    FROM (
      SELECT id, contributor_ip AS client_ip,
             INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
             1 AS One
      FROM [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT 1000
    ) AS A1
    LEFT JOIN (
      SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
      FROM
        -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
        -- All ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name
         FROM [playscape-proj:GeoIP.GeoIP]),
        -- Missing ranges below From_IP:
        (SELECT PriorRangeEndIP + 1 From_IP_Code,
                From_IP_Code - 1 AS To_IP_Code,
                'NA' AS Country_Name
         FROM
           -- Use the LAG function to find the prior valid range
           (SELECT From_IP_Code, To_IP_Code, Country_Name,
                   LAG(To_IP_Code, 1, INTEGER(0))
                     OVER (ORDER BY From_IP_Code ASC) PriorRangeEndIP
            FROM [playscape-proj:GeoIP.GeoIP]) A
         -- If the gap from the prior valid range is > 1, it is a gap to fill
         WHERE From_IP_Code > PriorRangeEndIP + 1),
        -- Missing range above the maximum To_IP:
        (SELECT MAX(To_IP_Code) + 1 AS From_IP_Code,
                INTEGER(4311810304) AS To_IP_Code,
                'NA' AS Country_Name
         FROM [playscape-proj:GeoIP.GeoIP])
    ) AS B
    ON A1.ONE = B.ONE  -- fake join condition, since only = is allowed in joins
    -- Join condition where a valid IP exists on the left
    WHERE A1.client_ip_code >= B.From_IP_Code
      AND A1.client_ip_code <= B.To_IP_Code
      -- Where there is no valid IP in contributor_ip on the left
      OR (A1.client_ip_code IS NULL AND B.From_IP_Code = 1);

That's a long query (and a very interesting one)! It runs in 14 seconds. How can we optimize it?

Some tricks I found:

• Skip NULLs. If there is no IP address in a log, don't try to match it.
• Reduce the combinations. Instead of JOINing every left-side record with every right-side record, join only the 39.x.x.x records on the left side with the 39.x.x.x records on the right side. Only a few (3 or 4) rules cover ranges that span more than one such block, and it would be easy to add a couple of rules to the geolite table to cover these gaps.
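As an illustration of the two tricks, here is a small Python sketch (not BigQuery code) that groups ranges by the first octet of their start address and only compares an IP against ranges in its own group. The names `bucket` and `bucketed_lookup` and the sample values are assumptions made for this example.

```python
BLOCK = 256 * 256 * 256  # number of addresses sharing one first octet


def bucket(code):
    """First octet of an IPv4 integer code, used as the join key."""
    return code // BLOCK


def bucketed_lookup(codes, ranges):
    """Match each code only against ranges whose start is in the same block."""
    by_bucket = {}
    for from_code, to_code, country in ranges:
        by_bucket.setdefault(bucket(from_code), []).append(
            (from_code, to_code, country))
    results = {}
    comparisons = 0
    for code in codes:
        if code is None:
            continue  # skip NULLs: nothing to match
        for from_code, to_code, country in by_bucket.get(bucket(code), []):
            comparisons += 1
            if from_code <= code <= to_code:
                results[code] = country
    return results, comparisons


# Made-up ranges: the second one spans blocks 40 and 41.
demo_ranges = [
    (39 * BLOCK, 39 * BLOCK + 65535, "KR"),
    (40 * BLOCK, 42 * BLOCK - 1, "US"),
]
found, comparisons = bucketed_lookup(
    [39 * BLOCK + 5, 41 * BLOCK + 1, None], demo_ranges)
print(found, comparisons)
```

The address in block 41 is not matched even though a range covers it, because that range is filed only under the block of its start address. This is the kind of gap that extra rules would need to cover.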

So I'm changing:

• 1 AS One becomes INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One on the left side and INTEGER(From_IP_Code/(256*256*256)) AS One on the right side.
• Adding a 'WHERE contributor_ip IS NOT NULL'.

And now it runs in 3 seconds! 5% of the IPs could not be geolocated, probably because of the gaps described above (easy fix).
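The answer does not spell out the fix. One possible reading, sketched here in Python under that assumption, is to split any range that crosses a first-octet boundary into one row per block, so that every block it covers has a row to join against. `split_by_block` is a hypothetical helper, not part of the original answer.

```python
BLOCK = 256 * 256 * 256  # number of addresses sharing one first octet


def split_by_block(from_code, to_code, country):
    """Split one range into pieces that each stay inside a single block."""
    pieces = []
    start = from_code
    while start <= to_code:
        # Last address of the block that contains `start`.
        block_end = (start // BLOCK + 1) * BLOCK - 1
        end = min(to_code, block_end)
        pieces.append((start, end, country))
        start = end + 1
    return pieces


# A made-up range covering blocks 40 and 41 becomes two rows.
pieces = split_by_block(40 * BLOCK, 42 * BLOCK - 1, "US")
print(pieces)
```

Each resulting row has a start address in the block it covers, so an equality join on the first octet can find it.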

Now, how about going from LIMIT 1000 to LIMIT 300000? How long will it take?

37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right-side table into a static one: once computed it doesn't change at all, since it's just an expansion of the basic rules. Then you can use JOIN EACH.

    SELECT id, client_ip, client_ip_code, B.Country_Name AS Country_Name
    FROM (
      SELECT id, contributor_ip AS client_ip,
             INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
             INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
      FROM [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT 300000
    ) AS A1
    JOIN (
      SELECT From_IP_Code, To_IP_Code, Country_Name,
             INTEGER(From_IP_Code/(256*256*256)) AS One
      FROM
        -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
        -- All ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name
         FROM [playscape-proj:GeoIP.GeoIP]),
        -- Missing ranges below From_IP:
        (SELECT PriorRangeEndIP + 1 From_IP_Code,
                From_IP_Code - 1 AS To_IP_Code,
                'NA' AS Country_Name
         FROM
           -- Use the LAG function to find the prior valid range
           (SELECT From_IP_Code, To_IP_Code, Country_Name,
                   LAG(To_IP_Code, 1, INTEGER(0))
                     OVER (ORDER BY From_IP_Code ASC) PriorRangeEndIP
            FROM [playscape-proj:GeoIP.GeoIP]) A
         -- If the gap from the prior valid range is > 1, it is a gap to fill
         WHERE From_IP_Code > PriorRangeEndIP + 1),
        -- Missing range above the maximum To_IP:
        (SELECT MAX(To_IP_Code) + 1 AS From_IP_Code,
                INTEGER(4311810304) AS To_IP_Code,
                'NA' AS Country_Name
         FROM [playscape-proj:GeoIP.GeoIP])
    ) AS B
    ON A1.ONE = B.ONE  -- join on the first octet, since only = is allowed in joins
    -- Join condition where a valid IP exists on the left
    WHERE A1.client_ip_code >= B.From_IP_Code
      AND A1.client_ip_code <= B.To_IP_Code
      -- Where there is no valid IP in contributor_ip on the left
      OR (A1.client_ip_code IS NULL AND B.From_IP_Code = 1);

Published: 2023-10-19 17:34:27
Article link: https://www.elefans.com/category/jswz/34/1508267.html