Druid at Pulsar
Author: Xiaoming Zhang
A glance at Pulsar and Druid
Pulsar is an open source project from eBay that consists of two parts: the Pulsar pipeline and Pulsar reporting. The Pulsar pipeline is a streaming framework that distributes more than 8 billion events every day, and Pulsar reporting is responsible for storing, querying, and visualizing that data. Druid is part of Pulsar reporting.
This article introduces Druid, takes a brief deeper look at it, and shows the role it plays in Pulsar reporting.
Druid components introduction
Druid is an open source analytics data store designed for business intelligence (online analytical processing, OLAP) queries on event data.
Druid features (from the official website):
1. Sub-Second Queries
Supports multidimensional filtering and aggregation, and is able to target exactly the data needed to answer a query.
2. Real-Time Ingestion
Supports streaming data ingestion and offers insights on events immediately after they occur.
3. Scalable
Able to handle trillions of events in total and millions of events per second.
4. Highly Available
Built to back SaaS (software as a service) offerings that need to be up all the time; scaling up or down does not lose data.
5. Designed for Analytics
Supports many filters, aggregators, and query types, and allows new functionality to be plugged in.
Supports approximate algorithms for cardinality estimation and for histogram and quantile calculations.
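As a rough illustration of multidimensional filtering and aggregation, the sketch below builds a Druid native timeseries query as JSON. The data source `pulsar_event`, the dimensions `country` and `browserfamily`, the `count` metric, and the interval come from the segment metadata shown later in this article. The filter values, the helper name `build_timeseries_query`, and the output name `events` are assumptions made for illustration; the queries that Pulsar reporting actually issues are not given here.

```python
import json


def build_timeseries_query(data_source, interval, filters, granularity="minute"):
    """Build a Druid native timeseries query as a plain dict (hypothetical helper)."""
    # One selector filter per dimension, combined with a logical AND.
    fields = [
        {"type": "selector", "dimension": dimension, "value": value}
        for dimension, value in sorted(filters.items())
    ]
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],
        "filter": {"type": "and", "fields": fields},
        # Sum the ingested "count" metric to get the number of events per bucket.
        "aggregations": [
            {"type": "longSum", "name": "events", "fieldName": "count"}
        ],
    }


query = build_timeseries_query(
    data_source="pulsar_event",
    interval="2014-09-15T05:00:00.000-07:00/2014-09-15T06:00:00.000-07:00",
    # Filter values are assumptions; only the dimension names appear in the article.
    filters={"country": "US", "browserfamily": "Chrome"},
)
print(json.dumps(query, indent=2))
```

The resulting JSON document would typically be posted to a broker node over HTTP; the endpoint is deployment-specific and is left out of the sketch.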
A glance at the Druid structure of Pulsar reporting:
The cluster receives about 10 billion events per day, and peak traffic is about 200k events per second.
Each machine in our cluster has 128 GB of memory, and each historical node has more than 6 TB of disk.
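To put the traffic figures above in perspective, the short calculation below converts the daily volume into an average per-second rate and compares it with the stated peak. It uses only the two numbers quoted above and assumes the load is averaged over a full 24-hour day.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 10_000_000_000   # "about 10 billion events per day"
peak_events_per_second = 200_000  # "about 200k" at peak

# Average rate if the daily volume were spread evenly over the day.
average_events_per_second = events_per_day / SECONDS_PER_DAY
peak_to_average = peak_events_per_second / average_events_per_second

print(f"average: {average_events_per_second:,.0f} events/s")
print(f"peak is {peak_to_average:.2f}x the average")
```

Under that assumption the average is roughly 116k events per second, so the peak is a little under twice the average load.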
Druid at a glance:
Brief introduction to all nodes:
Real-time
A real-time node indexes incoming data, and the indexed data can be queried immediately. Real-time nodes build the data up into segments, and after a period of time each segment is handed over to a historical node.
An example of a real-time segment: 2015-11-18T06:00:00.000Z_2015-11-18T07:00:00.000Z, which is stored in the folder of the schema you defined. All segments are named in the above format.
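The segment name above encodes the start and end of the time interval the segment covers, separated by an underscore. A minimal sketch of reading that interval back, assuming the name always has exactly this UTC `...Z_...Z` shape (the function name `parse_segment_interval` is hypothetical):

```python
from datetime import datetime, timezone

# Timestamp layout used in the example name, e.g. 2015-11-18T06:00:00.000Z
SEGMENT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_segment_interval(name):
    """Split a '<start>_<end>' segment name into two UTC datetimes."""
    start_text, end_text = name.split("_")
    start = datetime.strptime(start_text, SEGMENT_TIME_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_text, SEGMENT_TIME_FORMAT).replace(tzinfo=timezone.utc)
    return start, end


start, end = parse_segment_interval("2015-11-18T06:00:00.000Z_2015-11-18T07:00:00.000Z")
print("start:", start.isoformat())
print("end:  ", end.isoformat())
print("covers", end - start)
```

For the example name, the segment covers exactly one hour of data.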
Here is the segment information in MySQL:
id | dataSource | created_date | start | end | partitioned | version | used | payload
pulsar_event_2014-09-15T05:00:00.000-07:00_2014-09-15T06:00:00.000-07:00_2014-09-15T05:00:00.000-07:00_1 | pulsar_event | 2014-09-15T09:37:30.231-07:00 | 2014-09-15T05:00:00.000-07:00 | 2014-09-15T06:00:00.000-07:00 | 1 | 2014-09-15T05:00:00.000-07:00 | 0 | {"dataSource":"pulsar_event","interval":"2014-09-15T05:00:00.000-07:00/2014-09-15T06:00:00.000-07:00","version":"2014-09-15T05:00:00.000-07:00","loadSpec":{"type":"hdfs","path":"hdfs://xxxx/20140915T050000.000-0700_20140915T060000.000-0700/2014-09-15T05_00_00.000-07_00/1/index.zip"},"dimensions":"browserfamily,browserversion,city,continent,country,deviceclass,devicefamily,eventtype,guid,js_ev_type,linespeed,osfamily,osversion,page,region,sessionid,site,tenant,timestamp,uid","metrics":"count","shardSpec":{"type
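The id in the row above appears to be built from the other columns: data source, start, end, and version, followed by a trailing `1`. The sketch below splits the id from the right so that the underscore inside `pulsar_event` stays intact. Reading the trailing number as a partition number is an assumption, the function name `split_segment_id` is hypothetical, and ids without that trailing number would need separate handling.

```python
def split_segment_id(segment_id):
    """Split '<dataSource>_<start>_<end>_<version>_<n>' from the right."""
    # The timestamps contain no underscores, so four right-hand splits are enough
    # even when the data source name itself contains underscores.
    data_source, start, end, version, partition = segment_id.rsplit("_", 4)
    return {
        "dataSource": data_source,
        "start": start,
        "end": end,
        "version": version,
        "partition": int(partition),  # assumed meaning of the trailing number
    }


segment_id = (
    "pulsar_event_2014-09-15T05:00:00.000-07:00"
    "_2014-09-15T06:00:00.000-07:00"
    "_2014-09-15T05:00:00.000-07:00_1"
)
parts = split_segment_id(segment_id)
for key, value in parts.items():
    print(f"{key}: {value}")
```

The recovered fields match the `dataSource`, `start`, `end`, and `version` columns of the same row.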