tech spark presentation

Big Data Stephen Borg Tech Spark – Microsoft Innovation Center

Upload: stephen-borg

Post on 07-Jan-2017


Page 1: Tech Spark Presentation

Big Data

Stephen Borg

Tech Spark – Microsoft Innovation Center

Page 2: Tech Spark Presentation

What we are covering… today

• Problems that Big Data can help with
• Infrastructure setup through Azure
• Architecture for our use case
• Spark application, NoSQL Database
• Presentation of data through Zeppelin

Page 3: Tech Spark Presentation

Problems…

• Time… ETL process taking most of the night…
• Process crashes during peak times due to application processing
• Outgrown what you can do with an RDBMS such as MySQL, SQL Server, Oracle
• Scaling out your environment is a big issue

Page 4: Tech Spark Presentation

Problems…

• Non-structured data doesn’t fit my RDBMS
• Hitting the limit with optimisation…
• Optimisation on the backend, or in SQL, only extends the time
• Not being able to be proactive due to analytics taking too long to process
• Analytics not being on time!

Page 5: Tech Spark Presentation

Moving on to Hadoop

Typical components of big data systems
• Distributed databases: HBase, Hive
• Distributed processing systems: MapReduce & YARN
• Distributed file systems: HDFS

Page 6: Tech Spark Presentation

Industry wide problem : Payment Fraud

• Merchants are losing $250B globally
• Cost of fraud is around 1% of revenue for retailers (2014)
• Fraud increases due to newer channels on the market

• Reference : http://www.lexisnexis.com/risk/downloads/assets/true-cost-fraud-2014.pdf

Page 7: Tech Spark Presentation

The requirement…

Fraud toolkit

Knowledge from past payment fraud

Batch processing

RT Analytics

Interactive Analytics

Page 8: Tech Spark Presentation

Fraud : Anomaly

• Generic rules
• Base rules on scores
• Use of models to detect fraud

CATCH THEM IN THE ACT!

Page 9: Tech Spark Presentation

Fraudsters : flag examples

• Stolen cards
• Buy expensive items
• In larger than usual quantities
• Very quickly
• At odd hours based on country
• Risky country?
• Blacklisted IP
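
As a rough illustration of how flags like these can be turned into a score, here is a minimal Python sketch. It is not the demo's code: the thresholds, the placeholder country code "XX", the documentation-range IP and the function names are all assumptions. The "stolen cards" and "very quickly" flags are left out because they need a card list and per-customer history.

```python
from datetime import datetime, timezone

# Hypothetical thresholds and lookup sets; none of these values come from the talk.
RISKY_COUNTRIES = {"XX"}
BLACKLISTED_IPS = {"203.0.113.7"}
MAX_ITEM_PRICE = 1_000.0
MAX_QUANTITY = 10
ODD_HOURS = range(1, 5)  # 01:00-04:59, evaluated in UTC for simplicity


def fraud_flags(tx):
    """Return the names of the rules that fire for one transaction dict."""
    flags = []
    if tx["itemPrice"] > MAX_ITEM_PRICE:
        flags.append("expensive_item")
    if tx["quantity"] > MAX_QUANTITY:
        flags.append("large_quantity")
    # Timestamps are epoch milliseconds, as in the mock payments.
    hour = datetime.fromtimestamp(tx["timestamp"] / 1000, tz=timezone.utc).hour
    if hour in ODD_HOURS:
        flags.append("odd_hour")
    if tx["countryCode"] in RISKY_COUNTRIES:
        flags.append("risky_country")
    if tx["ip"] in BLACKLISTED_IPS:
        flags.append("blacklisted_ip")
    return flags


def fraud_score(tx):
    """One point per rule that fires; a real system would likely weight rules."""
    return len(fraud_flags(tx))


sample = {"itemPrice": 2500.0, "quantity": 40, "countryCode": "XX",
          "ip": "203.0.113.7", "timestamp": 1479941448382}
print(fraud_score(sample), fraud_flags(sample))
```

Keeping each rule as a named flag makes it easy to store the reason for a score next to the score itself.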

Page 10: Tech Spark Presentation

What are the right tools?

Answer : There isn’t just one! However… Pick wisely…

Page 11: Tech Spark Presentation

Our selection of tools and languages…

Page 12: Tech Spark Presentation

Microsoft Azure for a virtual env.

Page 13: Tech Spark Presentation

Our node configuration…

[Diagram: cluster layout]
• Nodes: gateway, ambari, services01, services02, services03, Worker01, Worker02, Worker03
• Services shown: Resource Manager, Active NN, Secondary NN, HBase Master, Spark History Server, Zookeeper Server, Metrics & Ambari, Service Clients, Kafka Cluster (shown three times)
• Sizes shown: 2 core 14GB RAM (three times), 2 core 7GB RAM, 1 core 3.5GB
• Total of 8 cores and 42GB, 84GB SSD

Page 14: Tech Spark Presentation

Architecture working together

Page 15: Tech Spark Presentation

Setting up the infrastructure

• Static IPs & passwordless authentication from the Ambari node to all cluster nodes

Page 16: Tech Spark Presentation

Admin your cluster through Ambari

• Alerts
• Change configurations easily
• Handle rolling restarts
• Add more nodes to your cluster
• Manage upgrades of Hadoop

Page 17: Tech Spark Presentation

Brief overview of Ambari

Page 18: Tech Spark Presentation

First step : Mock Data!

• Using fluttercode random test data to mock payments
• Typical transaction:

{"cardNumber":"3584237251420382","longtitude":"46.88295","latitude":"36.54631","itemPrice":6.2724032E7,"quantity":1010366557,"currencyCode":"MXN","email":"[email protected]","ip":"184.149.76.20","username":"[email protected]","customerId":"1af87f6a-d6a8-4c7e-b5f5-d11410878eb8","countryCode":"KH","paymentMethodName":"Mastercard","productName":"Nike T-Shirt M","productCategory":"Clothes","premium":false,"timestamp":1479941448382}
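
To show the shape of such a generator, here is a minimal Python sketch that produces records with the same keys as the sample transaction. The demo's mocker is a Java application, so this is only an illustration: the value pools, the value ranges, the example.com addresses and the function name are assumptions.

```python
import json
import random
import time
import uuid

# Small illustrative value pools (assumptions, not the demo's data).
CURRENCIES = ["MXN", "EUR", "USD"]
COUNTRIES = ["KH", "MT", "GB"]
PRODUCTS = [("Nike T-Shirt M", "Clothes"), ("USB Cable", "Electronics")]


def mock_transaction(rng=random):
    """Build one random payment with the same keys as the sample above."""
    product, category = rng.choice(PRODUCTS)
    user = f"user{rng.randrange(10_000)}@example.com"
    return {
        "cardNumber": "".join(str(rng.randrange(10)) for _ in range(16)),
        "longtitude": f"{rng.uniform(-180, 180):.5f}",  # key spelled as in the sample
        "latitude": f"{rng.uniform(-90, 90):.5f}",
        "itemPrice": round(rng.uniform(1, 5000), 2),
        "quantity": rng.randint(1, 50),
        "currencyCode": rng.choice(CURRENCIES),
        "email": user,
        "ip": ".".join(str(rng.randrange(256)) for _ in range(4)),
        "username": user,
        "customerId": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "countryCode": rng.choice(COUNTRIES),
        "paymentMethodName": "Mastercard",
        "productName": product,
        "productCategory": category,
        "premium": rng.random() < 0.1,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
    }


print(json.dumps(mock_transaction()))
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable apart from the timestamp, which helps when replaying a test run.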

Page 19: Tech Spark Presentation

Sending data to Kafka

• Our Java application will be our data mocker, and sends data to Kafka as a producer
• Kafka uses Zookeeper for configurations
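
The producer side can be sketched in a few lines of Python. This is not the demo's Java mocker: it assumes each payment is sent as UTF-8 JSON keyed by customerId, and the topic name "Purchases" is a guess. The `producer` argument is any object with a `send(topic, key=..., value=...)` method, which is the shape of `KafkaProducer.send` in the kafka-python package; a printing stand-in is used here so the sketch runs without a broker.

```python
import json


def publish(producer, topic, transactions):
    """Send each transaction as UTF-8 JSON, keyed by its customerId.

    Returns the number of messages handed to the producer.
    """
    sent = 0
    for tx in transactions:
        key = tx["customerId"].encode("utf-8")
        value = json.dumps(tx).encode("utf-8")
        producer.send(topic, key=key, value=value)
        sent += 1
    return sent


class PrintingProducer:
    """Stand-in producer (hypothetical) that prints instead of sending."""

    def send(self, topic, key=None, value=None):
        print(topic, key.decode("utf-8"), value.decode("utf-8"))


publish(PrintingProducer(), "Purchases", [{"customerId": "c-1", "itemPrice": 9.5}])
```

Keying by customer is a design choice: with Kafka's default partitioner, messages with the same key go to the same partition, so one customer's payments stay in order.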

Page 20: Tech Spark Presentation

Messages received can be seen by Kafka console consumers

Page 21: Tech Spark Presentation

Preparing to store data - NoSQL

• Use of HBase
• Create several tables to store lookup data
• These tables can be populated by an overnight process
• Compare TX coming in with this data

Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al.

Page 22: Tech Spark Presentation

Create HBase tables & Insert Data

• Open the hbase shell and execute “list” to list the available tables

Page 23: Tech Spark Presentation

Querying Hbase has a bit of a learning curve…

• Query with a filter
scan 'PotentialFraudStream', { COLUMNS => ['info:Day'], FILTER => "RowFilter(=, 'substring:2016-11-24')" }

• Query limiting results
scan "PotentialFraudStream",{LIMIT=>5}

Page 24: Tech Spark Presentation

Apache Phoenix puts SQL back in NoSQL…

• Open up a session in Phoenix…
• Create a view that maps to our HBase table

CREATE VIEW "PotentialFraudStream" (
    "CustomerId" VARCHAR PRIMARY KEY,
    "info"."Score" VARCHAR,
    "info"."Lon" VARCHAR,
    "info"."Lat" VARCHAR,
    "info"."Message" VARCHAR,
    "info"."LastTransactionTime" VARCHAR,
    "info"."LastTransactionJSON" VARCHAR,
    "info"."Day" VARCHAR,
    "info"."Country" VARCHAR
);

Page 25: Tech Spark Presentation

And there you have it…

• We can use SQL again…

Page 26: Tech Spark Presentation

Benchmarking Phoenix

• Phoenix vs Hive (running over HDFS & HBase)

https://phoenix.apache.org/performance.html

Page 27: Tech Spark Presentation

Recap…

• So now we have our tables in HBase to be accessed in real time
• We have a table that can be queried by users in real time
• So let’s insert the lookup information
https://stephenborg.atlassian.net/wiki/display/BD/TechSpark+Demo

Page 28: Tech Spark Presentation

Apache Spark

• Large scale in-memory processing engine
• Spark Streaming processes data in micro batches
• Supports Java, Scala, Python & R
• Configure batches of X seconds and allocate resources to a context, which then processes the batches
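
The micro-batch idea can be illustrated without Spark. The Python sketch below groups events into consecutive windows of a fixed number of seconds. It is a simplification: Spark Streaming cuts batches by arrival time on a clock, while this sketch uses each event's own timestamp as a stand-in, and the function name and the 15 second width are assumptions chosen for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group events into consecutive windows of `batch_seconds`.

    `events` are dicts with a millisecond `timestamp`.
    Returns a list of (window_start_ms, [events]) ordered by window start.
    """
    width_ms = batch_seconds * 1000
    windows = defaultdict(list)
    for event in events:
        # Round the timestamp down to the start of its window.
        start = event["timestamp"] // width_ms * width_ms
        windows[start].append(event)
    return sorted(windows.items())


events = [{"timestamp": t} for t in (0, 4_000, 14_999, 15_000, 31_000)]
for start, batch in micro_batches(events, batch_seconds=15):
    print(start, len(batch))
```

With these five events the loop prints three windows: `0 3`, `15000 1` and `30000 1`. Each window is then processed as one small batch job, which is why the batch width trades latency against per-batch overhead.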

Page 29: Tech Spark Presentation

Submitting spark application to YARN

spark-submit --conf "spark.ui.port=4099" --master yarn \
  --class com.techsparkdemo.StreamBooter \
  /home/stephen.borg/spark-streamer-1.0-SNAPSHOT-jar-with-dependencies.jar \
  "/tmp/SparkCheckpoints/StreamingCheckpoint" "Tech Spark Demo" "yarn-client" 15 \
  "104.40.216.218:2181,52.174.110.65:2181,13.95.23.152:2181" "hdfs://13.95.23.152:8020" \
  "purchasesStreamingDemo" "Purchases" "/hbase-unsecure" 10

Additional options can be:
• --num-executors 3 (workers)
• --executor-memory 10G (memory allocated per executor)
• --driver-memory 2G (submitted via yarn-client, so the driver process running the application will have 2GB)

Page 30: Tech Spark Presentation

Spark History Server

• Optimize your spark application
• Know where resources are allocated, and debug failures

Page 31: Tech Spark Presentation

Recap again…

• So we now have our data mocker sending us data
• We have data being processed by the spark application, and potential fraud customers are funneled into a stream, and written into a specific table in HBase

Page 32: Tech Spark Presentation

Let’s now visualise that data…

• Zeppelin is a web based notebook that allows developers to create interactive analysis with different languages such as Scala, Java, SQL
• You tell Zeppelin which language you are supplying with the special reserved keyword %[interpreter] at the beginning of each notebook paragraph

Page 33: Tech Spark Presentation

What interpreters will we use?

• Up to the developer’s choice, but we will use
• %dep – to load a dependent JAR
• %jdbc (phoenix) – to communicate with Phoenix
• %angular – to plot an OpenStreetMap using angular

Page 34: Tech Spark Presentation

A mini ETL to show potential

• Step 1 : Load dependencies
• Step 2 : Collect and prepare data in an array using Scala
• Step 3 : Plot chart via angular script
• Step 4 : Schedule via embedded cron scheduler
• Next : Results can be acted upon
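
Step 2 is done in Scala in the demo; the Python sketch below only illustrates the idea of turning query rows into an array of map markers. The row layout (customer id, score, longitude, latitude, message), the marker field names and the function name are assumptions. The values arrive as strings because the Phoenix view declares its columns as VARCHAR.

```python
import json


def to_markers(rows):
    """Convert (customer_id, score, lon, lat, message) string rows to markers.

    Rows with missing or non-numeric coordinates are skipped.
    """
    markers = []
    for customer_id, score, lon, lat, message in rows:
        try:
            point = {"lat": float(lat), "lon": float(lon)}
        except (TypeError, ValueError):
            continue  # no usable position, so nothing to plot
        point.update(customerId=customer_id, score=score, message=message)
        markers.append(point)
    return markers


rows = [
    ("c-1", "4", "46.88295", "36.54631", "expensive item"),
    ("c-2", "2", None, "", "missing coordinates"),
]
print(json.dumps(to_markers(rows)))
```

Serialising the result as JSON gives a plain array that a front-end map script can iterate over.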

Page 35: Tech Spark Presentation

An OpenStreetMap for fraud

Page 36: Tech Spark Presentation

Most important point

• Phoenix uses JDBC and many tools can extract data either via UI or code

Page 37: Tech Spark Presentation

Points that can backfire

• So what if people buy expensive items? Rich person
• Someone can also buy a lot… Buying gifts
• Very quickly… Busy person
• At odd hours… Who doesn’t have a long night every now and then?

Page 38: Tech Spark Presentation

Room for improvement

• We can apply data science models to the spark application and not just rules

Page 39: Tech Spark Presentation

Worth mentioning

• ZeppelinHub for keeping track of your notebooks
• All that we mentioned today is documented https://stephenborg.atlassian.net/wiki/display/BD/TechSpark+Demo

• All source code for this demo https://bitbucket.org/stephenborg1987/techspark

Page 41: Tech Spark Presentation

Any help, questions

• Feel free to drop a line

[email protected]