Combining the Square and the Circle: High-Performance Distributed Machine Learning with Spark on Angel
Origins
Tencent's product requirements
[Figure: two regimes, with d the model dimension and n the amount of data. Small Model on Big Data: d << n. Big Model on Sparse Big Data: d ≈ n.]
Goal: find an industrial-grade distributed machine learning platform that can handle models with billions of dimensions
[Figure: a single Driver holding the Model, surrounded by Executors.]
Bottlenecks of machine learning on Spark
One Issue: A Prototype of Parameter Server (2015)
https://issues.apache.org/jira/browse/SPARK-6932
Glint & Yahoo (2016)
Philosophy
[Figure: Spark Workers alongside Angel PS nodes. Spark: immutable. Angel: mutable.]
Combining the square and the circle
Spark on Angel
Core abstractions
Mapper / Reducer, RDD, PSModel
RDD vs PSModel
[Figure: over epoch-1 to epoch-5, Spark goes through a chain of RDDs (RDD-1, RDD-2, RDD-3, RDD-4, RDD-5), while one PSModel persists across all epochs.]
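The contrast between the two abstractions can be sketched in a few lines. This is a minimal illustration, not Spark or Angel code: `ImmutableDataset` and `MutableModel` are hypothetical stand-ins for an RDD and a PSModel, and the halving transformation is arbitrary.

```python
from typing import Callable, List, Tuple


class ImmutableDataset:
    """Hypothetical stand-in for an RDD: a transformation returns a new object."""

    def __init__(self, rows: Tuple[float, ...]):
        self._rows = tuple(rows)

    def map(self, fn: Callable[[float], float]) -> "ImmutableDataset":
        # The original dataset is left untouched.
        return ImmutableDataset(tuple(fn(x) for x in self._rows))

    def collect(self) -> List[float]:
        return list(self._rows)


class MutableModel:
    """Hypothetical stand-in for a PSModel: one object, updated in place."""

    def __init__(self, size: int):
        self.values = [0.0] * size

    def increment(self, delta: List[float]) -> None:
        for i, d in enumerate(delta):
            self.values[i] += d


def run_epochs(num_epochs: int):
    data = ImmutableDataset((1.0, 2.0, 3.0))
    model = MutableModel(3)
    datasets = [data]
    for _ in range(num_epochs):
        data = data.map(lambda x: x * 0.5)  # a new dataset object every epoch
        datasets.append(data)
        model.increment(data.collect())     # the same model object, mutated
    return datasets, model
```

After five epochs there are six distinct dataset objects but still a single model, which is the asymmetry the slide draws.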
Core abstractions of RDD
[Figure: an RDD is made of Partitions (Partition-1 ... Partition-n, stored as memory blocks or disk blocks on a node), a Compute Func, Dependencies, Preferred locations and Partitioners. RDDs are chained by operations (Transformation or Action).]
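Core abstractions of PSModel
[Figure: a PSClient pulls the PSModel M and pushes updates ΔM. A PSPartitioner splits the model into partitions (Partition1, Partition2, Partition3, ...) held in Shards on PSServers. Other labels: MatrixContext, Sync, Clock.]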
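The pull / push / clock vocabulary above can be made concrete with a small in-process sketch. Everything here is an assumption for illustration: `RangePartitioner` and `ShardedModel` are hypothetical names, the range-based split is only one possible partitioning, and no Angel API is being reproduced.

```python
from typing import Dict, Iterable, List


class RangePartitioner:
    """Hypothetical partitioner: contiguous index ranges, one per shard."""

    def __init__(self, dim: int, num_shards: int):
        self.width = -(-dim // num_shards)  # ceiling division

    def shard_of(self, index: int) -> int:
        return index // self.width


class ShardedModel:
    """Hypothetical model M stored across shards, updated with deltas."""

    def __init__(self, dim: int, num_shards: int):
        self.partitioner = RangePartitioner(dim, num_shards)
        self.shards: List[Dict[int, float]] = [dict() for _ in range(num_shards)]
        self.clocks: Dict[int, int] = {}

    def pull(self, indices: Iterable[int]) -> Dict[int, float]:
        # Read the current values of the requested coordinates.
        return {
            i: self.shards[self.partitioner.shard_of(i)].get(i, 0.0)
            for i in indices
        }

    def push(self, delta: Dict[int, float]) -> None:
        # Apply an additive update (the ΔM of the figure) shard by shard.
        for i, d in delta.items():
            shard = self.shards[self.partitioner.shard_of(i)]
            shard[i] = shard.get(i, 0.0) + d

    def clock(self, task_id: int) -> int:
        # Record that a task has finished one more iteration.
        self.clocks[task_id] = self.clocks.get(task_id, 0) + 1
        return self.clocks[task_id]
```

A task would typically pull the coordinates it needs, compute an update locally, push the delta, then advance its clock.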
Architecture of Spark on Angel
[Figure: the Spark Driver carries an AngelContext. Each Executor runs several TASKs over the Spark RDD and holds a PSModel, which talks through a PSAgent to the Parameter Server, made up of Shards on PSServers.]
[Figure: Workers, each running Tasks with a PsClient and a DataBlock, connect through PSAgents to the Parameter Server (Shards on PSServers). They pull Model M and push ΔM. Parameter Server labels: psFunc, Model Partitioner, syncProtocol.]
• A rich library of machine learning and mathematical computation
• Friendly user programming interfaces
• An industrial-grade parameter server
Angel compared with Glint
[Figure: a PSPartitioner splitting the model into Partition1, Partition2, Partition3, ...]
Richer model partitioning, more flexible asynchronous modes, more powerful psFunc
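One common way to get a range of synchronization modes out of per-task clocks is a bounded-staleness check. The slide does not say which modes Angel offers, so the sketch below is only an assumption about what "asynchronous modes" could look like; `can_proceed` and the `staleness` parameter are hypothetical names.

```python
from typing import List


def can_proceed(worker_clock: int, all_clocks: List[int], staleness: float) -> bool:
    """Allow a worker to start its next iteration only if it is at most
    `staleness` iterations ahead of the slowest worker.

    staleness = 0            -> fully synchronous (every worker waits)
    staleness = k            -> bounded staleness
    staleness = float('inf') -> fully asynchronous (nobody waits)
    """
    return worker_clock - min(all_clocks) <= staleness
```

For example, with clocks `[4, 3, 3]` the fastest worker is blocked under `staleness=0` but allowed to continue under `staleness=1`.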
Positioning of Angel
https://github.com/tencent/angel
Developing Spark on Angel
API design of Angel
[Figure: training workflow. Components: AngelClient, MLRunner, TrainTask, MLLearner, MLModel, PSModel, PSServer, DataBlock of LabeledData, HDFS.]
1. Start PS
2. Load Model
3. runTask
4. parse & preProcess
5. train
6. learn
7. push & pull
8. Save Model (to HDFS)
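The eight steps can be walked through with a toy, single-process sketch. It is not Angel code: `run_train_task` is a hypothetical function, the parameter server and HDFS are plain dictionaries, and the model is a one-weight linear regression chosen only to keep the example short.

```python
from typing import Dict, List, Tuple


def run_train_task(raw_lines: List[str], epochs: int = 50, lr: float = 0.1
                   ) -> Tuple[List[str], Dict[str, float]]:
    steps: List[str] = []

    ps = {"w": 0.0}                 # 1. start the (simulated) PS
    steps.append("start_ps")
    steps.append("load_model")      # 2. nothing to load here, start from zero
    steps.append("run_task")        # 3. the task begins

    # 4. parse each "label feature" line into a (label, feature) pair
    data = [tuple(float(tok) for tok in line.split()) for line in raw_lines]
    steps.append("parse_and_preprocess")

    steps.append("train")           # 5. the task hands data to the learner
    steps.append("learn")           # 6. the learner iterates
    for _ in range(epochs):
        w = ps["w"]                                        # 7. pull
        grad = sum((w * x - y) * x for y, x in data) / len(data)
        ps["w"] += -lr * grad                              # 7. push the update
    steps.append("push_and_pull")

    saved = dict(ps)                # 8. save the model (a dict instead of HDFS)
    steps.append("save_model")
    return steps, saved
```

Feeding it the lines `"2 1"`, `"4 2"`, `"6 3"` (label, feature) drives the weight towards 2.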
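API design of Spark on Angel
[Figure: three columns, Angel, Spark on Angel and Spark. Angel side: AngelClient, PSClient, PSServer with Shards. Spark side: RDD1, RDD2, RDD3, ... Spark on Angel in between: SparkPSContext, PSModel, { RDD_PS_Functions }, PSVector, PSMatrix, BreezePSVector, CachedPSVector.]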
Basic usage of Spark on Angel
<<trait>> NumericOps[T]
def round(t: T): T
def dot(t: T): T
def max(t: T): T
…
<<class>> BreezeVector and <<class>> BreezePSVector mix in the same trait and expose the same methods, so a BreezePSVector can transparently replace a BreezeVector. A BreezePSVector reaches the Angel PS through the PSAgent.
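The idea of substituting a PS-backed vector for a local one by sharing an interface can be sketched as follows. This is an assumption-laden illustration in Python rather than the Scala traits on the slide: `LocalVector`, `PSBackedVector` and the dictionary used as a fake parameter server are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class NumericOps(ABC):
    """Shared interface, loosely mirroring the trait named on the slide."""

    @abstractmethod
    def to_list(self) -> List[float]: ...

    @abstractmethod
    def dot(self, other: "NumericOps") -> float: ...

    @abstractmethod
    def max(self, other: "NumericOps") -> "NumericOps": ...


class LocalVector(NumericOps):
    """Values live in the caller's own memory."""

    def __init__(self, values: Iterable[float]):
        self.values = list(values)

    def to_list(self) -> List[float]:
        return list(self.values)

    def dot(self, other: NumericOps) -> float:
        return sum(a * b for a, b in zip(self.to_list(), other.to_list()))

    def max(self, other: NumericOps) -> "LocalVector":
        return LocalVector(max(a, b) for a, b in zip(self.to_list(), other.to_list()))


class PSBackedVector(NumericOps):
    """Values live in a (fake) parameter server; the object only keeps a key."""

    def __init__(self, ps: Dict[int, List[float]], values: Iterable[float]):
        self.ps = ps
        self.key = len(ps)
        ps[self.key] = list(values)

    def to_list(self) -> List[float]:
        return list(self.ps[self.key])

    def dot(self, other: NumericOps) -> float:
        return sum(a * b for a, b in zip(self.to_list(), other.to_list()))

    def max(self, other: NumericOps) -> "PSBackedVector":
        merged = [max(a, b) for a, b in zip(self.to_list(), other.to_list())]
        return PSBackedVector(self.ps, merged)  # the result also stays on the PS


def squared_norm(v: NumericOps) -> float:
    # Written once against the interface; works for either implementation.
    return v.dot(v)
```

Code such as `squared_norm` never needs to know where the data lives, which is what makes the replacement transparent.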
Transparent replacement of Vector
[Figure: inside an Executor, each Task works on BreezePSVectors, which go through the PSClient.]
Algorithms on Angel
Spark on Angel: Available
LR on Angel
[Figure: Workers sit between HDFS and the PS nodes. They pull parameters from the PS and push update values to the PS. Steps are numbered 0, 1, 2 on the slide.]
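The pull / compute / push loop for logistic regression can be sketched without any cluster. This is a minimal single-process simulation under stated assumptions: the parameter server is a plain list, each data shard plays the role of one worker, and `train_lr`, `worker_gradient` and the learning rate are hypothetical choices, not Angel's implementation.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, label in {0, 1})


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def worker_gradient(w: Sequence[float], shard: Sequence[Sample]) -> List[float]:
    """Gradient of the logistic loss summed over one worker's shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return grad


def train_lr(shards: Sequence[Sequence[Sample]], dim: int,
             epochs: int = 200, lr: float = 0.5) -> List[float]:
    ps_w = [0.0] * dim                      # weights held by the simulated PS
    n = sum(len(s) for s in shards)
    for _ in range(epochs):
        for shard in shards:                # each shard stands for one worker
            w = list(ps_w)                  # pull parameters from the PS
            g = worker_gradient(w, shard)   # compute on local data
            for j in range(dim):            # push the update value to the PS
                ps_w[j] -= lr * g[j] / n
    return ps_w
```

With two shards of linearly separable points (first feature a constant 1.0 as bias), the learned weights classify both sides correctly.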
[Spark on Angel] LR
[spark_on_angel_quick_start.md]
[Figure: on the Angel side, wPS and gradientPS, operated on with {BreezeOps}. On the Spark side, mapPartitions over sampleRDD. Other labels: DenseVector, Array.]
Optimization methods
[Spark on Angel] LR with Optimizer
[Figure: on the Angel side, wPS and statePS. On the Spark side, mapPartitions over sampleRDD. Other label: DenseVector.]
Optimizers from Breeze.optimizer: SGD, OWLQN, LBFGS
DiffFunction(BreezePSVector) : (Double, BreezePSVector)
[spark_on_angel_optimizer.md]
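The DiffFunction signature above describes a function that maps a weight vector to a (loss, gradient) pair, which any optimizer can then consume. The sketch below shows that contract in Python under simplifying assumptions: vectors are plain lists, the objective is a toy quadratic, and plain gradient descent is swapped in for the SGD, OWLQN and LBFGS optimizers named on the slide.

```python
from typing import Callable, List, Tuple

Vector = List[float]
DiffFunction = Callable[[Vector], Tuple[float, Vector]]


def quadratic(target: Vector) -> DiffFunction:
    """Toy objective 0.5 * ||w - target||^2, returned as (loss, gradient)."""

    def f(w: Vector) -> Tuple[float, Vector]:
        diff = [wi - ti for wi, ti in zip(w, target)]
        loss = 0.5 * sum(d * d for d in diff)
        return loss, diff  # the gradient of this objective is w - target

    return f


def gradient_descent(f: DiffFunction, w0: Vector,
                     step: float = 0.5, iters: int = 100) -> Vector:
    """Plain gradient descent; it only relies on the (loss, gradient) contract."""
    w = list(w0)
    for _ in range(iters):
        _, grad = f(w)
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w
```

Because the optimizer sees only the pair returned by `f`, the vector type behind it (local or PS-backed) can change without touching the optimizer.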
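GBDT: tree models + boosting
[Figure: tree 1 and tree 2 with split conditions Age<30, Wage<10K and IsMale? (Y/N branches), applied to five instances labelled A, B, C, D, E. Each prediction is the sum of the leaf values from the two trees: 5 + 0.5 = 5.5, 10 + 1.5 = 11.5, 1 + 1.5 = 2.5, 1 + 0.5 = 1.5, 1 + 1.5 = 2.5.]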
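A boosted prediction is the sum of one leaf value per tree, which a short sketch makes explicit. The leaf values (5, 10, 1 and 0.5, 1.5) and the split conditions come from the slide, but how the splits are arranged inside each tree is not recoverable from the extracted text, so the layout below is an assumption, as are the field names `age`, `wage` and `is_male`.

```python
from typing import Callable, Dict, Sequence

Person = Dict[str, object]


def tree1(person: Person) -> float:
    # Assumed layout: split on age first, then on wage for the younger branch.
    if person["age"] < 30:
        return 5.0 if person["wage"] < 10_000 else 10.0
    return 1.0


def tree2(person: Person) -> float:
    # Assumed layout: a single split on gender.
    return 0.5 if person["is_male"] else 1.5


def predict(person: Person,
            trees: Sequence[Callable[[Person], float]] = (tree1, tree2)) -> float:
    # Boosting: the ensemble output is the sum of every tree's leaf value.
    return sum(tree(person) for tree in trees)
```

Under this assumed layout, a 25-year-old man earning 5,000 lands in the leaves 5 and 0.5, giving 5.5.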
GBDT on Angel: model storage
[Figure: PS1, PS2 and PS3 each store feature ID, feature value and leaf prediction, together with the grad histogram and hess histogram.]
GBDT on Angel (1): building the forest
[Figure: Worker1, Worker2 and Worker3 communicating with PS1, PS2 and PS3.]
GBDT on Angel (2): splitting tree nodes
find split feature & value
[gbdt_on_angel.md]
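Finding a split from gradient and hessian histograms can be sketched as follows. This is a minimal illustration under assumptions: each worker builds a local histogram for one feature, the histograms are summed (the role a parameter server could play), and the split is scored with the widely used second-order gain G²/(H + λ). Whether Angel uses exactly this gain is not stated on the slide, and `build_histogram`, `merge` and `best_split` are hypothetical names.

```python
from typing import List, Optional, Sequence, Tuple

Hist = Tuple[List[float], List[float]]  # (grad sums per bin, hess sums per bin)


def build_histogram(samples: Sequence[Tuple[int, float, float]],
                    num_bins: int) -> Hist:
    """samples: (bin index of the feature value, gradient, hessian)."""
    g = [0.0] * num_bins
    h = [0.0] * num_bins
    for b, grad, hess in samples:
        g[b] += grad
        h[b] += hess
    return g, h


def merge(hists: Sequence[Hist]) -> Hist:
    """Element-wise sum of the workers' local histograms."""
    num_bins = len(hists[0][0])
    g = [sum(hist[0][b] for hist in hists) for b in range(num_bins)]
    h = [sum(hist[1][b] for hist in hists) for b in range(num_bins)]
    return g, h


def best_split(g: Sequence[float], h: Sequence[float],
               lam: float = 1.0) -> Tuple[Optional[int], float]:
    """Return (last bin of the left child, gain) for the best split."""
    total_g, total_h = sum(g), sum(h)

    def score(gs: float, hs: float) -> float:
        return gs * gs / (hs + lam)

    best_bin, best_gain = None, 0.0
    left_g = left_h = 0.0
    for b in range(len(g) - 1):          # splitting after the last bin is empty
        left_g += g[b]
        left_h += h[b]
        gain = (score(left_g, left_h)
                + score(total_g - left_g, total_h - left_h)
                - score(total_g, total_h))
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain
```

Only the per-bin sums travel between workers and servers, so the communication cost depends on the number of bins rather than the number of samples.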
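[Spark on Angel] GBDT
[Figure: on the Spark side, Instance RDD, Gradient RDD and Prediction RDD are combined with zip, and a map step feeds the Grad Histogram. On the Angel side, the PS stores InstanceLayout, Grad Histogram, SplitFeature, SplitValue and LeafWeight.]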
[spark_on_angel_gbdt.md]
Spark on Angel vs Spark: LR
Angel vs XGBoost: GBDT
Angel vs Spark: LDA
Angel vs Spark: GD-LR
Angel vs Spark: ADMM-LR
Characteristics of Spark on Angel
Open Source & Perspective
Angel open source
• [GBDT] The purposes of using parameter server in GBDT #7
(PR 60)
Academic innovation
• Papers at top international conferences (CCF Class A)
Version roadmap (What is Next)
V1.3 V1.5 V2.0