Python Manual (Machine Learning) -- LightGBM

Posted on 2024-9-8 00:10:32
## Quick Start

LightGBM (Light Gradient Boosting Machine) is an efficient gradient boosting algorithm, designed mainly to address the problems GBDT runs into on massive datasets, so that it can be used better and faster in industrial practice. For actual modeling, LightGBM can be called from Python, Java, C++ and several other languages, and it provides two interfaces: the sklearn API and the native API. When using the native LightGBM API, the dataset must first be converted into `Dataset`, a special data format defined by the LightGBM library; parameters are then set as a dictionary, and training is finally done with LightGBM's built-in `lgb.train` or `lgb.cv`.

### Data structures

| module | comment |
| --- | --- |
| `lightgbm.Dataset` | Dataset in LightGBM |
| `lightgbm.Booster` | The model returned by LightGBM |
| `lightgbm.CVBooster` | CVBooster in LightGBM |

```python
lightgbm.Dataset(data, label=None, reference=None, weight=None, group=None,
                 init_score=None, feature_name='auto', categorical_feature='auto',
                 params=None, free_raw_data=True)
```

Common parameters:

- `data`: data source of the inner Dataset
- `label`: labels of the data
- `reference`: in LightGBM, a validation Dataset should use the training Dataset as its reference
- `weight`: weight of each sample
- `feature_name` (list of str, or 'auto'): feature names; with the default 'auto', the column names are used when the data is a pandas.DataFrame
- `categorical_feature` (list of str, or 'auto'): names of categorical features
- `free_raw_data`: if True, the raw data is freed after constructing the inner Dataset; set it to False if you want to reuse the Dataset

```python
import lightgbm as lgb
```

Now let's take a quick look at how the native code works.

### Step 1: Load the dataset

```python
# load or create your dataset
# (load_boston was removed in scikit-learn 1.2; fetch_california_housing is used here instead)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# create train dataset for lightgbm
dtrain = lgb.Dataset(X_train, y_train)
# In LightGBM, the validation data should be aligned with training data.
deval = lgb.Dataset(X_test, y_test, reference=dtrain)
# if you want to re-use data, remember to set free_raw_data=False
dtrain = lgb.Dataset(X_train, y_train, free_raw_data=False)
```

LightGBM can use categorical features directly, without one-hot encoding, and it is faster than working on the encoded features (about 8x speed-up):

```python
# Specify feature names and categorical features
dtrain = lgb.Dataset(X_train, y_train, feature_name=['c1', 'c2', 'c3'],
                     categorical_feature=['c3'])
```

### Step 2: Setting parameters

```python
# LightGBM can use a dictionary to set parameters.
# Booster parameters:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
params['metric'] = 'l2'
# You can also specify multiple eval metrics:
params['metric'] = ['l2', 'l1']
```

### Step 3: Training

```python
# Training a model requires a parameter list and a dataset:
bst = lgb.train(params, dtrain, num_boost_round=20, valid_sets=deval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

# Training with 5-fold CV:
lgb.cv(params, dtrain, num_boost_round=20, nfold=5)
```

### Step 4: Save and load the model

```python
# Save model to file:
bst.save_model('model.txt')
bst = lgb.Booster(model_file='model.txt')
# can only predict with the best iteration (or the saving iteration)

# Dump model to JSON:
import json
model_json = bst.dump_model()
with open('model.json', 'w+') as f:
    json.dump(model_json, f, indent=4)

# Dump model with pickle
import pickle
with open('model.pkl', 'wb') as fout:
    pickle.dump(bst, fout)
with open('model.pkl', 'rb') as fin:
    pkl_bst = pickle.load(fin)
# can predict with any iteration when loaded in pickle way
```

### Step 5: Predict

```python
# A model that has been trained or loaded can perform predictions on datasets:
y_pred = bst.predict(X_test)

# If early stopping is enabled during training, you can get predictions from the
# best iteration with bst.best_iteration:
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)
```

### Step 6: Evaluating

```python
from sklearn.metrics import mean_squared_error

rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
print(f'The RMSE of prediction is: {rmse_test}')
```

## Parameters

LightGBM parameters are configured as a dict and passed to the `params` argument of `lightgbm.train` at training time. Below we go through these parameters one by one and explain how to use them.

### Basic parameters

- `task`: type of task. default = `train`, aliases: `task_type`
  - `train`: for training, alias: `training`
  - `predict`: for prediction, aliases: `prediction`, `test`
  - `convert_model`: convert the model file into if-else format
  - `refit`: refit an existing model with new data, alias: `refit_tree`
  - `save_binary`: save the dataset to a binary file
- `boosting`: type of algorithm. default = `gbdt`, aliases: `boosting_type`, `boost`
  - `gbdt`: traditional gradient boosting decision tree, the most commonly used and most stable boosting type. alias: `gbrt`
  - `rf`: random forest, alias: `random_forest`
  - `dart`: Dropouts meet Multiple Additive Regression Trees, a method combining dropout with multiple additive regression trees. On each iteration it randomly drops a subset of trees from the update, which considerably increases the randomness of the model; it suits datasets with a lot of noise, or relatively simple datasets where the overfitting risk needs to be reduced
- `objective`: objective function. str or callable, default = `regression`
  - regression: `regression`, `regression_l1`, `huber`, `fair`, `poisson`, `quantile`, `mape`, `gamma`, `tweedie`
  - classification: `binary`, `multiclass`, `multiclassova`; for multiclass problems the `num_class` parameter should also be set
  - cross entropy: `cross_entropy`, `cross_entropy_lambda`
  - ranking: `lambdarank`, `rank_xendcg`
- `data_sample_strategy`: default = `bagging`
  - `bagging`: random bagging of samples. Note: bagging only takes effect when `bagging_freq > 0` and `bagging_fraction < 1.0`
  - `goss`: gradient-based one-side sampling
- `verbosity`: controls LightGBM's logging level; `> 1` means Debug. alias: `verbose`

### Sample handling parameters

| Name | Description | aliases |
| --- | --- | --- |
| `is_unbalance` | whether the training data is unbalanced; classification only. default False | `unbalance`, `unbalanced_sets` |
| `scale_pos_weight` | weight of positive-class samples; classification only. default 1.0 | |
| `categorical_feature` | names of categorical features, e.g. `categorical_feature=0,1,2` or `categorical_feature=name:c1,c2,c3` | |

### Feature handling parameters

| Name | Description | aliases |
| --- | --- | --- |
| `bin_construct_sample_cnt` | number of samples drawn when binning continuous features (the histogram optimization step); default 200000 | `subsample_for_bin` |
| `saved_feature_importance_type` | how feature importance is computed; default 0, the number of times a feature is chosen as a split feature; optionally 1, the total split gain of the feature | |
| `max_cat_threshold` | maximum number of split points for categorical features; default 32 | |
| `cat_l2` | L2 regularization coefficient for categorical splits; default 10.0 | |
| `cat_smooth` | reduces the effect of noise in categorical features, especially for categories with little data; default 10.0 | |
| `max_cat_to_onehot` | when the number of categories of a feature is less than or equal to `max_cat_to_onehot`, a different (one-vs-other) split algorithm is used | |

### Tree growth

| Name | Description | aliases |
| --- | --- | --- |
| `max_depth` | maximum tree depth; default -1, meaning no limit | |
| `num_leaves` | number of leaves in one tree; default 31 | `num_leaf`, `max_leaves`, `max_leaf`, `max_leaf_nodes` |
| `min_data_in_leaf` | minimal number of samples in one leaf; default 20; larger values can prevent overfitting | `min_data_per_leaf`, `min_data`, `min_child_samples`, `min_samples_leaf` |
| `min_sum_hessian_in_leaf` | minimal sum of hessians in one leaf; default 1e-3; larger values can prevent overfitting | `min_sum_hessian_per_leaf`, `min_sum_hessian`, `min_hessian`, `min_child_weight` |
| `bagging_fraction` | fraction of samples used per iteration; default 1.0. For binary classification, `pos_bagging_fraction` and `neg_bagging_fraction` additionally control the sampling ratios of positive and negative samples | `sub_row`, `subsample`, `bagging` |
| `bagging_freq` | bagging frequency, i.e. perform bagging every k iterations; default 0, meaning no random sampling | `subsample_freq` |
| `feature_fraction` | fraction of features randomly selected on each iteration (each tree built); range (0, 1], default 1.0 | `sub_feature`, `colsample_bytree` |
| `feature_fraction_bynode` | randomly select a subset of features on each tree node; default 1.0 | `sub_feature_bynode`, `colsample_bynode` |
| `extra_trees` | extremely randomized trees; default False; if True, LightGBM selects only one randomly chosen threshold per feature when splitting a node | |
| `min_gain_to_split` | minimal gain required to perform a further split; default 0, meaning no limit | `min_split_gain` |

Note: `feature_fraction` is not affected by `subsample_freq`. Also note that LightGBM differs from random forest here: random forest randomly draws features at every split of every tree, while LightGBM draws one random feature subset per tree built, and every split made while that tree grows uses this subset.

### Model training

| Name | Description | aliases |
| --- | --- | --- |
| `data` | dataset used for training | `train`, `train_data`, `train_data_file`, `data_filename` |
| `valid` | validation/test data; multiple validation sets are supported, separated by commas | `test`, `valid_data`, `valid_data_file`, `test_data`, `test_data_file`, `valid_filenames` |
| `num_iterations` | number of boosting iterations, i.e. the number of base learners; default 100. Note: for multiclass problems, the number of trees equals `num_class * num_iterations` | `num_iteration`, `n_iter`, `num_tree`, `num_trees`, `num_round`, `num_rounds`, `nrounds`, `num_boost_round`, `n_estimators`, `max_iter` |
| `learning_rate` | learning rate, i.e. the shrinkage step of each boosting iteration; default 0.1 | `shrinkage_rate`, `eta` |
| `lambda_l1` | L1 regularization coefficient; default 0 | `reg_alpha`, `l1_regularization` |
| `lambda_l2` | L2 regularization coefficient; default 0 | `reg_lambda`, `lambda`, `l2_regularization` |
| `metric` | evaluation metric(s); default `""` | `metrics`, `metric_types` |
| `min_data_per_group` | minimal number of data points per categorical group; default 100 | |
| `input_model` | for the `predict` task, this model is used for prediction; for the `train` task, training continues from this model | `model_input`, `model_in` |

### Loss function

$$
Obj_k = \sum_{i=1}^N l(y_i,\hat{y}_i) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 + \alpha\sum_{j=1}^T |w_j|
$$

where $T$ is the total number of leaves on the current ($k$-th) tree, and $w_j$ is the leaf weight of the $j$-th leaf on that tree, i.e. the predicted value of leaf $j$. There are two regularization terms: the squared $\ell_2$ term and the absolute-value $\ell_1$ term.

Some parameters are passed as arguments when training with `lightgbm.train`:

```python
lightgbm.train(params, train_set, num_boost_round=100, valid_sets=None,
               valid_names=None, feval=None, init_model=None, feature_name='auto',
               categorical_feature='auto', keep_training_booster=False, callbacks=None)

lightgbm.cv(params, train_set, num_boost_round=100, folds=None, nfold=5,
            stratified=True, shuffle=True, metrics=None, feval=None, init_model=None,
            feature_name='auto', categorical_feature='auto', fpreproc=None, seed=0,
            callbacks=None, eval_train_metric=False, return_cvbooster=False)
```

Note: values passed through `params` (dict) take precedence over values supplied as keyword arguments. `feval` is used to define a custom evaluation function:

```python
import numpy as np

# self-defined eval metric
# f(preds: array, train_data: Dataset) -> name: str, eval_result: float, is_higher_better: bool
# Relative Absolute Error (RAE)
def rae(preds, train_data):
    labels = train_data.get_label()
    return 'RAE', np.sum(np.abs(preds - labels)) / np.sum(np.abs(np.mean(labels) - labels)), False

# Start training with the custom eval function
lgb.train(params, dtrain, valid_sets=[dtrain, deval], feval=rae,
          callbacks=[lgb.early_stopping(5)])
```

Note: custom evaluation functions in the sklearn API are different: `f(y_true, y_pred) -> name, eval_result, is_higher_better`, passed to the `eval_metric` parameter when calling `fit`.

### Callbacks

The `callbacks` parameter is a list of callback functions applied at each iteration.

| Method | Description |
| --- | --- |
| `lightgbm.early_stopping(stopping_rounds)` | early stopping: stop training when the validation score has not improved for the given number of rounds, to control the risk of overfitting |
| `lightgbm.log_evaluation([period, show_stdv])` | how often evaluation results are logged |
| `lightgbm.record_evaluation(eval_result)` | record evaluation results in `eval_result` |
| `lightgbm.reset_parameter(**kwargs)` | reset parameters after the first iteration |

LightGBM can implement learning rate decay by passing the learning rate through `reset_parameter` in `callbacks`. The learning rate accepts two kinds of arguments: a list of length `num_boost_round`, or a function of the current iteration, `function(curr_iter)`.

```python
# reset_parameter callback accepts:
# 1. list with length = num_boost_round
# 2. function(curr_iter)
bst = lgb.train(params, dtrain, num_boost_round=10, init_model=bst, valid_sets=deval,
                callbacks=[lgb.reset_parameter(learning_rate=lambda iter: 0.05 * (0.99 ** iter))])

# change other parameters during training
bst = lgb.train(params, dtrain, num_boost_round=10, init_model=bst, valid_sets=deval,
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
```

## Scikit-Learn API

LightGBM's sklearn API lets you train LGBM models in sklearn's calling style and idioms. For data loading it supports local NumPy or Pandas data directly; for training you first instantiate an estimator and set its hyperparameters, then train via `.fit`. You can call grid search for hyperparameter tuning directly, and use the other high-level tools sklearn provides, such as building machine-learning pipelines, performing feature selection, or model ensembling. Overall, LGBM's sklearn API is lighter and more convenient, connects seamlessly to sklearn's other estimators, and quickly unlocks sklearn's high-level features, which makes it very friendly for users familiar with sklearn. The native API is considerably more complex, but provides many advanced capabilities the sklearn API cannot; used well, it can produce more precise models and more efficient workflows than the sklearn API.

| module | comment |
| --- | --- |
| `LGBMModel` | Implementation of the scikit-learn API for LightGBM. |
| `LGBMClassifier` | LightGBM classifier. |
| `LGBMRegressor` | LightGBM regressor. |
| `LGBMRanker` | LightGBM ranker. |

`LGBMModel` is LightGBM's base model class: a generic model class usable for all kinds of problems (classification, regression, etc.). Usually we do not use `LGBMModel` directly, but rather the task-specific subclasses: `LGBMClassifier` for classification, `LGBMRegressor` for regression, and `LGBMRanker` for ranking. Taking `LGBMClassifier` as an example, the default parameters are:

```python
LGBMClassifier(boosting_type: str = 'gbdt', num_leaves: int = 31, max_depth: int = -1,
               learning_rate: float = 0.1, n_estimators: int = 100,
               subsample_for_bin: int = 200000,
               objective: Union[str, Callable, NoneType] = None,
               class_weight: Union[Dict, str, NoneType] = None,
               min_split_gain: float = 0.0, min_child_weight: float = 0.001,
               min_child_samples: int = 20, subsample: float = 1.0,
               subsample_freq: int = 0, colsample_bytree: float = 1.0,
               reg_alpha: float = 0.0, reg_lambda: float = 0.0,
               random_state: Union[int, numpy.random.mtrand.RandomState, NoneType] = None,
               n_jobs: int = -1, silent: Union[bool, str] = 'warn',
               importance_type: str = 'split', **kwargs)
```

The training process is the same as for other sklearn models: train with `fit` and output results with `predict`:

```python
import lightgbm as lgb
# (load_boston was removed in scikit-learn 1.2; fetch_california_housing is used here instead)
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: load or create your dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Step 2: Training
gbm = lgb.LGBMRegressor(num_leaves=31, learning_rate=0.05, n_estimators=20)
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='l1',
        callbacks=[lgb.early_stopping(5)])

# Step 3: Predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# (for classifiers, predict_proba(X_test, num_iteration=...) returns class probabilities)

# Step 4: Evaluate
rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
print(f'The RMSE of prediction is: {rmse_test}')

# feature importances
print(f'Feature importances: {list(gbm.feature_importances_)}')
```

It connects seamlessly with other sklearn utilities:

```python
# other scikit-learn modules
from sklearn.model_selection import GridSearchCV

param_grid = {'learning_rate': [0.01, 0.1, 1], 'n_estimators': [20, 40]}
estimator = lgb.LGBMRegressor(num_leaves=31)
gbm = GridSearchCV(estimator, param_grid, cv=3)
gbm.fit(X_train, y_train)
print(f'Best parameters found by grid search are: {gbm.best_params_}')
```

## Custom loss functions

### Native API

In `lgb.train`, the loss function and evaluation function are customized through `params["objective"]` and `feval`. Older versions of lightgbm passed the custom loss through the `fobj` argument of `lgb.train`; in recent versions it is passed directly via `objective` when configuring `params`, and `fobj` is deprecated (see advanced_example.py).

Notes:

- In LightGBM, a custom loss function must return the first derivative (grad) and second derivative (hess) of the loss.
- With a custom loss, the model output is no longer a probability in [0, 1] but the raw score before the sigmoid transformation.
- Since the output changes, a matching custom evaluation function must be written.
- With a custom loss, LightGBM's default `boost_from_average=True` no longer works. Under the GBDT framework, for a binary classification problem optimized with log loss, the initial score is the mean of the training labels; with a custom loss the system cannot derive this initial value, which slows down convergence. It can be supplied manually via the `init_score` parameter when constructing `lgb.Dataset`.
- With a custom loss, the model output must be transformed with the sigmoid function manually.

Loss function: `f(preds, train_data) -> grad, hess`, configured in the `params` dict of `lgb.train` via the `objective` key.
Evaluation function: `f(preds, train_data) -> name, eval_result, is_higher_better`, passed through the `feval` argument of `lgb.train`.

```python
# NOTE: when you do customized loss function, the default prediction value is margin
# This may make built-in evaluation metrics calculate wrong results
# For example, we are doing log-likelihood loss, the prediction is score before logistic transformation
# Keep this in mind when you use the customization

import numpy as np
from scipy import special

# self-defined objective function
# f(preds: array, train_data: Dataset) -> grad: array, hess: array
# log-likelihood loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = special.expit(preds)
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

# Pass the custom objective function through params
params = {"objective": loglikelihood}

# self-defined eval metric
# f(preds: array, train_data: Dataset) -> name: str, eval_result: float, is_higher_better: bool
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = special.expit(preds)
    return "error", np.mean(labels != (preds > 0.5)), False

gbm = lgb.train(params, train_data, num_boost_round=100,
                feval=binary_error, valid_sets=test_data)
y_pred = special.expit(gbm.predict(X_test))
```

### sklearn API

Customizing loss and evaluation functions in the sklearn API differs from the native interface.

Loss function: `f(y_true, y_pred) -> grad, hess`, passed via the `objective` parameter when instantiating the sklearn model.
Evaluation function: `f(y_true, y_pred) -> name, eval_result, is_higher_better`, passed to the `eval_metric` parameter when calling `fit`.

```python
import numpy as np
from scipy import special
from lightgbm import LGBMModel

# self-defined objective function
# f(y_true: array, y_pred: array) -> grad: array, hess: array
# log-likelihood loss
def loglikelihood(y_true, y_pred):
    y_pred = special.expit(y_pred)
    grad = y_pred - y_true
    hess = y_pred * (1.0 - y_pred)
    return grad, hess

# Pass the custom objective function through objective
model = LGBMModel(objective=loglikelihood, n_estimators=100)

# self-defined eval metric
# f(y_true: array, y_pred: array) -> name: str, eval_result: float, is_higher_better: bool
def binary_error(y_true, y_pred):
    y_pred = special.expit(y_pred)
    return "error", np.mean(y_true != (y_pred > 0.5)), False

model.fit(X_train, y_train, eval_metric=binary_error, eval_set=[(X_test, y_test)])
y_pred = special.expit(model.predict(X_test))
```

## Visualization

| module | comment |
| --- | --- |
| `plot_importance(booster)` | plot the model's feature importances |
| `plot_split_value_histogram(booster, feature)` | plot the split value histogram for a given feature of the model |
| `plot_metric(booster)` | plot the model's metric during training |
| `plot_tree(booster)` | plot a specified tree |
| `create_tree_digraph(booster)` | create a digraph of a specified tree |

```python
import matplotlib.pyplot as plt

evals_result = {}  # to record eval results for plotting
gbm = lgb.train(params, dtrain, num_boost_round=100, valid_sets=[dtrain, deval],
                callbacks=[lgb.log_evaluation(10), lgb.record_evaluation(evals_result)])

# Plot metrics recorded during training
ax = lgb.plot_metric(evals_result, metric='l1')
plt.show()

# Plot feature importances
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

# Plot split value histogram
ax = lgb.plot_split_value_histogram(gbm, feature='f26', bins='auto')
plt.show()

# Plot 54th tree (one tree uses a categorical feature to split)
ax = lgb.plot_tree(gbm, tree_index=53, figsize=(15, 15), show_info=['split_gain'])
plt.show()

# Plot 54th tree with graphviz
graph = lgb.create_tree_digraph(gbm, tree_index=53, name='Tree54')
graph.render(view=True)
```

## Continued training

lightGBM offers two ways of incremental learning:

1. The `init_model` argument: if `init_model` is not None, training continues from that model, adding `num_boost_round` new trees.

```python
# init_model accepts:
# 1. model file name
# 2. Booster()
bst = lgb.train(previous_params, new_data, num_boost_round=10,
                init_model=previous_model, valid_sets=eval_data,
                keep_training_booster=True)
```

The `keep_training_booster` (bool) argument indicates whether the returned model (booster) will be used to keep training; default False. When the model is very large and causes memory errors, you can try setting this parameter to True to avoid the model_to_string conversion. The returned booster can then still be used as `init_model` for future continued training.

2. The `refit` method: keeping the tree structure of the existing model unchanged, refit new data to update the leaf weights.

```python
# configure in the parameter dict
params = {
    'task': 'refit',
    'refit_decay_rate': 0.9,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc'
}
bst = lgb.train(params, dtrain, num_boost_round=20, valid_sets=[dtrain, deval])

# refit with the returned model (Booster)
bst.refit(data=X_train, label=y_train, decay_rate=0.9, reference=None)
```

`refit_decay_rate` controls the decay of leaf outputs in the refit task. After refitting, the output of each leaf is computed as

```
leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output
```

## Distributed learning

LGBM also provides a distributed computing version and GPU versions for accelerated computation. In distributed mode it supports reading and processing data from HDFS (Hadoop Distributed File System); for GPUs there is a GPU version (using OpenCL, the Open Computing Language, to accelerate on many different GPUs) and a CUDA version (using CUDA, the Compute Unified Device Architecture, for NVIDIA GPUs). Unlike deep learning, which tends to prefer CUDA, the CUDA version of LGBM currently works only on Linux, so in most cases the GPU version, which also supports Windows, is chosen for GPU acceleration.

LightGBM currently provides three distributed learning algorithms:

| Parallel Algorithm | How to Use |
| --- | --- |
| Data parallel | `tree_learner=data` |
| Feature parallel | `tree_learner=feature` |
| Voting parallel | `tree_learner=voting` |

These algorithms suit different scenarios:

|  | data is small | data is large |
| --- | --- | --- |
| feature is small | Feature Parallel | Data Parallel |
| feature is large | Feature Parallel | Voting Parallel |

The `tree_learner` parameter controls the distributed learning method. default = `serial`, aliases: `tree`, `tree_type`, `tree_learner_type`

- `serial`: single-machine learning
- `feature`: feature parallel, alias: `feature_parallel`
- `data`: data parallel, alias: `data_parallel`
- `voting`: voting parallel, alias: `voting_parallel`

## LightGBM with PySpark

To use LightGBM on Spark, you need to install the SynapseML package (formerly MMLSpark), developed and maintained by Microsoft. SynapseML is built on the Apache Spark distributed computing framework and shares the same API as the SparkML/MLLib libraries, allowing you to seamlessly embed SynapseML models into existing Apache Spark workflows.

Installing SynapseML in Python: assuming PySpark is already installed, configuring it via `pyspark.sql.SparkSession` automatically downloads and installs it onto the existing Spark cluster:

```python
import pyspark

# Use 0.11.4-spark3.3 version for Spark 3.3 and 1.0.2 version for Spark 3.4
spark = (pyspark.sql.SparkSession.builder.appName("MyApp")
         .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.2")
         .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
         .getOrCreate())

import synapse.ml
```

Or configure the `--packages` option when launching Spark:

```
# Use 0.11.4-spark3.3 version for Spark 3.3 and 1.0.2 version for Spark 3.4
spark-shell --packages com.microsoft.azure:synapseml_2.12:1.0.2
pyspark --packages com.microsoft.azure:synapseml_2.12:1.0.2
spark-submit --packages com.microsoft.azure:synapseml_2.12:1.0.2 MyApp.jar
```

The package is fairly large, so the first installation takes a while.

Algorithms:

- `LightGBMClassifier`: builds classification models. For example, to predict whether a company will go bankrupt, we can build a binary classification model with LightGBMClassifier.
- `LightGBMRegressor`: builds regression models. For example, to predict house prices, we can build a regression model with LightGBMRegressor.
- `LightGBMRanker`: builds ranking models. For example, to predict the relevance of website search results, we can build a ranking model with LightGBMRanker.

In PySpark, you can run LightGBMClassifier like this:

```python
from synapse.ml.lightgbm import LightGBMClassifier

model = LightGBMClassifier(learningRate=0.3, numIterations=100, numLeaves=31).fit(train)
```

LightGBM has many more parameters than SynapseML exposes; to add extra parameters, configure them with the `passThroughArgs` string parameter:

```python
from synapse.ml.lightgbm import LightGBMClassifier

model = LightGBMClassifier(
    passThroughArgs="force_row_wise=true min_sum_hessian_in_leaf=2e-3",
    numIterations=100, numLeaves=31).fit(train)
```

You can mix `passThroughArgs` and explicit args, as shown in the example. SynapseML merges them to create one parameter string to send to LightGBM. If you set a parameter in both places, `passThroughArgs` takes precedence.

Example:

```python
# Read dataset
from synapse.ml.core.platform import *

df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv"))
# print dataset size
print("records read: " + str(df.count()))
print("Schema:")
df.printSchema()
display(df)

# Split the dataset into train and test
train, test = df.randomSplit([0.85, 0.15], seed=1)

# Add featurizer to convert features to vector
from pyspark.ml.feature import VectorAssembler

feature_cols = df.columns[1:]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train)["Bankrupt?", "features"]
test_data = featurizer.transform(test)["Bankrupt?", "features"]

# Check if the data is unbalanced
display(train_data.groupBy("Bankrupt?").count())

# Model Training
from synapse.ml.lightgbm import LightGBMClassifier

model = LightGBMClassifier(objective="binary", featuresCol="features",
                           labelCol="Bankrupt?", isUnbalance=True)
model = model.fit(train_data)

# "saveNativeModel" allows you to extract the underlying lightGBM model
# for fast deployment after you train on Spark.
from synapse.ml.lightgbm import LightGBMClassificationModel

if running_on_synapse():
    model.saveNativeModel("/models/lgbmclassifier.model")
    model = LightGBMClassificationModel.loadNativeModelFromFile("/models/lgbmclassifier.model")
elif running_on_synapse_internal():
    model.saveNativeModel("Files/models/lgbmclassifier.model")
    model = LightGBMClassificationModel.loadNativeModelFromFile("Files/models/lgbmclassifier.model")
else:
    model.saveNativeModel("/tmp/lgbmclassifier.model")
    model = LightGBMClassificationModel.loadNativeModelFromFile("/tmp/lgbmclassifier.model")

# Feature Importances Visualization
import pandas as pd
import matplotlib.pyplot as plt

feature_importances = model.getFeatureImportances()
fi = pd.Series(feature_importances, index=feature_cols)
fi = fi.sort_values(ascending=True)
f_index = fi.index
f_values = fi.values

# print feature importances
print("f_index:", f_index)
print("f_values:", f_values)

# plot
x_index = list(range(len(fi)))
x_index = [x / len(fi) for x in x_index]
plt.rcParams["figure.figsize"] = (20, 20)
plt.barh(x_index, f_values, height=0.028, align="center", color="tan", tick_label=f_index)
plt.xlabel("importances")
plt.ylabel("features")
plt.show()

# Model Prediction
predictions = model.transform(test_data)
predictions.limit(10).toPandas()

from synapse.ml.train import ComputeModelStatistics

metrics = ComputeModelStatistics(
    evaluationMetric="classification",
    labelCol="Bankrupt?",
    scoredLabelsCol="prediction",
).transform(predictions)
display(metrics)
```
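The custom objective pattern above returns analytic first and second derivatives (grad, hess) of the log-likelihood loss. Before handing such a pair to `lgb.train`, it can be sanity-checked against finite differences of the loss; the sketch below is a standalone NumPy/SciPy check (no LightGBM required, random toy data) mirroring the `loglikelihood` formulas shown earlier:

```python
import numpy as np
from scipy import special

def logloss(margin, label):
    # binary log loss as a function of the raw margin (the score before the sigmoid)
    p = special.expit(margin)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def grad_hess(margin, label):
    # analytic first/second derivatives, same formulas as the custom objective
    p = special.expit(margin)
    return p - label, p * (1.0 - p)

rng = np.random.default_rng(0)
margins = rng.normal(size=100)
labels = (rng.random(100) > 0.5).astype(float)

grad, hess = grad_hess(margins, labels)

# central finite differences of the loss with respect to the margin
eps_g, eps_h = 1e-6, 1e-4
num_grad = (logloss(margins + eps_g, labels) - logloss(margins - eps_g, labels)) / (2 * eps_g)
num_hess = (logloss(margins + eps_h, labels) - 2 * logloss(margins, labels)
            + logloss(margins - eps_h, labels)) / eps_h ** 2

assert np.allclose(grad, num_grad, atol=1e-6)
assert np.allclose(hess, num_hess, atol=1e-4)
print("grad/hess match finite differences")
```

The same check works for any custom objective: only `logloss` and `grad_hess` need to be swapped out.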
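As a worked example of the regularized objective $Obj_k$ above, here is a toy computation with made-up leaf weights, per-sample losses, and penalty coefficients (all numbers hypothetical, chosen only to make the four terms concrete):

```python
import numpy as np

# toy tree: T = 3 leaves with weights w_j; N = 4 samples with losses l(y_i, y_hat_i)
w = np.array([0.5, -0.2, 0.1])            # hypothetical leaf weights (leaf predictions)
losses = np.array([0.3, 0.1, 0.4, 0.2])   # hypothetical per-sample losses
gamma, lam, alpha = 1.0, 1.0, 0.5         # hypothetical penalty coefficients

T = len(w)
obj = (losses.sum()                    # sum of per-sample losses
       + gamma * T                     # leaf-count penalty: gamma * T
       + 0.5 * lam * np.sum(w ** 2)    # l2 term on leaf weights
       + alpha * np.sum(np.abs(w)))    # l1 term on leaf weights

print(round(obj, 2))  # -> 4.55
```

Increasing `gamma`, `lam`, or `alpha` raises the objective for the same tree, which is how the penalties discourage large trees and large leaf weights.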