
LightGBM (Light Gradient Boosting Machine) is an open-source framework from Microsoft that implements the GBDT algorithm and supports efficient parallel training. Its main advantages:

  • Faster training speed

  • Lower memory consumption

  • Better accuracy

  • Distributed support, so it can handle massive datasets quickly

 

LightGBM is a gradient boosting framework that uses tree-based learning algorithms.

LightGBM grows trees leaf-wise (vertically), whereas most other GBDT implementations grow level-wise (horizontally): they add an entire level of the tree at a time, while LightGBM adds one leaf at a time. At each step LightGBM splits the leaf with the largest loss reduction, so with the same number of leaves, leaf-wise growth can reduce the loss more than level-wise growth. The trade-off is that an unconstrained leaf-wise tree can grow very deep and overfit small datasets.
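Because the leaf budget rather than the depth is the primary capacity control under leaf-wise growth, num_leaves and max_depth are usually set together (the LightGBM docs suggest keeping num_leaves below 2**max_depth). A minimal sketch of that rule of thumb; the helper function is my own, not part of LightGBM:

# Rule of thumb: under leaf-wise growth, keep num_leaves well below 2**max_depth,
# otherwise the depth cap has no effect and overfitting risk rises.
def leafwise_params(max_depth: int, shrink: float = 0.7) -> dict:
    """Hypothetical helper pairing num_leaves with a given max_depth."""
    return {
        'max_depth': max_depth,
        'num_leaves': max(2, int(shrink * 2 ** max_depth)),
    }

print(leafwise_params(6))  # {'max_depth': 6, 'num_leaves': 44}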

 

Using LightGBM natively (import lightgbm as lgb). Note that iris is a three-class dataset; the example below treats its integer labels as a regression target purely for demonstration.

import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the data
iris = load_iris()
data = iris.data
target = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
print("Train data length:", len(X_train))
print("Test data length:", len(X_test))

# Convert to LightGBM's Dataset format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Training parameters
params = {
    'task': 'train',
    'boosting_type': 'gbdt',  # boosting type
    'objective': 'regression',  # objective function
    'metric': {'l2', 'auc'},  # evaluation metrics
    'num_leaves': 31,  # maximum number of leaves per tree
    'learning_rate': 0.05,  # learning rate
    'feature_fraction': 0.9,  # fraction of features sampled per tree
    'bagging_fraction': 0.8,  # fraction of data sampled for bagging
    'bagging_freq': 5,  # perform bagging every k iterations (here k = 5)
    'verbose': 1  # <0: fatal only, =0: errors (warnings), >0: info
}

# Train the model (early_stopping_rounds is the pre-4.0 API; see the note after the output)
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=lgb_eval, early_stopping_rounds=5)

# Save the model
gbm.save_model('model.txt')

# Load the model
gbm = lgb.Booster(model_file='model.txt')

# Predict on the test set, using the best iteration found by early stopping
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

# Evaluate with RMSE
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
'''
Train data length: 120
Test data length: 30
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000019 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 89
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score 1.016667
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1]	valid_0's auc: 0.9775	valid_0's l2: 0.548619
Training until validation scores don't improve for 5 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2]	valid_0's auc: 1	valid_0's l2: 0.500157
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3]	valid_0's auc: 1	valid_0's l2: 0.454786
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4]	valid_0's auc: 1	valid_0's l2: 0.414112
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5]	valid_0's auc: 1	valid_0's l2: 0.377665
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[6]	valid_0's auc: 1	valid_0's l2: 0.346867
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[7]	valid_0's auc: 1	valid_0's l2: 0.319188
Early stopping, best iteration is:
[2]	valid_0's auc: 1	valid_0's l2: 0.500157
The rmse of prediction is: 0.7072175933903914
'''
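Note: the early_stopping_rounds keyword of lgb.train was removed in LightGBM 4.0 in favor of callbacks. A sketch of the equivalent call, assuming LightGBM >= 4.0 and reusing params, lgb_train, and lgb_eval from the listing above:

# Equivalent training call for LightGBM >= 4.0: early stopping and
# evaluation logging are supplied as callbacks, not keyword arguments.
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=20,
    valid_sets=[lgb_eval],
    callbacks=[
        lgb.early_stopping(stopping_rounds=5),  # stop after 5 rounds without improvement
        lgb.log_evaluation(period=1),           # print evaluation results every round
    ],
)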

 

Using LightGBM through the sklearn interface (from lightgbm import LGBMRegressor)

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import joblib

# Load the data
iris = load_iris()
data = iris.data
target = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Train the model
gbm = LGBMRegressor(objective='regression', num_leaves=31, learning_rate=0.05, n_estimators=20)
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='l1', early_stopping_rounds=5)

# Persist the model with joblib
joblib.dump(gbm, 'loan_model.pkl')
# Load the model back
gbm = joblib.load('loan_model.pkl')

# Predict on the test set, using the best iteration found by early stopping
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

# Evaluate with RMSE
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

# Feature importances
print('Feature importances:', list(gbm.feature_importances_))

# Grid search for hyperparameter tuning
estimator = LGBMRegressor(num_leaves=31)
param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}
gbm = GridSearchCV(estimator, param_grid)
gbm.fit(X_train, y_train)
print('Best parameters found by grid search are:', gbm.best_params_)
'''
[1]	valid_0's l1: 0.564611	valid_0's l2: 0.53568
Training until validation scores don't improve for 5 rounds
[2]	valid_0's l1: 0.541868	valid_0's l2: 0.492686
[3]	valid_0's l1: 0.520262	valid_0's l2: 0.45387
[4]	valid_0's l1: 0.499592	valid_0's l2: 0.419784
[5]	valid_0's l1: 0.475829	valid_0's l2: 0.383425
[6]	valid_0's l1: 0.457481	valid_0's l2: 0.354883
[7]	valid_0's l1: 0.436038	valid_0's l2: 0.324898
[8]	valid_0's l1: 0.419327	valid_0's l2: 0.302255
[9]	valid_0's l1: 0.399978	valid_0's l2: 0.27748
[10]	valid_0's l1: 0.385154	valid_0's l2: 0.258424
[11]	valid_0's l1: 0.37125	valid_0's l2: 0.240029
[12]	valid_0's l1: 0.359304	valid_0's l2: 0.225339
[13]	valid_0's l1: 0.344684	valid_0's l2: 0.208233
[14]	valid_0's l1: 0.332142	valid_0's l2: 0.194488
[15]	valid_0's l1: 0.320227	valid_0's l2: 0.182062
[16]	valid_0's l1: 0.310099	valid_0's l2: 0.169595
[17]	valid_0's l1: 0.30074	valid_0's l2: 0.16047
[18]	valid_0's l1: 0.29047	valid_0's l2: 0.151185
[19]	valid_0's l1: 0.280713	valid_0's l2: 0.142789
[20]	valid_0's l1: 0.270687	valid_0's l2: 0.133844
Did not meet early stopping. Best iteration is:
[20]	valid_0's l1: 0.270687	valid_0's l2: 0.133844
The rmse of prediction is: 0.36584694593602285
Feature importances: [9, 6, 44, 10]
Best parameters found by grid search are: {'learning_rate': 0.1, 'n_estimators': 40}

'''
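The same caveat applies to the sklearn interface: LightGBM 4.0 removed early_stopping_rounds from fit(), and newer versions take a callbacks argument instead. A sketch, assuming LightGBM >= 4.0 and reusing the train/test splits from above:

import lightgbm as lgb

# Equivalent fit() for LightGBM >= 4.0: early stopping moves into callbacks.
gbm = LGBMRegressor(objective='regression', num_leaves=31,
                    learning_rate=0.05, n_estimators=20)
gbm.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric='l1',
    callbacks=[lgb.early_stopping(stopping_rounds=5)],
)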

 

eval_metric: [default = chosen according to the objective]
 rmse: root mean squared error
 mae: mean absolute error
 logloss: negative log-likelihood
 error: binary classification error rate = #misclassified / #total; a prediction > 0.5 is counted as positive, everything else as negative. error@t: a different decision threshold can be set via 't'
 merror: multi-class error rate = #misclassified / #total
 mlogloss: multi-class log loss
 auc: area under the ROC curve
 map: mean average precision
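In the sklearn interface, eval_metric also accepts a custom function returning (name, value, is_higher_better), alone or mixed with built-in metric names in a list. A minimal sketch reusing gbm and the splits from above; the function name rmse_metric is my own:

import numpy as np

# Custom metric for the sklearn interface: must return a tuple of
# (metric_name, value, is_higher_better).
def rmse_metric(y_true, y_pred):
    return 'rmse_custom', float(np.sqrt(np.mean((y_true - y_pred) ** 2))), False

gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric=['l1', rmse_metric])  # built-in name plus custom metric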

 

Tuning 1, improve accuracy: num_leaves, max_depth, learning_rate

Tuning 2, reduce overfitting: max_bin, min_data_in_leaf

Tuning 3, reduce overfitting: L1/L2 regularization (lambda_l1, lambda_l2)

Tuning 4, reduce overfitting: row subsampling (bagging_fraction, bagging_freq) and column subsampling (feature_fraction); a combined sketch follows below
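A sketch combining the four groups of knobs into one parameter set; every value below is an illustrative starting point, not a tuned recommendation:

# Illustrative configuration covering accuracy and anti-overfitting knobs.
params = {
    'objective': 'regression',
    # 1. accuracy
    'num_leaves': 63,
    'max_depth': 7,
    'learning_rate': 0.05,
    # 2. histogram / leaf constraints
    'max_bin': 127,           # fewer bins -> coarser splits, less overfitting
    'min_data_in_leaf': 20,   # each leaf must cover at least 20 samples
    # 3. regularization
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    # 4. row and column sampling
    'bagging_fraction': 0.8,  # row subsampling
    'bagging_freq': 5,
    'feature_fraction': 0.8,  # column subsampling
}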
