sklearn – gitweixin

sklearn May 16, 2022

After random-forest feature ranking, feature_importances is all 0 and normalized_importance and cumulative_importance are all NaN

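While working on a credit scorecard, I used random-forest ranking for feature selection. By eye, some of the factors looked strongly correlated, yet after running the code below df_train_woe came out empty, which left me baffled.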

from feature_selector import FeatureSelector

fs = FeatureSelector(data=df_train_woe[sel_var], labels=data_train_bin.target)
# Flag every feature that fails any of the checks in a single pass
fs.identify_all(selection_params={'missing_threshold': 0.9,
                                  'correlation_threshold': 0.8,
                                  'task': 'classification',
                                  'eval_metric': 'binary_error',
                                  'max_depth': 2,
                                  'cumulative_importance': 0.90})
# Drop the flagged features, then re-attach the target column
df_train_woe = fs.remove(methods='all')
df_train_woe['target'] = data_train_bin.target

Stepping through feature_selector.py in a debugger, I found that inside the identify_zero_importance method every value in feature_importances was 0, and every value of normalized_importance and cumulative_importance was NaN.
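The NaN values follow directly from the zeros. Here is a minimal sketch of the arithmetic, assuming (not checked against the exact version of feature_selector.py in use) that normalized_importance is each importance divided by the total and that cumulative_importance is a running sum of the normalized values. The three-element array is made-up illustration data.

```python
import numpy as np

# Illustrative values: every feature received zero importance.
importance = np.array([0.0, 0.0, 0.0])

# ASSUMPTION: normalization divides each importance by the total.
# 0.0 / 0.0 is NaN under IEEE 754; NumPy only emits a warning, silenced here.
with np.errstate(invalid="ignore"):
    normalized_importance = importance / importance.sum()

# A running sum over NaN values is NaN as well.
cumulative_importance = np.cumsum(normalized_importance)

# Every comparison involving NaN is False, so no feature can be shown
# to reach the 0.90 cumulative-importance threshold.
reaches_threshold = cumulative_importance > 0.90

print(normalized_importance, cumulative_importance, reaches_threshold)
```

If the selector then treats every zero-importance feature as removable, all columns are dropped, which would match the empty df_train_woe described above.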

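After checking the code repeatedly and finding nothing wrong with it, I began to suspect that the sample was simply too small (only a few dozen rows). I increased the sample to a few hundred rows, and sure enough the run produced sensible results. This is a real pitfall, so I am recording it here for now; when I have time I will dig deeper into the code and the theory behind it.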
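One possible explanation, not verified against the data in this post: to my knowledge FeatureSelector trains a LightGBM model to compute importances, and LightGBM's min_child_samples defaults to 20, so a tree needs roughly 40 or more training rows before any split is allowed. With only a few dozen rows (fewer still after any validation split), no split is made and every importance stays 0. The sketch below shows a guard that could be run before and after feature selection. Both function names are hypothetical helpers, not part of feature_selector.py, and the default of 20 is an assumption mirroring LightGBM's default.

```python
# Hypothetical helpers, not part of feature_selector.py.

def can_split(n_rows, min_leaf_samples=20):
    """Return True if n_rows can be divided into two leaves of the minimum size.

    ASSUMPTION: min_leaf_samples=20 mirrors LightGBM's default min_child_samples.
    """
    return n_rows >= 2 * min_leaf_samples


def normalize_importances(importances):
    """Normalize importances, failing loudly instead of producing NaN from 0/0."""
    total = sum(importances)
    if total == 0:
        raise ValueError(
            "all feature importances are zero; the model made no splits "
            "(is the sample too small?)"
        )
    return [value / total for value in importances]


print(can_split(30))    # a few dozen rows
print(can_split(300))   # a few hundred rows
print(normalize_importances([1, 3]))
```

Failing loudly at the normalization step surfaces the real problem (no splits) rather than an empty DataFrame several steps later.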

Author: east

Category: sklearn