Add new features 07 27

Commit cb1c5b5d authored Jul 27, 2019 by 舒皓月
12 changed files
--- a/.gitignore
+++ b/.gitignore
+tmp.py
+test.py
\ No newline at end of file
--- a/.idea/misc.xml
+++ b/.idea/misc.xml
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="ProjectRootManager" version="2" project-jdk-name="Python 3.7 (data_test_work_space)" project-jdk-type="Python SDK" />
+</project>
\ No newline at end of file
--- a/.idea/model_monitor.iml
+++ b/.idea/model_monitor.iml
+<?xml version="1.0" encoding="UTF-8"?>
+<module type="PYTHON_MODULE" version="4">
+  <component name="NewModuleRootManager">
+    <content url="file://$MODULE_DIR$" />
+    <orderEntry type="inheritedJdk" />
+    <orderEntry type="sourceFolder" forTests="false" />
+  </component>
+  <component name="TestRunnerService">
+    <option name="projectConfiguration" value="pytest" />
+    <option name="PROJECT_TEST_RUNNER" value="pytest" />
+  </component>
+</module>
\ No newline at end of file
--- a/.idea/modules.xml
+++ b/.idea/modules.xml
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="ProjectModuleManager">
+    <modules>
+      <module fileurl="file://$PROJECT_DIR$/.idea/model_monitor.iml" filepath="$PROJECT_DIR$/.idea/model_monitor.iml" />
+    </modules>
+  </component>
+</project>
\ No newline at end of file
--- a/.idea/vcs.xml
+++ b/.idea/vcs.xml
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="VcsDirectoryMappings">
+    <mapping directory="$PROJECT_DIR$" vcs="Git" />
+  </component>
+</project>
\ No newline at end of file
--- a/.idea/workspace.xml
+++ b/.idea/workspace.xml
--- a/README.md
+++ b/README.md
@@ -15,12 +15,22 @@
  - Lift Chart


-# Version info - new code
+# Version info

 - V 0.0.1

  Largely finished the rewrite of PSI and Lift Chart for model scores stored in MongoDB.
  
+- V 0.0.2
+
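+  - Removed the last_month parameter. The day before the run date is the latest date; the 1st of the month (num_month - 1) months earlier is the start date; the day 45 days before the latest date (when passdue_day=15) is the latest date that has a response.
+
+  - PSI is computed over the span (start date --> latest date).
+
+  - AUC is computed over the span (start date --> latest date with a response).
+
+  - Added anomaly detection on PSI and AUCR (the ratio of a later month's AUC to the baseline month's AUC).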
+
 # VLM

 - To be rewritten.
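The V 0.0.2 date rules above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `monitoring_window` and the `response_lag_days` parameter are hypothetical, and reading the 45 days as a 30-day lag plus `passdue_day` is an assumption drawn only from the passdue_day=15 example.

```python
from datetime import date, timedelta


def monitoring_window(run_date, num_month=4, passdue_day=15, response_lag_days=30):
    """Return (start_date, latest_date, latest_responded_date).

    response_lag_days=30 is an assumption: the README only states that the
    offset is 45 days when passdue_day=15.
    """
    # The day before the run date is the latest date.
    latest = run_date - timedelta(days=1)
    # Step back (num_month - 1) calendar months and take the 1st of that month.
    month_index = latest.year * 12 + (latest.month - 1) - (num_month - 1)
    start = date(month_index // 12, month_index % 12 + 1, 1)
    # Newest application date whose overdue outcome can already be observed.
    latest_responded = latest - timedelta(days=response_lag_days + passdue_day)
    return start, latest, latest_responded


start, latest, responded = monitoring_window(date(2019, 7, 10))
print(start, latest, responded)
```

Under these assumptions PSI would use the span from `start` to `latest`, and AUC the span from `start` to `responded`.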
@@ -29,49 +39,44 @@

 - Both metrics need the model scores, so they are handled together.

-## Code flow
+## Calculation flow

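 - First, extract data (from MySQL and MongoDB) for the model scores to be evaluated, over one shared time span. The main fields are:
  - Model score 1, model score 2, ...
  - Order number, application time, application type, channel type, days overdue.
 - Derive the overdue label from the predefined overdue threshold.
 - Compute the binning rule from the first month in which the model score has records (equal-frequency bins on the model score; PSI is computed on the application set, AUC on the disbursed-loan set).
- For the full sample, first application / repeat application / repeat loan, and each qualifying customer segment (segment size above the preset threshold), compute monthly statistics:
-  - PSI:
+- For the full sample, first application / repeat application / repeat loan, and each qualifying customer segment (segment size above the preset threshold), compute monthly statistics.
+- PSI statistics:
  - Sample size.
  - Sample size per bin
  - Sample share per bin
  - PSI relative to the first month in which this segment's model score has records (the baseline month).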
-  - Lift Chart:
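+- Lift Chart statistics:
  - Sample size
  - Sample size per bin
  - Overdue rate per bin
  - AUC
  - AUC ratio relative to the first month in which this segment's model score has records.
- Summary table, for easy filtering and sorting.
-  - Contains the following fields:
+- Summary table, for easy filtering and sorting, with the following fields:
  - Model score name.
  - Sample size for month a, sample size for month b, ...
  - PSI for month a, PSI for month b, ...
  - AUC for month a, AUC for month b, ...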
-  - NOTE:
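+  - Whether a segment is abnormal (AUC drops markedly, or PSI is large).
+- NOTE:
  - When a month has very few samples or none, it is marked NaN. The corresponding PSI and AUC are also NaN.
  - When a month's sample is small enough that the PSI or AUC computation fails (for example some bins are empty, or every sample is non-overdue), it is marked -999.
  - The baseline month's PSI is 0.
-  - Example: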
-
-![PSI](doc/image/C6640ABE-9017-42b5-A92A-2DE5601A15D8.png)

-![](doc/image/31EA97A8-19B7-45c6-8302-4148D19BAABA.png)
-
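The PSI step in the flow above (equal-frequency bins fitted on the baseline month, then each later month compared against it) can be sketched as below. This is an illustration under stated assumptions, not the code in `model_monitor_PSI_AUC.py`: the function names, `n_bins=10`, and the `eps` clipping of empty bins are all hypothetical choices. The README says the project marks months with empty bins as -999 instead of clipping.

```python
import numpy as np


def inner_edges(base_scores, n_bins=10):
    """Interior cut points of equal-frequency bins fitted on the baseline month."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.unique(np.quantile(np.asarray(base_scores, dtype=float), qs))


def bin_share(scores, edges):
    """Fraction of samples per bin; the two outer bins are open-ended."""
    idx = np.digitize(np.asarray(scores, dtype=float), edges)
    counts = np.bincount(idx, minlength=len(edges) + 1)
    return counts / counts.sum()


def psi(base_scores, month_scores, n_bins=10, eps=1e-6):
    """PSI of one month against the baseline month; NaN when either is empty."""
    if len(base_scores) == 0 or len(month_scores) == 0:
        return float('nan')
    edges = inner_edges(base_scores, n_bins)
    # Assumption: clip empty bins to eps so the logarithm stays finite.
    base = np.clip(bin_share(base_scores, edges), eps, None)
    month = np.clip(bin_share(month_scores, edges), eps, None)
    return float(np.sum((month - base) * np.log(month / base)))


baseline = np.arange(100.0)
print(psi(baseline, baseline))       # baseline month against itself
print(psi(baseline, baseline + 50))  # a shifted score distribution
```

Comparing the baseline month with itself gives a PSI of 0, which matches the note that the baseline month's PSI is 0.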
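-## How to use the code
+## Usage

 - Prepare an Excel sheet listing each model score name and its corresponding database field name.

- Create a model monitor object (and now you have an object).
+- Create a model monitor object (and now we have an object ^v^).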

  ```python
-  mm = ModelMonitor(excel_path='./model_score.xlsx', save_path='./image/', last_month=7, num_month=4, min_user_group=200)
+  mm = ModelMonitor(excel_path='./model_score.xlsx', save_path='./image/', num_month=4, min_user_group=500, max_psi=0.1, min_aucr=0.8)
  ```

  - excel_path: path to the Excel file.
@@ -80,13 +85,7 @@

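    There is no need to create the folder by hand; the code checks whether it exists and creates it if not.

-  - last_month: the last month to include in the statistics.
-
-  - num_month: how many months to include.
-
-    For example, last_month=7, num_month=3 covers the three months 4, 5, and 6.
-
-    NOTE: to make sure samples have a response, the AUC logic moves the window back one more month, so it covers months 4 and 5.
+  - num_month: how many recent months to include, up to and including the day before the run date. If run on 7.1, no July data is included; if run on 7.10, the 9 days of July count as July data.

  - min_user_group: minimum customer segment size.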

@@ -94,18 +93,22 @@

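    Conversely, the finer the granularity, the more charts are produced.
  
+  - max_psi: maximum PSI; a segment above it is treated as abnormal.
+
+  - min_aucr: minimum AUCR; a segment below it is treated as abnormal.
+
 - Call the run function.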

  ```python
  mm.run()
  ```

- Output
+- Output:

-  - Images are saved in ./image.
+- Images are saved in ./image.
  - PSI: ./image/PSI
  - Lift Chart: ./image/AUC
-  - Statistics.
+- Statistics.
  - PSI statistics: ./psi_info.csv
  - AUC statistics: ./auc_info.csv
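The `max_psi` and `min_aucr` thresholds described above could drive the abnormal flag roughly as in this sketch. The function name `flag_anomalies`, the column names, and the choice to treat -999 as "not computable" are assumptions for illustration; the commit shown here does not include the actual detection code.

```python
import numpy as np
import pandas as pd

SENTINEL = -999  # per the README notes: the metric could not be computed


def flag_anomalies(psi_by_month, auc_by_month, max_psi=0.1, min_aucr=0.8):
    """Build a per-month table with AUCR and an abnormal flag."""
    table = pd.DataFrame({
        'psi': pd.Series(psi_by_month, dtype=float),
        'auc': pd.Series(auc_by_month, dtype=float),
    }).replace(SENTINEL, np.nan)
    # AUCR: each month's AUC divided by the baseline (first) month's AUC.
    table['aucr'] = table['auc'] / table['auc'].iloc[0]
    # Comparisons against NaN are False, so months without a metric are not flagged.
    table['abnormal'] = (table['psi'] > max_psi) | (table['aucr'] < min_aucr)
    return table


# Hypothetical values; AUC covers fewer months because it needs a response.
psi_values = {'2019-04': 0.0, '2019-05': 0.02, '2019-06': 0.25, '2019-07': -999}
auc_values = {'2019-04': 0.70, '2019-05': 0.50}
print(flag_anomalies(psi_values, auc_values))
```

In this sketch a month is flagged when either threshold is crossed, and months marked NaN or -999 are left unflagged rather than counted as abnormal.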

@@ -113,6 +116,7 @@

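 - Add code to compute PSI and AUC for model scores stored in MySQL.
 - Finish the VLM rewrite.
+- Some model scores (量信分, the app model) raise errors; follow up with the model maintainers to check whether the field names or something else is wrong.

 # Contributions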


--- a/doc/image/31EA97A8-19B7-45c6-8302-4148D19BAABA.png
+++ b/doc/image/31EA97A8-19B7-45c6-8302-4148D19BAABA.png
--- a/doc/image/C6640ABE-9017-42b5-A92A-2DE5601A15D8.png
+++ b/doc/image/C6640ABE-9017-42b5-A92A-2DE5601A15D8.png
--- a/model_monitor_PSI_AUC.py
+++ b/model_monitor_PSI_AUC.py
@@ -19,8 +19,8 @@ from collections import OrderedDict


 class ModelMonitor:
-    def __init__(self, excel_path='../model_score.xlsx', sheet_name='mongo_model',
-                 passdue_day=15, save_path='../image/',
+    def __init__(self, excel_path='./model_score.xlsx', sheet_name='mongo_model',
+                 passdue_day=15, save_path='./image/',
                 last_month=7, num_month=4, min_user_group=500):

         # The database configuration rarely changes, so it is not exposed as a constructor argument.
@@ -426,28 +426,27 @@ class ModelMonitor:

    def run(self):
         # Fetch MySQL data: the num_month months preceding last_month.
-        # self.mysql_df = self.sql_query('''SELECT order_no, applied_at,
-        #                                 applied_type, applied_from, applied_channel, transacted, passdue_day
-        #                                 FROM risk_analysis
-        #                                 WHERE applied_at > "2019-%s-01 00:00:00"
-        #                                 AND applied_at < "2019-%s-01 00:00:00"'''
-        #                                % (self.int2str(self.last_month - self.num_month), self.int2str(self.last_month)))
-        # print('MySQL data fetched.')
+        self.mysql_df = self.sql_query('''SELECT order_no, applied_at,
+                                        applied_type, applied_from, applied_channel, transacted, passdue_day
+                                        FROM risk_analysis
+                                        WHERE applied_at > "2019-%s-01 00:00:00"
+                                        AND applied_at < "2019-%s-01 00:00:00"'''
+                                       % (self.int2str(self.last_month - self.num_month), self.int2str(self.last_month)))
+        print('MySQL data fetched.')
        # self.mysql_df.to_csv('./mysql_data.csv', index=False)
-
-        self.mysql_df = pd.read_csv('./mysql_data.csv')
+        # self.mysql_df = pd.read_csv('./mysql_data.csv')

         # Fetch MongoDB data: the num_month months preceding last_month.
-        # condition = {'wf_created_at': {'$gte': '2019-%s-01 00:00:00' % self.int2str(self.last_month - self.num_month),
-        #                                '$lte': '2019-%s-01 00:00:00' % self.int2str(self.last_month)}}
-        # fields = {'wf_biz_no': 1, 'wf_created_at': 1}
-        # for f in self.model_feild_list:  # add the model score fields preset in the Excel sheet
-        #     fields[f] = 1
-        # self.mongo_df = self.mongo_query(condition, fields)
-        # print('MongoDB data fetched.')
-        # self.mongo_df.to_csv('./mongo_data.csv', index=False)
+        condition = {'wf_created_at': {'$gte': '2019-%s-01 00:00:00' % self.int2str(self.last_month - self.num_month),
+                                       '$lte': '2019-%s-01 00:00:00' % self.int2str(self.last_month)}}
+        fields = {'wf_biz_no': 1, 'wf_created_at': 1}
+        for f in self.model_feild_list:  # add the model score fields preset in the Excel sheet
+            fields[f] = 1
+        self.mongo_df = self.mongo_query(condition, fields)
+        print('MongoDB data fetched.')

-        self.mongo_df = pd.read_csv('./mongo_data.csv')
+        # self.mongo_df.to_csv('./mongo_data.csv', index=False)
+        # self.mongo_df = pd.read_csv('./mongo_data.csv')

         # Deduplicate the MySQL data.
        self.mysql_df = self.mysql_df.sort_values('passdue_day')
@@ -459,8 +458,27 @@ class ModelMonitor:
                                   left_on='order_no', right_on='wf_biz_no', how='left')
         ## Define overdue users.
        self.merge_data['overdue'] = self.merge_data['passdue_day'] > self.passdue_day
+
+        # Clean the data: values that cannot be parsed as float become NaN.
+        def clean_data(data):
+            try:
+                return float(data)
+            except (TypeError, ValueError, OverflowError):
+                return np.nan
+        na_field_list = []
        for field in self.model_feild_list:
-            self.merge_data[field] = self.merge_data[field].astype('float')
+            if field in self.merge_data.columns.tolist():
+                print('Cleaning %s' % self.model_feild_name_dict[field])
+                self.merge_data[field] = self.merge_data[field].apply(clean_data)
+            else:
+                na_field_list.append(field)
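+        ## Drop fields that were not extracted for some reason.
+        print('The following fields are missing:')
+        for field in na_field_list:
+            # Print the name before deleting its dict entry; a lookup after the delete raises KeyError.
+            print(self.model_feild_name_dict[field])
+            self.model_feild_list.remove(field)
+            self.model_name_list.remove(self.model_feild_name_dict[field])
+            del self.model_feild_name_dict[field]
         print('Data merge complete.')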

         # Split the data by month.
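The cleaning step in this hunk converts each score column value by value and skips fields that were never extracted. A roughly equivalent vectorized sketch uses `pandas.to_numeric` with `errors='coerce'`. The helper name `clean_score_columns` and the sample column names are hypothetical, and edge cases (such as how booleans are converted) may differ from `float()`.

```python
import pandas as pd


def clean_score_columns(df, fields):
    """Coerce the score columns to float and report which fields are absent.

    Returns (cleaned_copy, present_fields, missing_fields).
    """
    present = [f for f in fields if f in df.columns]
    missing = [f for f in fields if f not in df.columns]
    cleaned = df.copy()
    for f in present:
        # Unparseable values (bad strings, None) become NaN instead of raising.
        cleaned[f] = pd.to_numeric(cleaned[f], errors='coerce')
    return cleaned, present, missing


merged = pd.DataFrame({'order_no': [1, 2, 3], 'score_a': ['0.5', 'bad', None]})
cleaned, present, missing = clean_score_columns(merged, ['score_a', 'score_b'])
print(cleaned['score_a'].tolist(), present, missing)
```

Returning the missing fields instead of mutating lists during the loop keeps the bookkeeping separate from the conversion.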
@@ -489,6 +507,6 @@ class ModelMonitor:
         print('Statistics saved.')


-if __name__ == '__main__':
-    pass
-    mm = ModelMonitor(excel_path='./model_score.xlsx', save_path='./image/', last_month=7, num_month=2)
+# if __name__ == '__main__':
+#     pass
+#     mm = ModelMonitor(excel_path='./model_score.xlsx', save_path='./image/', last_month=7, num_month=2)
--- a/model_score.xlsx
+++ b/model_score.xlsx
--- a/test.py
+++ b/test.py
-
-class Solution:
-    def __init__(self):
-
-        pass
-    def find_max_length(self, array, k):
-        if not array:
-            return 0
-        sum_subarray = array[0]
-        left, right = 0, 1
-        max_length = 0
-        while right < len(array):
-            if sum_subarray == k:
-                max_length = max(max_length, right - left)
-                sum_subarray += array[right]
-                right += 1
-
-
-
-
-