Evaluation
Evaluation for folds
We evaluate the models with 4-fold cross-validation.
The 40k samples are split into 4 folds, so each fold holds out 10k samples for validation. Training and validation are then run separately for each fold.
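The split described above can be sketched in a few lines of plain Python. This is only an illustration: the source does not say how the folds were actually drawn, so the shuffling, the `seed` value, and the helper name `k_fold_indices` are all assumptions.

```python
import random


def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Hypothetical helper: shuffling and the seed are assumptions,
    not the procedure used to produce the results in this document.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        end = n_samples if i == k - 1 else start + fold_size
        val_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, val_idx


folds = list(k_fold_indices(40_000, k=4))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold_{i}: train={len(train_idx)} val={len(val_idx)}")
```

Every sample lands in exactly one validation fold, so each of the four runs is validated on data it was not trained on.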
Fold 1
| Metric | Value |
|---|---|
| eval_loss | 0.00446131 |
| eval_overall_precision | 0.783601 |
| eval_overall_recall | 0.726919 |
| eval_overall_f1 | 0.754196 |
| eval_overall_accuracy | 0.936082 |
| eval_macro_avg.precision | 0.665234 |
| eval_macro_avg.recall | 0.60265 |
| eval_macro_avg.f1-score | 0.630713 |
| eval_macro_avg.support | 35418 |
| eval_weighted_avg.precision | 0.780372 |
| eval_weighted_avg.recall | 0.726919 |
| eval_weighted_avg.f1-score | 0.751985 |
| eval_weighted_avg.support | 35418 |
| eval_micro_avg.precision | 0.783601 |
| eval_micro_avg.recall | 0.726919 |
| eval_micro_avg.f1-score | 0.754196 |
| eval_micro_avg.support | 35418 |
| eval_runtime | 1160.95 |
| eval_samples_per_second | 52.292 |
| eval_steps_per_second | 26.147 |
Fold 2
| Metric | Value |
|---|---|
| eval_loss | 0.00431787 |
| eval_overall_precision | 0.782074 |
| eval_overall_recall | 0.712519 |
| eval_overall_f1 | 0.745678 |
| eval_overall_accuracy | 0.933556 |
| eval_macro_avg.precision | 0.667987 |
| eval_macro_avg.recall | 0.58284 |
| eval_macro_avg.f1-score | 0.620279 |
| eval_macro_avg.support | 36959 |
| eval_weighted_avg.precision | 0.778731 |
| eval_weighted_avg.recall | 0.712519 |
| eval_weighted_avg.f1-score | 0.743216 |
| eval_weighted_avg.support | 36959 |
| eval_micro_avg.precision | 0.782074 |
| eval_micro_avg.recall | 0.712519 |
| eval_micro_avg.f1-score | 0.745678 |
| eval_micro_avg.support | 36959 |
| eval_runtime | 1255.45 |
| eval_samples_per_second | 52.914 |
| eval_steps_per_second | 26.457 |
Fold 3
| Metric | Value |
|---|---|
| eval_loss | 0.00460493 |
| eval_overall_precision | 0.804563 |
| eval_overall_recall | 0.688634 |
| eval_overall_f1 | 0.742098 |
| eval_overall_accuracy | 0.932954 |
| eval_macro_avg.precision | 0.694945 |
| eval_macro_avg.recall | 0.561292 |
| eval_macro_avg.f1-score | 0.617365 |
| eval_macro_avg.support | 36311 |
| eval_weighted_avg.precision | 0.798574 |
| eval_weighted_avg.recall | 0.688634 |
| eval_weighted_avg.f1-score | 0.737633 |
| eval_weighted_avg.support | 36311 |
| eval_micro_avg.precision | 0.804563 |
| eval_micro_avg.recall | 0.688634 |
| eval_micro_avg.f1-score | 0.742098 |
| eval_micro_avg.support | 36311 |
| eval_runtime | 1230.3 |
| eval_samples_per_second | 51.72 |
| eval_steps_per_second | 25.86 |
Fold 4
| Metric | Value |
|---|---|
| eval_loss | 0.00429184 |
| eval_overall_precision | 0.779286 |
| eval_overall_recall | 0.728163 |
| eval_overall_f1 | 0.752858 |
| eval_overall_accuracy | 0.936848 |
| eval_macro_avg.precision | 0.66828 |
| eval_macro_avg.recall | 0.604989 |
| eval_macro_avg.f1-score | 0.632891 |
| eval_macro_avg.support | 35639 |
| eval_weighted_avg.precision | 0.777074 |
| eval_weighted_avg.recall | 0.728163 |
| eval_weighted_avg.f1-score | 0.75088 |
| eval_weighted_avg.support | 35639 |
| eval_micro_avg.precision | 0.779286 |
| eval_micro_avg.recall | 0.728163 |
| eval_micro_avg.f1-score | 0.752858 |
| eval_micro_avg.support | 35639 |
| eval_runtime | 1220.92 |
| eval_samples_per_second | 52.255 |
| eval_steps_per_second | 26.128 |
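As a quick consistency check on the per-fold results above, the overall F1 in each fold should be the harmonic mean of the overall precision and recall. The sketch below recomputes it from the reported values. The helper name `f1_from_pr` and the tolerance are assumptions, and small differences are expected because the inputs are already rounded to six decimals.

```python
import math


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (eval_overall_precision, eval_overall_recall, eval_overall_f1) per fold,
# copied from the per-fold tables.
reported = {
    "fold_1": (0.783601, 0.726919, 0.754196),
    "fold_2": (0.782074, 0.712519, 0.745678),
    "fold_3": (0.804563, 0.688634, 0.742098),
    "fold_4": (0.779286, 0.728163, 0.752858),
}

recomputed = {
    fold: f1_from_pr(precision, recall)
    for fold, (precision, recall, _) in reported.items()
}

for fold, (_, _, f1) in reported.items():
    matches = math.isclose(recomputed[fold], f1, abs_tol=1e-5)
    print(f"{fold}: reported={f1:.6f} recomputed={recomputed[fold]:.6f} ok={matches}")
```

The same check applies to the `eval_micro_avg.*` rows, whose values are identical to the `eval_overall_*` rows in every fold.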
Average
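Average over the four folds, reported as mean ± range. The range is described as 1.96 standard deviations (~95% CI), but for the rows that can be recomputed from the per-fold tables above (eval_overall_precision, eval_overall_recall, eval_overall_f1, eval_runtime) the reported values match one sample standard deviation, for example 1216.91 ± 40.05 for eval_runtime. Rows with four comma-separated values list the per-fold values in fold order instead of an average: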
| Metric | Value |
|---|---|
| eval_loss | 0.00 ± 0.00 |
| eval_ANG.precision | 0.81 ± 0.04 |
| eval_ANG.recall | 0.77 ± 0.04 |
| eval_ANG.f1 | 0.79 ± 0.04 |
| eval_ANG.number | 237, 243, 219, 243 |
| eval_DUC.precision | 0.69 ± 0.02 |
| eval_DUC.recall | 0.57 ± 0.04 |
| eval_DUC.f1 | 0.62 ± 0.02 |
| eval_DUC.number | 597, 555, 563, 551 |
| eval_EVE.precision | 0.61 ± 0.05 |
| eval_EVE.recall | 0.58 ± 0.03 |
| eval_EVE.f1 | 0.59 ± 0.01 |
| eval_EVE.number | 792, 937, 791, 833 |
| eval_FAC.precision | 0.62 ± 0.04 |
| eval_FAC.recall | 0.47 ± 0.03 |
| eval_FAC.f1 | 0.53 ± 0.01 |
| eval_FAC.number | 1399, 1393, 1302, 1321 |
| eval_GPE.precision | 0.86 ± 0.01 |
| eval_GPE.recall | 0.82 ± 0.01 |
| eval_GPE.f1 | 0.84 ± 0.01 |
| eval_GPE.number | 8524, 8840, 8537, 8483 |
| eval_INFORMAL.precision | 0.00 ± 0.00 |
| eval_INFORMAL.recall | 0.00 ± 0.00 |
| eval_INFORMAL.f1 | 0.00 ± 0.00 |
| eval_INFORMAL.number | 10, 15, 12, 14 |
| eval_LOC.precision | 0.66 ± 0.02 |
| eval_LOC.recall | 0.56 ± 0.03 |
| eval_LOC.f1 | 0.61 ± 0.02 |
| eval_LOC.number | 1631, 1644, 1626, 1644 |
| eval_MISC.precision | 0.70 ± 0.02 |
| eval_MISC.recall | 0.59 ± 0.01 |
| eval_MISC.f1 | 0.64 ± 0.01 |
| eval_MISC.number | 2398, 2535, 2575, 2616 |
| eval_ORG.precision | 0.74 ± 0.02 |
| eval_ORG.recall | 0.67 ± 0.02 |
| eval_ORG.f1 | 0.71 ± 0.01 |
| eval_ORG.number | 6265, 6411, 6415, 6326 |
| eval_PER.precision | 0.88 ± 0.00 |
| eval_PER.recall | 0.83 ± 0.02 |
| eval_PER.f1 | 0.85 ± 0.01 |
| eval_PER.number | 6946, 7451, 7394, 7148 |
| eval_TIMEX.precision | 0.82 ± 0.01 |
| eval_TIMEX.recall | 0.75 ± 0.02 |
| eval_TIMEX.f1 | 0.78 ± 0.01 |
| eval_TIMEX.number | 3082, 3216, 3126, 2969 |
| eval_TTL.precision | 0.69 ± 0.04 |
| eval_TTL.recall | 0.63 ± 0.05 |
| eval_TTL.f1 | 0.66 ± 0.01 |
| eval_TTL.number | 2815, 2980, 3025, 2768 |
| eval_WOA.precision | 0.69 ± 0.03 |
| eval_WOA.recall | 0.41 ± 0.03 |
| eval_WOA.f1 | 0.52 ± 0.02 |
| eval_WOA.number | 722, 739, 726, 723 |
| eval_overall_precision | 0.79 ± 0.01 |
| eval_overall_recall | 0.71 ± 0.02 |
| eval_overall_f1 | 0.75 ± 0.01 |
| eval_overall_accuracy | 0.93 ± 0.00 |
| eval_macro_avg.precision | 0.67 ± 0.01 |
| eval_macro_avg.recall | 0.59 ± 0.02 |
| eval_macro_avg.f1-score | 0.63 ± 0.01 |
| eval_macro_avg.support | 35418.0, 36959.0, 36311.0, 35639.0 |
| eval_weighted_avg.precision | 0.78 ± 0.01 |
| eval_weighted_avg.recall | 0.71 ± 0.02 |
| eval_weighted_avg.f1-score | 0.75 ± 0.01 |
| eval_weighted_avg.support | 35418.0, 36959.0, 36311.0, 35639.0 |
| eval_micro_avg.precision | 0.79 ± 0.01 |
| eval_micro_avg.recall | 0.71 ± 0.02 |
| eval_micro_avg.f1-score | 0.75 ± 0.01 |
| eval_micro_avg.support | 35418.0, 36959.0, 36311.0, 35639.0 |
| eval_runtime | 1216.91 ± 40.05 |
| eval_samples_per_second | 52.292, 52.914, 51.72, 52.255 |
| eval_steps_per_second | 26.147, 26.457, 25.86, 26.128 |
| fold | fold_1, fold_2, fold_3, fold_4 |
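The aggregation in this table can be sketched with the standard library, using the per-fold values reported earlier in this section. This is an illustration under stated assumptions, not the script that produced the table: the `summarize` helper and its `z` multiplier are hypothetical. For the four rows below, `z=1.0` (one sample standard deviation) reproduces the ± values shown in the table, and `z=1.96` would give a wider range.

```python
from statistics import mean, stdev

# Per-fold values copied from the Fold 1 to Fold 4 tables.
per_fold = {
    "eval_overall_precision": [0.783601, 0.782074, 0.804563, 0.779286],
    "eval_overall_recall": [0.726919, 0.712519, 0.688634, 0.728163],
    "eval_overall_f1": [0.754196, 0.745678, 0.742098, 0.752858],
    "eval_runtime": [1160.95, 1255.45, 1230.3, 1220.92],
}


def summarize(values, z=1.0):
    """Return (mean, z * sample standard deviation) for one metric.

    Hypothetical helper: `z` scales the spread, e.g. 1.96 for a
    normal-approximation ~95% range.
    """
    return mean(values), z * stdev(values)  # stdev uses n - 1


summary = {name: summarize(values) for name, values in per_fold.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: {avg:.2f} ± {spread:.2f}")
```

With only four folds the standard deviation is itself a rough estimate, so the range is best read as an indication of fold-to-fold variation.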