Evaluation
Evaluation for folds
For evaluation of the models on the folds, we use 4 folds cross-validation. So, as we have 40k samples, we split them into 4 folds of 10k samples on the validation. And then we trained and validated on each fold separately.
Fold 1
| metrics | |
|---|---|
| eval_loss | 0.00446131 |
| eval_overall_precision | 0.783601 |
| eval_overall_recall | 0.726919 |
| eval_overall_f1 | 0.754196 |
| eval_overall_accuracy | 0.936082 |
| eval_macro_avg.precision | 0.665234 |
| eval_macro_avg.recall | 0.60265 |
| eval_macro_avg.f1-score | 0.630713 |
| eval_macro_avg.support | 35418 |
| eval_weighted_avg.precision | 0.780372 |
| eval_weighted_avg.recall | 0.726919 |
| eval_weighted_avg.f1-score | 0.751985 |
| eval_weighted_avg.support | 35418 |
| eval_micro_avg.precision | 0.783601 |
| eval_micro_avg.recall | 0.726919 |
| eval_micro_avg.f1-score | 0.754196 |
| eval_micro_avg.support | 35418 |
| eval_runtime | 1160.95 |
| eval_samples_per_second | 52.292 |
| eval_steps_per_second | 26.147 |
Fold 2
| metrics | |
|---|---|
| eval_loss | 0.00431787 |
| eval_overall_precision | 0.782074 |
| eval_overall_recall | 0.712519 |
| eval_overall_f1 | 0.745678 |
| eval_overall_accuracy | 0.933556 |
| eval_macro_avg.precision | 0.667987 |
| eval_macro_avg.recall | 0.58284 |
| eval_macro_avg.f1-score | 0.620279 |
| eval_macro_avg.support | 36959 |
| eval_weighted_avg.precision | 0.778731 |
| eval_weighted_avg.recall | 0.712519 |
| eval_weighted_avg.f1-score | 0.743216 |
| eval_weighted_avg.support | 36959 |
| eval_micro_avg.precision | 0.782074 |
| eval_micro_avg.recall | 0.712519 |
| eval_micro_avg.f1-score | 0.745678 |
| eval_micro_avg.support | 36959 |
| eval_runtime | 1255.45 |
| eval_samples_per_second | 52.914 |
| eval_steps_per_second | 26.457 |
Fold 3
| metrics | |
|---|---|
| eval_loss | 0.00460493 |
| eval_overall_precision | 0.804563 |
| eval_overall_recall | 0.688634 |
| eval_overall_f1 | 0.742098 |
| eval_overall_accuracy | 0.932954 |
| eval_macro_avg.precision | 0.694945 |
| eval_macro_avg.recall | 0.561292 |
| eval_macro_avg.f1-score | 0.617365 |
| eval_macro_avg.support | 36311 |
| eval_weighted_avg.precision | 0.798574 |
| eval_weighted_avg.recall | 0.688634 |
| eval_weighted_avg.f1-score | 0.737633 |
| eval_weighted_avg.support | 36311 |
| eval_micro_avg.precision | 0.804563 |
| eval_micro_avg.recall | 0.688634 |
| eval_micro_avg.f1-score | 0.742098 |
| eval_micro_avg.support | 36311 |
| eval_runtime | 1230.3 |
| eval_samples_per_second | 51.72 |
| eval_steps_per_second | 25.86 |
Fold 4
| metrics | |
|---|---|
| eval_loss | 0.00429184 |
| eval_overall_precision | 0.779286 |
| eval_overall_recall | 0.728163 |
| eval_overall_f1 | 0.752858 |
| eval_overall_accuracy | 0.936848 |
| eval_macro_avg.precision | 0.66828 |
| eval_macro_avg.recall | 0.604989 |
| eval_macro_avg.f1-score | 0.632891 |
| eval_macro_avg.support | 35639 |
| eval_weighted_avg.precision | 0.777074 |
| eval_weighted_avg.recall | 0.728163 |
| eval_weighted_avg.f1-score | 0.75088 |
| eval_weighted_avg.support | 35639 |
| eval_micro_avg.precision | 0.779286 |
| eval_micro_avg.recall | 0.728163 |
| eval_micro_avg.f1-score | 0.752858 |
| eval_micro_avg.support | 35639 |
| eval_runtime | 1220.92 |
| eval_samples_per_second | 52.255 |
| eval_steps_per_second | 26.128 |
Average
Average of the folds with range(plus/minus 1.96 std, ~95% CI):
| metrics | |
|---|---|
| eval_loss | 0.00 ± 0.00 |
| eval_ANG.precision | 0.81 ± 0.04 |
| eval_ANG.recall | 0.77 ± 0.04 |
| eval_ANG.f1 | 0.79 ± 0.04 |
| eval_ANG.number | 237, 243, 219, 243 |
| eval_DUC.precision | 0.69 ± 0.02 |
| eval_DUC.recall | 0.57 ± 0.04 |
| eval_DUC.f1 | 0.62 ± 0.02 |
| eval_DUC.number | 597, 555, 563, 551 |
| eval_EVE.precision | 0.61 ± 0.05 |
| eval_EVE.recall | 0.58 ± 0.03 |
| eval_EVE.f1 | 0.59 ± 0.01 |
| eval_EVE.number | 792, 937, 791, 833 |
| eval_FAC.precision | 0.62 ± 0.04 |
| eval_FAC.recall | 0.47 ± 0.03 |
| eval_FAC.f1 | 0.53 ± 0.01 |
| eval_FAC.number | 1399, 1393, 1302, 1321 |
| eval_GPE.precision | 0.86 ± 0.01 |
| eval_GPE.recall | 0.82 ± 0.01 |
| eval_GPE.f1 | 0.84 ± 0.01 |
| eval_GPE.number | 8524, 8840, 8537, 8483 |
| eval_INFORMAL.precision | 0.00 ± 0.00 |
| eval_INFORMAL.recall | 0.00 ± 0.00 |
| eval_INFORMAL.f1 | 0.00 ± 0.00 |
| eval_INFORMAL.number | 10, 15, 12, 14 |
| eval_LOC.precision | 0.66 ± 0.02 |
| eval_LOC.recall | 0.56 ± 0.03 |
| eval_LOC.f1 | 0.61 ± 0.02 |
| eval_LOC.number | 1631, 1644, 1626, 1644 |
| eval_MISC.precision | 0.70 ± 0.02 |
| eval_MISC.recall | 0.59 ± 0.01 |
| eval_MISC.f1 | 0.64 ± 0.01 |
| eval_MISC.number | 2398, 2535, 2575, 2616 |
| eval_ORG.precision | 0.74 ± 0.02 |
| eval_ORG.recall | 0.67 ± 0.02 |
| eval_ORG.f1 | 0.71 ± 0.01 |
| eval_ORG.number | 6265, 6411, 6415, 6326 |
| eval_PER.precision | 0.88 ± 0.00 |
| eval_PER.recall | 0.83 ± 0.02 |
| eval_PER.f1 | 0.85 ± 0.01 |
| eval_PER.number | 6946, 7451, 7394, 7148 |
| eval_TIMEX.precision | 0.82 ± 0.01 |
| eval_TIMEX.recall | 0.75 ± 0.02 |
| eval_TIMEX.f1 | 0.78 ± 0.01 |
| eval_TIMEX.number | 3082, 3216, 3126, 2969 |
| eval_TTL.precision | 0.69 ± 0.04 |
| eval_TTL.recall | 0.63 ± 0.05 |
| eval_TTL.f1 | 0.66 ± 0.01 |
| eval_TTL.number | 2815, 2980, 3025, 2768 |
| eval_WOA.precision | 0.69 ± 0.03 |
| eval_WOA.recall | 0.41 ± 0.03 |
| eval_WOA.f1 | 0.52 ± 0.02 |
| eval_WOA.number | 722, 739, 726, 723 |
| eval_overall_precision | 0.79 ± 0.01 |
| eval_overall_recall | 0.71 ± 0.02 |
| eval_overall_f1 | 0.75 ± 0.01 |
| eval_overall_accuracy | 0.93 ± 0.00 |
| eval_macro_avg.precision | 0.67 ± 0.01 |
| eval_macro_avg.recall | 0.59 ± 0.02 |
| eval_macro_avg.f1-score | 0.63 ± 0.01 |
| eval_macro_avg.support | 35418.0, 36959.0, 36311.0, 35639.0 |
| eval_weighted_avg.precision | 0.78 ± 0.01 |
| eval_weighted_avg.recall | 0.71 ± 0.02 |
| eval_weighted_avg.f1-score | 0.75 ± 0.01 |
| eval_weighted_avg.support | 35418.0, 36959.0, 36311.0, 35639.0 |
| eval_micro_avg.precision | 0.79 ± 0.01 |
| eval_micro_avg.recall | 0.71 ± 0.02 |
| eval_micro_avg.f1-score | 0.75 ± 0.01 |
| eval_micro_avg.support | 35418.0, 36959.0, 36311.0, 35639.0 |
| eval_runtime | 1216.91 ± 40.05 |
| eval_samples_per_second | 52.292, 52.914, 51.72, 52.255 |
| eval_steps_per_second | 26.147, 26.457, 25.86, 26.128 |
| fold | fold_1, fold_2, fold_3, fold_4 |