bikit

The building inspection toolkit (bikit) is a simple-to-use data and model hub that gathers relevant open-source datasets in the field of damage recognition, together with the corresponding baselines, harmonized into one module. bikit's datasets are enriched with evaluation splits and predefined metrics suited to the specific task and its data distribution. The bikit paper is available here.
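
As a quick orientation, here is a minimal sketch of how datasets can be accessed through bikit. The function names (list_datasets, download_dataset, BikitDataset) and the dataset identifier "mcds_bikit" follow the pattern of bikit's public examples but may differ between versions, so treat them as assumptions and check the package documentation.

```python
# Sketch of bikit's dataset workflow. The imports and the dataset name
# "mcds_bikit" are assumptions based on bikit's examples and may differ
# between versions -- check the package documentation.
from torch.utils.data import DataLoader

from bikit.utils import list_datasets, download_dataset
from bikit.datasets import BikitDataset

list_datasets()                 # print all datasets registered in bikit
download_dataset("mcds_bikit")  # fetch the cleaned eight-class MCDS version

# Predefined evaluation splits come with the dataset.
train_data = BikitDataset("mcds_bikit", split="train")
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
```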

Datasets

So far, two open-source datasets suitable for multi-target classification of damage on built structures are available via bikit: MCDS and CODEBRIM.

MCDS

The MCDS dataset was created for a three-stage approach: depending on the results of the first model, a second model would be applied and, depending on that, a third. We recently transformed this dataset into a single-stage dataset, disregarding the staged data acquisition procedure. The second and third stages determine whether exposed rebars and rust staining are visible. Their negative samples were drawn from the stage-one dataset, so only a few images without exposed rebars carried the label "no exposed bars", and the same applies to "no rust". We cleaned the dataset and transferred it into an eight-class dataset. Because no data splits were provided, we introduced fixed splits for training (2057 samples), validation (270), and testing (270). Further details can be accessed here.

CODEBRIM

The CODEBRIM dataset is the second richest dataset in terms of damage categories. It lacks two classes in comparison to MCDS but comes with a much higher number of unique images. The authors also made splits available, so that all former research remains comparable and no new splits need to be created. The split sizes are 6013 (training), 616 (validation), and 632 (testing). We use the balanced version for our experiments. Since we had difficulties extracting the provided ZIP file, we made a new one available within bikit. Further details can be accessed here.

Classes                 MCDS   CODEBRIM
No Damage                452       2506
Crack                    787       2507
Efflorescence            304        833
Spalling                 422       1898
Exposed Rebars           221       1507
Rust                     350       1559
Scaling                  163          -
Other damages            264          -
Number of images        2597       7261
Average damage/image    1.14       1.13
Table 1. Counts per class, number of images, and average number of damages per image in the MCDS and CODEBRIM datasets.

Models

Currently, all models appearing on the leaderboard are available via bikit. We highly appreciate and encourage everyone to submit models and allow their distribution via bikit. In the initial stage, we provide architectures based on ResNet-50 (RN), EfficientNetV1-B0 (EN), and MobileNetV3-Large (MN).
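
To illustrate what such a multi-target baseline looks like, the following sketch builds a ResNet-50 with one sigmoid output per class using torchvision. This is an illustrative assumption about the setup, not bikit's own model loader; the class count, weights, and threshold are placeholders.

```python
# Illustrative multi-target classifier in the spirit of the bikit baselines.
# This is NOT bikit's own loader; torchvision, the class count and the
# 0.5 threshold are assumptions for demonstration.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 6  # CODEBRIM: NoDam., Crack, Spall., Effl., Expos., Rust
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Multi-target setting: one independent sigmoid per class, trained with
# binary cross-entropy instead of a softmax over mutually exclusive classes.
criterion = nn.BCEWithLogitsLoss()

logits = model(torch.randn(4, 3, 224, 224))   # dummy batch
preds = (torch.sigmoid(logits) > 0.5).int()   # binarized multi-label output
```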

Dataset   Approach  Base   EMR    NoDam.  Crack  Spall.  Effl.  Expos.  Rust   Scal.  Other
CODEBRIM  HTA       RN     73.73  94.67   88.00  84.00   75.84  88.67   79.33  -      -
CODEBRIM  HTA       MN     69.46  94.00   78.67  83.33   69.80  88.67   76.67  -      -
CODEBRIM  DHB       EN     68.67  92.00   83.33  88.67   74.50  92.67   85.33  -      -
MCDS      HTA       MN     54.44  70.00   76.67  58.89   90.00  21.67   68.33  43.33  46.67
MCDS      DHB       EN     51.85  46.67   73.33  61.11   80.00  38.33   75.00  43.33  46.67
MCDS      DHB       RN     48.15  66.67   73.33  44.44   86.67  23.33   65.00  36.67  43.33
Table 2. EMR and per-class recall (in percent) of the best models trained on the MCDS and CODEBRIM datasets.

Metrics

On the one hand, the authors of MCDS chose to provide a huge number of metrics: they report 10 different metrics for all three stages and all classes. On the other hand, the CODEBRIM authors chose to report the Exact Match Ratio (EMR) only. Later analyses chose AUROC, accuracy, F1-score, precision, and recall on an aggregated level (not per class). We have chosen a middle ground between one overall metric and 10 metrics per class. The main metric for multi-target classification on dacl.ai is EMR, but we also report the F1-score. In addition, we provide recall by class and make both metrics accessible in the leaderboard. To be ranked on the leaderboard, a CSV file must be uploaded that contains the binarized results on the test set of any dataset tackling the problem of damage recognition on built structures. To contribute, click the Submit button in the upper right of the page.
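
A submission file could be produced along the following lines. The column names and the random placeholder predictions are assumptions for illustration; the exact CSV layout expected by dacl.ai is defined in the submission guidelines.

```python
# Sketch of a leaderboard submission file with binarized test-set results.
# Column names and layout are assumptions -- consult dacl.ai's submission
# guidelines for the exact format.
import numpy as np
import pandas as pd

classes = ["NoDamage", "Crack", "Spalling",
           "Efflorescence", "ExposedRebars", "Rust"]

# Placeholder: random predictions for the 632 CODEBRIM test images.
# In practice these would be your model's thresholded sigmoid outputs.
y_pred = (np.random.rand(632, len(classes)) > 0.5).astype(int)

pd.DataFrame(y_pred, columns=classes).to_csv("submission.csv", index=False)
```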

EMR

While EMR is not a metric designed specifically for imbalanced data, it is strict enough to adequately address the distribution problem: a sample only counts as correctly classified if the prediction matches the ground truth in every single class. A side effect is that current values are still far from a perfect fit, which leaves room for improvement in this research domain.
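
For clarity, here is a minimal NumPy implementation of EMR on binarized label matrices (rows are samples, columns are classes):

```python
import numpy as np

def exact_match_ratio(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of samples whose predicted label vector matches the
    ground truth in every class (shape: samples x classes)."""
    return float(np.all(y_true == y_pred, axis=1).mean())

# Toy example with 3 samples and 4 binary labels:
y_true = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
print(exact_match_ratio(y_true, y_pred))  # 2 of 3 rows match -> 0.667
```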

F1-score

The F1-score is the harmonic mean of precision and recall. Because it balances both error types, it works especially well on imbalanced data. You can find more info about the F1-score here.
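
As a minimal sketch, a micro-averaged F1 over a binarized multi-label matrix can be computed as below. Whether dacl.ai uses micro or macro averaging is not stated here, so the micro variant is an assumption.

```python
import numpy as np

def f1_micro(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Micro-averaged F1: pool true positives, false positives and false
    negatives over all classes, then take the harmonic mean of the
    resulting precision and recall."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```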

Recall classwise

The classwise recall shows the sensitivity or, in other words, the number of times a specific damage was correctly recognized divided by the number of positive labels for that damage. For engineers, it is mainly relevant to see how many damages are overlooked. In current approaches, machine learning-based software is made to support, not to replace, engineers. Consequently, detecting every existing damage is more important than avoiding false alarms, since overlooking a risky damage is the costlier error. For that reason, we provide recall for all classes, in order to show where current models still fail most often.
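
A short sketch of per-class recall over binarized predictions:

```python
import numpy as np

def recall_per_class(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Recall for each class: true positives divided by all positive
    labels of that class (rows are samples, columns are classes)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    positives = np.sum(y_true == 1, axis=0)
    # Guard against classes with no positive labels in the test set.
    return np.where(positives > 0, tp / np.maximum(positives, 1), 0.0)
```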