The building inspection toolkit (bikit) is an easy-to-use data and model hub that gathers relevant open-source datasets in the field of damage recognition, together with the corresponding baselines, harmonized into one module. The datasets in bikit are enriched with evaluation splits and predefined metrics that suit the specific task and its data distribution. The bikit paper is available here.


So far, two open-source datasets suitable for multi-target classification of damage on massive constructions are available via bikit: MCDS and CODEBRIM.


The MCDS dataset was originally created for a three-stage approach: depending on the results of the first model, a second model is applied and, depending on that, a third. The dataset was recently transformed into a single-stage setup in which the original data-acquisition procedure is disregarded. The second and third stages determine whether exposed rebars and rust staining are visible; their negative samples were drawn from the stage-one dataset. As a result, only a few images without exposed rebars carried the label "no exposed bars", and the same applies to "no rust". We cleaned the dataset and converted it into an eight-class dataset. Since no data splits were provided, we introduced fixed splits for training (2,057 samples), validation (270), and testing (270). Further details can be accessed here.


The CODEBRIM dataset is the second-richest dataset in terms of damage categories. It lacks two classes compared to MCDS but contains a much higher number of unique images. The authors also made splits available, so all former research remains comparable and no new splits need to be created. The split sizes are 6,013 (training), 616 (validation), and 632 (testing). We use the balanced version for our experiments. Since we had difficulties extracting the provided ZIP file, we made a new one available within bikit. Further details can be accessed here.

| Class | MCDS | CODEBRIM |
|----------------------|-------|-------|
| No Damage | 452 | 2,506 |
| Crack | 787 | 2,507 |
| Efflorescence | 304 | 833 |
| Spalling | 422 | 1,898 |
| Exposed Rebars | 221 | 1,507 |
| Rust | 350 | 1,559 |
| Scaling | 163 | - |
| Other damages | 264 | - |
| Number of images | 2,597 | 7,261 |
| Average damage/image | 1.14 | 1.13 |

Table 1. Counts of classes, number of images, and average damages per image in the MCDS and CODEBRIM datasets.


Currently, all models appearing on the leaderboard are available via bikit. We highly appreciate and encourage everyone to submit a model and allow its distribution via bikit. In the initial stage, we provide architectures based on ResNet-50 (RN), EfficientNetV1-B0 (EN), and MobileNetV3-Large (MN).

| Dataset  | Approach | Base | EMR   | NoDam. | Crack | Spall. | Effl. | Expos. | Rust  | Scal. | Other |
|----------|----------|------|-------|--------|-------|--------|-------|--------|-------|-------|-------|
| CODEBRIM | HTA      | RN   | 73.73 | 94.67  | 88.00 | 84.00  | 75.84 | 88.67  | 79.33 | -     | -     |
| CODEBRIM | HTA      | MN   | 69.46 | 94.00  | 78.67 | 83.33  | 69.80 | 88.67  | 76.67 | -     | -     |
| CODEBRIM | DHB      | EN   | 68.67 | 92.00  | 83.33 | 88.67  | 74.50 | 92.67  | 85.33 | -     | -     |
| MCDS     | HTA      | MN   | 54.44 | 70.00  | 76.67 | 58.89  | 90.00 | 21.67  | 68.33 | 43.33 | 46.67 |
| MCDS     | DHB      | EN   | 51.85 | 46.67  | 73.33 | 61.11  | 80.00 | 38.33  | 75.00 | 43.33 | 46.67 |
| MCDS     | DHB      | RN   | 48.15 | 66.67  | 73.33 | 44.44  | 86.67 | 23.33  | 65.00 | 36.67 | 43.33 |

Table 2. EMR and recall by class (in percent) of the best models trained on the MCDS and CODEBRIM datasets.


On the one hand, the authors of MCDS chose to provide a large number of metrics: they report 10 different metrics for all three stages and for every class. On the other hand, the CODEBRIM authors chose to report the Exact Match Ratio (EMR) only. Later analyses used AUROC, accuracy, F1-score, precision, and recall on an aggregated level (not per class). We have chosen a middle ground between a single overall metric and 10 metrics per class: our main metric for multi-target classification is EMR, but we also report the F1-score. In addition, we provide recall by class and make both metrics accessible in the leaderboard. Again: to be ranked on the leaderboard, a CSV file must be uploaded that contains the binarized results on the test set of any dataset tackling the problem of damage recognition on built structures. To contribute, click the Submit button in the upper right of the page.


While EMR was not created specifically for unbalanced data, it is strict enough that the distribution problem is adequately addressed: a sample only counts as correct if every one of its class labels is predicted correctly. A side effect is that current values are still far from a perfect fit, which leaves room for improvement in this research domain.
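This all-or-nothing behavior can be sketched in a few lines of plain Python. The toy data and the function below are illustrative only, not bikit's actual implementation:

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose full binarized label vector is predicted
    correctly; a single wrong class makes the whole sample count as a miss."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Toy example: 3 samples, 4 damage classes (binarized multi-target labels).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]  # sample 2: one wrong class
print(exact_match_ratio(y_true, y_pred))  # 2 of 3 samples match exactly
```

Note that the middle sample is counted as wrong even though three of its four class labels are correct, which is exactly why EMR is such a demanding metric.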


The F1-score is the harmonic mean of precision and recall. It works especially well on imbalanced data. You can find more information about the F1-score here.
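As a quick sketch with made-up counts (not taken from the datasets above), the F1-score can be computed directly from true positives, false positives, and false negatives:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision (TP / (TP + FP)) and recall (TP / (TP + FN)),
    algebraically equal to 2*TP / (2*TP + FP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy counts: 80 correct detections, 10 false alarms, 30 overlooked damages.
print(f1_score(tp=80, fp=10, fn=30))  # 2*80 / (2*80 + 10 + 30) ≈ 0.8
```

Because true negatives do not enter the formula, a dominant negative class cannot inflate the score, which is what makes F1 suitable for imbalanced data.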

Classwise recall

The classwise recall shows the sensitivity or, in other words, the number of times a specific damage was correctly recognized divided by the number of positive labels for that damage. For engineers, it is mainly relevant to see how many damages are overlooked. In current approaches, machine-learning-based software is meant to support, not replace, engineers. Consequently, it is more vital to detect every damage that exists than to miss a risky one. For that reason, we decided to provide recall measures for all classes, so that one can see where current models still frequently fail.
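A minimal sketch of classwise recall on binarized multi-target labels may make this concrete. The data and implementation below are illustrative, not bikit's actual code:

```python
def recall_per_class(y_true, y_pred):
    """Recall (sensitivity) per class: TP / (TP + FN), computed column-wise."""
    n_classes = len(y_true[0])
    recalls = []
    for c in range(n_classes):
        tp = sum(t[c] == 1 and p[c] == 1 for t, p in zip(y_true, y_pred))
        positives = sum(t[c] == 1 for t in y_true)
        recalls.append(tp / positives if positives else 0.0)
    return recalls

# Toy example: 4 samples, 3 classes (think Crack, Spalling, Rust).
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
y_pred = [[1, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 1]]
print(recall_per_class(y_true, y_pred))  # class 0: 2 of 3 positives found
```

Note that false positives (like the spurious class-2 prediction in the last sample) do not lower recall; they would show up in precision and F1 instead.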