journal article
Deep neural networks for analysis of fisheries surveillance video and automated monitoring of fish discards
French, Geoff;Mackiewicz, Michal;Fisher, Mark;Holah, Helen;Kilburn, Rachel;Campbell, Neil;Needle, Coby;
doi: 10.1093/icesjms/fsz149
Abstract

We report on the development of a computer vision system that analyses video from CCTV systems installed on fishing trawlers for the purpose of monitoring and quantifying discarded fish catch. Our system is designed to operate in spite of the challenging computer vision problem posed by conditions on-board fishing trawlers. We describe the approaches developed for isolating and segmenting individual fish and for species classification. We present an analysis of the variability of manual species identification performed by expert human observers and contrast the performance of our species classifier against this benchmark. We also quantify the effect of the domain gap on the performance of modern deep neural network-based computer vision systems.

Introduction

The quantity of fish discards on-board fishing trawlers is currently estimated via measurements obtained during on-board observer sampling. The quantity of discard data is therefore limited by the availability and cost of the observers. In contrast, more precise measurements of the quantity of catch landed at port are available as it is weighed to ensure compliance with the trawler's individual quota. Quota is assigned according to the total allowable catch quota established by the Common Fisheries Policy of the European Union. A pilot catch quota management scheme (CQMS) in the UK aimed to improve the quality of discard estimations by installing electronic monitoring systems on-board participating trawlers within the Scottish demersal fishing fleet. These systems included video surveillance cameras monitoring the conveyor belts on which fish are processed or discarded. Marine Scotland Science analysts reviewed the numbers, sizes, and species of fish caught per vessel by sampling each vessel’s video record when it returned to port (Needle et al., 2014).
Manually counting, measuring and identifying the species of the discarded fish has proved to be laborious and time consuming, motivating the development of a computer vision system designed to analyse the footage automatically. The intended end result of the project is a system that supports the experts by automating as much of the tedious and expensive manual analysis as possible. We can therefore outline the main requirements of the computer vision component of the system: first, to detect and count fish leaving the discard chute, and second, to classify and measure a subset of commercial species. Such a system must be robust to the multiple occlusions and unstructured scenes that arise in the unconstrained environment of a commercial fishing trawler; fish are randomly oriented and frequently occlude one another, and the view of the working area may be occluded by fishers processing the catch (see Figure 1).

Figure 1. Images from each vessel. (a) Vessel A, (b) Vessel B, (c) Vessel C, (d) Vessel D, and (e) Vessel R.

Deep neural networks have established state-of-the-art results in computer vision problems including image classification, object detection, and image segmentation. Their impressive performance, however, comes at the cost of requiring large quantities of annotated training data (Lin et al., 2014; Russakovsky et al., 2015). We review a body of prior work in the “Background” section. We discuss prior work in automated analysis of fishing data and work that underpins the computer vision components of our system. The experiments that we performed required a body of training data consisting of images extracted from the video footage along with precise ground truth annotations.
The dataset was developed in collaboration with observer experts at Marine Scotland, using a web-based annotation tool developed for this task. The dataset and the tool are described in the “Dataset and data acquisition tools” section. We use instance segmentation to isolate individual fish within an image. This component is discussed in the “Instance segmentation” section. We refer to our earlier work in French et al. (2015) that focuses on segmentation and discuss the more modern Mask R-CNN (He et al., 2017) instance segmentation approach that we have adopted in its place. The fish that are detected and isolated by the segmentation system are passed to a species classifier for identification. The development of this classifier and its performance is discussed in the “Species identification” section. To assess the performance of our classifier we conducted an experiment in which 8 expert human observers were asked to identify the species of 250 fish that were extracted from the surveillance footage. We analyse the variability of expert human observers and contrast the performance of our classifier against this benchmark in the “Inter-observer variability experiment” section. The future directions of this work can be found in the “Conclusions and future work” section.

Background

In this section, we discuss the computer vision research that we consider relevant from the point of view of addressing our objectives. We will specifically refer to the requirements we set out in the “Introduction” section.

Computer vision for fish classification

The first attempts to apply computer vision to the problem of fish classification were reported in the 1980s by Tayama et al. (1982), who used shape descriptors derived from binary silhouettes to discriminate between 9 fish species with 90% accuracy. Further work combined colour and shape descriptors (Strachan, 1993) achieving a reliability of 100% and 98% in identifying 23 species under laboratory conditions.
It involved a mechanical feeding system to ensure that individual fish are correctly oriented and presented to the camera one-by-one, along with tightly controlled lighting. The author notes potential caveats due to seasonal changes in the physical condition of fish and variability in the colour of individual specimens, depending to some extent on the area in which they are caught. This issue is highly likely to affect our system too. Further work refined approaches for fish species classification using primarily shape and colour features with fuzzy classifiers and neural networks (Hu et al., 1998; Storbeck and Daan, 2001; Alsmadi et al., 2009). White et al. (2006) describe trials of CatchMeter, a sorting machine capable of measuring and classifying fish based on colour and shape features that achieves a fish length measurement accuracy of σ = 1.2 mm and a species classification accuracy for flat- and round-fish of ∼99%. Specimens must be presented individually, but can be in any orientation. Later research investigates colour, shape, and texture features and more advanced classifiers but still requires constrained environments that avoid occlusion. As a consequence, counting individuals is trivial or irrelevant (Hu et al., 2012). However, a recent review of computer vision in aquaculture and processing of fish products identifies a wide range of applications for the technology at all stages of production (Mathiassen et al., 2011; Zion, 2012), many of which present challenging problems for computer vision. Successfully classifying images captured in real-life conditions requires the use of more sophisticated approaches such as non-rigid part models (Chuang et al., 2016). Deep neural network-based feature extractors have been successfully employed for fish species identification on the Fish4Knowledge dataset (Boom et al., 2012), using unsupervised learning to initialize the network layers (Qin et al., 2016; Sun et al., 2016).
More recent work employs deep neural network image classifiers trained in an end-to-end fashion (Zheng et al., 2018), tackling a challenging Kaggle dataset in which equipment and personnel are present in the images, in addition to the fish.

Image classification

In recent years deep neural networks have set a number of state-of-the-art image classification results. A variety of architectures have been proposed (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015) with residual networks (He et al., 2016) combining strong performance with computational efficiency. Practitioners frequently employ transfer learning (Donahue et al., 2014; Long et al., 2015) in which a pre-trained ImageNet classifier (e.g. a residual network) is adapted for a new classification task by replacing the final layer and fine-tuning. It is worth noting that deep neural networks are prone to overfitting (Krizhevsky et al., 2012) and will often exhibit poor performance on data drawn from a different distribution to that on which they are trained. For this reason it is important to maximize the diversity of the training set by using as wide a variety of lighting and image capture conditions as possible. In situations where the annotated training images and evaluation images are drawn from different distributions or sources, the difference between them is referred to as the domain gap. In such situations we expect the network to perform poorly on the target/evaluation domain. The field of domain adaptation (Saenko et al., 2010; French et al., 2018) is aimed at finding solutions to these problems. Typical domain adaptation problems involve learning from annotated synthetic images and unannotated real-life images, with a view to maximizing performance on the real-life data. In surveillance situations where data are obtained from a number of cameras, a small domain gap can be said to exist between the cameras due to the different lighting conditions and perspective of each camera.
Instance segmentation

Image segmentation is the process by which an image is segmented into regions, often on a per-pixel basis. In this work, we focus on instance segmentation as our goal is to locate and isolate individual fish within an image. Instance segmentation algorithms can be divided into two classes based on how they tackle the problem.

The first approach combines semantic segmentation with contour detection. Semantic segmentation (Long et al., 2015; Ronneberger et al., 2015) classifies each pixel according to the type of object covering it (fish, conveyor belt, detritus, etc.). Multiple objects of the same class that touch or overlap will form a contiguous region, as occurs frequently in our CCTV footage when fish overlap. Contour detection (Xie and Tu, 2015) locates edges of objects that are used to guide the Watershed algorithm (Beucher and Meyer, 1993) to split these regions, separating individual objects. This was the approach adopted in our earlier work (French et al., 2015). In practice this is often unreliable. False negatives in the contour predictions result in small gaps that prevent instances from being separated due to the flood-fill based approach of the Watershed algorithm. False-positive contour detections can result in the complementary problem of over-segmentation. Our prior work had to train separate segmentation models for each conveyor belt (due to the aforementioned domain gap) and use carefully tuned post-processing to mitigate this problem.

The second approach to instance segmentation combines object detection and boundary localization. Object detection systems detect and locate objects within an image, typically predicting a bounding box and class category for each detected object. The instance-level segmentation is generated from predicted object boundaries, often in the form of a mask that identifies the regions of the image that belong to the object in question. This is the approach adopted by Mask R-CNN (He et al., 2017).
It combines the Faster R-CNN (Ren et al., 2015) object detection algorithm, an accurate two-stage detector, with a mask prediction module that predicts a low resolution mask (normally 28 × 28 pixels) that is scaled to fit the bounding box and identifies the parts of the image covered by the detected object.

Dataset and data acquisition tools

Marine Scotland provided us with the surveillance footage that was gathered during their CQMS pilot study (Needle et al., 2014). In its raw form it was not suitable to be directly processed by our computer vision system. In this section, we discuss the process that we developed to extract usable image data from the CCTV video that could be annotated, allowing us to train and evaluate the machine learning components of our system. We will discuss the source video material, the project web application, calibration, and segmentation dataset selection and preparation.

Video sources

The surveillance footage was captured in 800p HD resolution and stored in MPEG-4 format. The videos come from five sources: four commercial fishing vessels and one research vessel operated by Marine Scotland. The footage from the commercial vessels captures the real-world working environment and presents challenging conditions, including occlusions by personnel working at the conveyor belt and the view being obscured by spatter on the dome that covers the camera. The footage from the research vessel is similar in terms of content and layout but provides the opportunity to capture tailor-made footage for the purpose of gathering training data. The footage from the commercial vessels consists of the mix of species that was being processed on board the vessel at the time of capture. The footage from the research vessel was specifically produced by Marine Scotland staff by placing large numbers of fish of a known species on the conveyor belt and running it past the camera.
Each video from the research vessel contains fish of a single species; this was done for the purpose of training the species classifier, discussed in the “Training data” section. The footage is summarized in Table 1. Example frames are shown in Figure 1.

Table 1. Summary of video footage.

Vessel      Type         No. of videos   Running time (HH:MM:SS)
Vessel A    Commercial   38              37:30:47
Vessel B    Commercial   23              22:45:41
Vessel C    Commercial   26              20:38:26
Vessel D    Commercial   25              24:26:56
Vessel R    Research     53              6:18:41
Total

Web application

To facilitate collaboration between Marine Scotland and University of East Anglia personnel, a web application was developed using the Django Framework (https://djangoproject.com). The website allows Marine Scotland staff to upload CCTV footage and annotate images for training our computer vision systems (see “Image annotation” section). It was extended to support the inter-observer species identification variability experiment discussed in the “Performance evaluation” section.

Belt extraction and calibration

We simplified the task of processing the footage by extracting a region of interest covering the conveyor belt, thereby excluding equipment, people and the boat interior, as can be seen in Figure 1.
We used a perspective transformation to extract the conveyor belt and transform it into rectilinear space (see Figure 2) with a constant uniform physical distance to image space ratio.

Figure 2. The belt extraction and calibration process: (a) checkerboard on belt, (b) with lens distortion removed, and (c) with perspective warp used to transform belt into rectilinear space and exterior cropped out.

Lens distortion correction

The surveillance cameras on-board fishing vessels frequently use fish-eye lenses to increase field of view. This introduces a curved distortion to the image that complicates later stages of the system. The OpenCV library (Bradski, 2000) provides functionality for automatically estimating lens distortion parameters and removing the distortion from images. The lens distortion estimation algorithm within OpenCV requires that a printed checkerboard pattern is captured at various positions within the camera's field of view. Its corners are detected and their positions are used to estimate the lens distortion. We provided Marine Scotland staff with a checkerboard pattern and a procedure for capturing calibration footage on-board fishing vessels. Extracting the checkerboard from all frames in which it is visible typically results in several hundred detections. The run-time of the lens distortion estimation algorithm scales in a super-linear fashion with respect to the number of detections used, and with all detections it failed to complete within a reasonable time. We opted to select a subset of the detections that significantly differ from one another.
We divide the image into 50 × 50 pixel cells and quantize the coordinates of the checkerboard corners, generating a map that specifies which cells are covered (Histogram2D in the algorithm). If >22% (a threshold determined by trial and error) of the cells covered by this checkerboard have not been covered by a previously selected checkerboard, we add it to our selection. The algorithm is given below:

Algorithm 1. Lens estimation detection selection algorithm

    covered ← booleanArr2D(num_cells_y, num_cells_x)
    selected_dets ← []
    for each det in detections do
        det_coverage ← Histogram2D(det, cell_size)
        if mean(det_coverage ∧ ¬covered) ≥ 22% then
            covered ← covered ∨ det_coverage
            selected_dets.Append(det)
        end if
    end for

Belt warping

The checkerboard used for estimating lens distortion parameters was printed on A3 paper, giving it known physical dimensions. The checkerboard was placed on the conveyor belt and captured as part of the calibration process. The checkerboard localization algorithm within OpenCV is used to find the checkerboard, after which a perspective transformation is estimated to transform the checkerboard into a fixed rectangular size. Applying this transformation to the image of the belt removes the perspective distortion and scales the image of the belt to a known physical distance to image space ratio. A tool was developed within Jupyter Notebook (Kluyver et al., 2016) that allows the user to correct for any misalignment and crop the region corresponding to the belt.

Complete belt extraction process

We use the estimated lens parameters to compute a mapping. For each pixel in the straightened image the mapping provides its coordinates in the distorted image. The perspective transformation used for belt extraction can also be expressed as a mapping. We therefore compute a composite mapping that combines both the distortion removal and perspective transformation in a single step. The composite mapping is generated once and used for each image or frame that must be processed.
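Returning to the detection selection step of Algorithm 1, the following is a minimal Python sketch of the idea. It is not the authors' implementation. It assumes each detection is a list of (x, y) corner positions in pixels, represents the coverage map as a set of grid cells rather than a boolean array, and follows the rule as stated in the prose (the fraction of a detection's own cells that are newly covered must exceed the threshold). The names `covered_cells` and `select_detections` are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Point = Tuple[float, float]  # (x, y) corner position in pixels
Cell = Tuple[int, int]       # (row, col) index of a grid cell


def covered_cells(corners: Iterable[Point], cell_size: int) -> Set[Cell]:
    """Quantize corner coordinates to the set of grid cells they fall in."""
    return {(int(y // cell_size), int(x // cell_size)) for x, y in corners}


def select_detections(
    detections: Sequence[Sequence[Point]],
    cell_size: int = 50,
    min_new_fraction: float = 0.22,
) -> List[Sequence[Point]]:
    """Greedily keep detections that cover enough previously unseen cells."""
    covered: Set[Cell] = set()
    selected: List[Sequence[Point]] = []
    for det in detections:
        cells = covered_cells(det, cell_size)
        if not cells:
            continue  # nothing detected in this frame
        new_cells = cells - covered
        # Keep the detection only if a large enough share of its cells is new.
        if len(new_cells) / len(cells) > min_new_fraction:
            covered |= cells
            selected.append(det)
    return selected
```

Because the selection is greedy, the result depends on the order in which detections are visited; a near-duplicate of an already selected checkerboard position contributes no new cells and is skipped.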
The mapping can be applied to an image using GPU accelerated texture map lookups and typically takes <2 ms on a desktop machine.

Segmentation and species ID training set

A segmentation dataset consisting of still frames extracted from the video footage was required to train and evaluate the segmentation system. The conveyor belt moves in irregular and unpredictable short bursts and is controlled by on-board personnel. We wished to extract frames such that the belt moves by at least half the length of the visible region of the belt to ensure that the content changes sufficiently between frames extracted for the training set. This required a robust estimate of the belt motion. We should note that there is overlap between successive frames, so some individual fish are visible in more than one training set frame.

Belt motion estimation

Extracting the belt from the image and transforming it into rectilinear space simplifies the task of estimating belt motion between frames as its motion is constrained to horizontal translation. A natural choice for this would be enhanced correlation coefficient-based image alignment (Evangelidis and Psarakis, 2008), an implementation of which is provided by OpenCV. Unfortunately, this algorithm is often confused by the repeating texture present on the conveyor belts in our footage. We developed a more robust solution based on correlation of neural network features. While inter-frame correlation between RGB or greyscale pixel data was sufficient to detect motion, it did not accurately quantify it. To precisely quantify the motion we computed the correlation between features extracted using the convolution layers of a pre-trained VGG-16 (Simonyan and Zisserman, 2015) network instead of RGB pixel values. We found that later layers of the network would yield more accurate motion estimates, but at reduced resolution.
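As a rough illustration of correlation-based shift estimation and of the frame selection criterion described above, the sketch below works on one-dimensional intensity profiles along the belt. This is a deliberate simplification: it does not use VGG-16 features, and the function names (`estimate_shift`, `select_frames`) and the exhaustive search over candidate shifts are assumptions made for the example.

```python
from typing import List, Sequence


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    den = (var_a * var_b) ** 0.5
    return num / den if den > 0 else 0.0


def estimate_shift(prev: Sequence[float], curr: Sequence[float], max_shift: int) -> int:
    """Return the shift s that best aligns prev[i] with curr[i + s]."""
    best_shift, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            a, b = prev[: len(prev) - s], curr[s:]
        else:
            a, b = prev[-s:], curr[: len(curr) + s]
        if len(a) < 2:
            continue  # overlap too small to correlate
        score = normalized_correlation(a, b)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift


def select_frames(shifts: Sequence[float], belt_length: float,
                  min_fraction: float = 0.5) -> List[int]:
    """Pick frame indices so the belt travels at least min_fraction of its
    visible length between consecutive picks. shifts[i] is the motion
    between frame i and frame i + 1."""
    selected = [0]
    travelled = 0.0
    for i, shift in enumerate(shifts, start=1):
        travelled += abs(shift)
        if travelled >= min_fraction * belt_length:
            selected.append(i)
            travelled = 0.0
    return selected
```

A repeating belt texture produces several near-equal correlation peaks in a search like this, which is the failure mode that motivates correlating learned features instead of raw pixel values.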
Once correlation using RGB pixel values indicated motion, features were extracted from the pool4, pool3, and pool2 layers of VGG-16. The pool4 feature correlations provided an accurate estimate of motion, but at 1/16 resolution. Correlations between pool3 features, at 1/8 resolution, were then computed, with their output constrained so as to refine the motion estimate from pool4. Further refinements were obtained using features from pool2, after which final refinements were calculated using RGB pixel value correlation. Our implementation uses the pre-trained VGG-16 network provided by the torchvision library that is part of PyTorch (Paszke et al., 2017).

Image annotation

The images selected for segmentation were uploaded to the web application after which they were manually annotated by Marine Scotland staff. Within this application the labelling tool allows the user to draw polygonal annotations and classify them (http://bitbucket.org/ueacomputervision/image-labelling-tool). The user can select from 15 species of fish and several non-fish classes such as person, belt structure, or guts. There are also classes used to indicate unidentifiable fish or material. The labelling tool can be seen in Figure 3.

Figure 3. Web-based segmentation annotation tool.

Manually annotating fish by drawing polygonal labels is a labour-intensive task. We were able to considerably reduce the labelling effort required by partially automating this process. Once between 100 and 200 images had been manually annotated for each belt, we found that a segmentation model trained using these annotations was able to automatically annotate the majority of fish to a satisfactory standard. We generated automatic annotations for as-yet unannotated images and placed them on the website to serve as a starting point for the annotators.
This saved considerable effort as the annotators only needed to annotate the few fish that had been missed or fix mistakes. The improved annotations could then be added to the training set that was used to train a new and more accurate segmentation model, resulting in a cyclic process.

Data

The training data for the segmentation system consists of 902 annotated frames drawn from videos from the five vessels and is summarized in Table 2. While many more frames were extracted, this is the subset that has been annotated so far.

Table 2. Segmentation training set.

Vessel      No. of annotated images   No. of annotated fish
Vessel A    204                       1 459
Vessel B    263                       1 254
Vessel C    145                       1 588
Vessel D    153                       4 809
Vessel R    137                       1 498
Total       902                       10 608

Instance segmentation

An effective instance segmentation algorithm is a pre-requisite to the successful operation of the complete system as later stages rely on accurate detection and segmentation to reliably classify the species of individual fish and estimate length and mass. During the course of the project we experimented with a variety of approaches to solving this problem. Our first attempt used semantic segmentation to identify regions of the image containing fish and subsequently split them into individuals using contour detection. By using a separate segmentation model for each conveyor belt and finely tuned post-processing we were able to achieve some success using VGA resolution footage (French et al., 2015).
This process proved unreliable when applied to higher resolution HD footage as false negatives from the contour detector would prevent separation of individual fish from one another. Mask R-CNN (He et al., 2017) proved to be an effective and efficient instance segmentation algorithm, hence we adopted it for use in our system. We use the COCO (Lin et al., 2014) pre-trained implementation of Mask R-CNN provided by the torchvision library that is developed by the PyTorch (Paszke et al., 2017) team; it produces good results and trains quickly. As stated in the “Background” section, it combines object detection with mask prediction and is therefore much more robust than our previous approach. It generates high quality labels as shown in Figure 4. Furthermore, our segmentation model is trained on images from all vessels simultaneously.

Figure 4. Instance segmentation applied to a frame from footage from Vessel R (research vessel); the outlined shapes are generated automatically.

As stated in the “Image annotation” section, the segmentation system was used to automatically annotate images on the labelling tool section of the project web application, after which mistakes in the annotations could be fixed manually. We maximized the quality of the automatically generated annotations using test-time augmentation (He et al., 2017); each image was segmented eight times, with random augmentation consisting of horizontal and vertical flips, lightening and darkening, scaling and rotation. The resulting predictions were averaged, increasing their accuracy. Doing so comes at significant computational cost, so this is only feasible for offline use when accuracy outweighs run-time performance.
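The averaging step of test-time augmentation can be sketched as follows. This is an illustrative simplification, not the authors' pipeline: it uses only the four deterministic flip combinations instead of eight random augmentations, omits brightness, scale and rotation changes, and assumes a hypothetical `model` callable that maps an image (a nested list of pixel values) to a per-pixel foreground probability map of the same size.

```python
from typing import Callable, List

Image = List[List[float]]


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def tta_predict(model: Callable[[Image], Image], image: Image) -> Image:
    """Average model predictions over flipped copies of the image.

    Each prediction is mapped back to the original orientation with the
    inverse transform before it is accumulated.
    """
    transforms = [
        (lambda x: x, lambda x: x),
        (hflip, hflip),  # a flip is its own inverse
        (vflip, vflip),
        (lambda x: hflip(vflip(x)), lambda x: vflip(hflip(x))),
    ]
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    for forward, inverse in transforms:
        pred = inverse(model(forward(image)))
        for r in range(height):
            for c in range(width):
                total[r][c] += pred[r][c]
    n = len(transforms)
    return [[value / n for value in row] for row in total]
```

The model is evaluated once per augmented copy, so the cost grows linearly with the number of augmentations, which is consistent with the observation above that the technique suits offline use.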
Separate species identification

The object detection network that forms the basis of Mask R-CNN (He et al., 2017) incorporates a classifier that identifies detected objects and a multi-class mask head that learns class-specific shapes for segmentation. In principle this could be used to perform fish detection, segmentation, and species identification in a single pass. In spite of this, we opt to use separate networks, using a single-class Mask R-CNN network for only fish detection and segmentation. We do this for several reasons that we will now explain.

Identifying the species of fish in our surveillance footage requires annotators with the relevant training and experience. In contrast, outlining individual fish for segmentation can be performed by a wide variety of individuals. To support this we allow annotators to outline fish in an image without specifying their species. As a consequence, many images in our dataset have fish outlined for segmentation but with some individuals having no assigned species. Training a multi-class Mask R-CNN model requires per-object class labels to select the class-specific bounding box regressor and mask head to optimize for each object. As a consequence, images with partial species annotation would not be usable for training a multi-class Mask R-CNN model.

Furthermore, as stated in the “Classifier” section, we were able to improve the performance of our classifier by rotating the images of segmented fish so that they lie horizontally, as doing so eliminates a source of irrelevant variation. Mask R-CNN does not provide a mechanism for altering the orientation of objects prior to classification. For these reasons we train our Mask R-CNN model to detect and segment objects of a single fish class and identify species in a subsequent step.
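To make the orientation normalisation mentioned above more concrete, the sketch below estimates the angle of the longest axis of a binary mask from its second-order central moments. It is a stand-in written for illustration and is not the routine used in the paper. The function name `principal_axis_angle` and the angle convention (radians measured from the image x axis, with y pointing down, in the range [−π/2, π/2]) are assumptions of this example and differ from the conventions of other libraries.

```python
import math
from typing import List, Sequence


def principal_axis_angle(mask: Sequence[Sequence[int]]) -> float:
    """Angle of the major axis of the foreground pixels, in radians.

    Uses the standard image-moment formula
    theta = 0.5 * atan2(2 * mu11, mu20 - mu02).
    """
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask has no foreground pixels")
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Second-order central moments of the foreground pixel coordinates.
    mu20 = sum((x - cx) ** 2 for x, _ in points) / n
    mu02 = sum((y - cy) ** 2 for _, y in points) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in points) / n
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
```

Rotating the pixel coordinates of the crop by the negative of this angle about the mask centroid brings the major axis onto the x axis. The moments cannot tell head from tail, so the left/right and up/down ambiguity remains after rotation.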
Training procedure

Data augmentation artificially expands the training set by modifying the existing image samples to increase variability and is frequently used to improve performance (Krizhevsky et al., 2012; He et al., 2016). While training our segmentation network we augment the images using random horizontal and vertical flips, random rotations between −45° and 45°, applying a random uniform scale factor in the range of 0.8–1.25 and randomly modifying the brightness and contrast by multiplying the RGB values by a value drawn from $e^{\mathcal{N}(0,\,\ln(0.1))}$ and adding a value drawn from $\mathcal{N}(0,\,0.1)$. We split our dataset into 90% for training and 10% for validation. We train for 350 epochs with one epoch consisting of the iterations necessary to train using all training images. We report the mean average precision (mAP; Lin et al., 2014) score for the validation samples in our logs. We use the validation score for early stopping; we save the network state for use after the epoch at which it achieved the highest validation mAP score. We use a learning rate of $10^{-4}$ for the new randomly initialized later layers and $10^{-5}$ for the pre-trained layers that come from the torchvision (Paszke et al., 2017) Mask R-CNN implementation. We randomly crop 512 × 512 pixel regions from our rectilinear belt images and build mini-batches of crops from four randomly chosen images during training. We train our models on a single NVIDIA GeForce GTX 1080 Ti GPU.

In addition to the bounding box non-maximal suppression (NMS) used in Mask R-CNN (He et al., 2017) we apply NMS to the masks predicted during inference. If >10% of the pixels predicted as belonging to an object are already occupied by other objects with a higher predicted confidence, the lower scoring object is ignored.

Species identification

In this section we describe our species classifier, the development of the dataset required for training and our evaluation of the performance of our classifier.
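Before turning to the classifier, the mask-level suppression rule described at the end of the “Training procedure” section can be sketched as below. This is an illustrative reading of that rule and not the authors' code: masks are represented as sets of (row, col) pixel coordinates, and only objects that have been kept are treated as occupying pixels. Both choices, and the name `mask_nms`, are assumptions of this example.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Detection = Tuple[float, Set[Pixel]]  # (confidence score, mask pixels)


def mask_nms(detections: List[Detection], max_overlap: float = 0.10) -> List[Detection]:
    """Drop detections whose mask is mostly claimed by higher-scoring ones.

    Detections are visited in order of decreasing confidence. A detection
    is ignored if more than max_overlap of its pixels are already occupied.
    """
    occupied: Set[Pixel] = set()
    kept: List[Detection] = []
    for score, mask in sorted(detections, key=lambda d: d[0], reverse=True):
        if not mask:
            continue  # empty masks carry no information
        overlap = len(mask & occupied) / len(mask)
        if overlap > max_overlap:
            continue
        kept.append((score, mask))
        occupied |= mask
    return kept
```

Unlike box-level NMS, the overlap here is measured relative to the lower-scoring mask only, so a small object lying mostly on top of a larger, more confident one is removed, while the larger object is unaffected.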
Classifier

Our species classifier is a 50-layer residual network (He et al., 2016) adapted and fine-tuned using transfer learning. It operates on images of individual fish that are identified by the instance segmentation system (see “Instance segmentation” section). We found careful pre-processing of images of individual fish to be essential for good classification performance. While the fish in our surveillance footage are arbitrarily oriented, we found that rotating images of individual fish so that they lie horizontally eliminated a source of irrelevant variation, improving accuracy. We used the regionprops function from the Scikit-Image (van der Walt et al., 2014) library to estimate the orientation from the shape/mask predicted for each fish and rotate it so that the longest axis lies horizontally. This ensures that most fish lie horizontally, although they vary in horizontal and vertical direction (left-to-right or right-to-left, upside-down). Given that the masks predicted by the segmentation system are often imperfect, we found that expanding the mask in all directions by seven pixels (using binary dilation) improved performance. Each image was scaled to a constant size of 192 × 192 pixels and centred within a 256 × 256 image. Pixels outside of the mask were set to 0, removing any distracting cues from parts of the image outside the bounds of the fish.

Training data

Our species identification training data is drawn from footage from the commercial vessels and from the research vessel. A summary of the species identification training data broken down by vessel and species is given in Tables 3 and 4.

Table 3. Summary of species identification dataset from commercial vessels.
Table 3. Summary of species identification dataset from commercial vessels.

Vessel      Cod   Haddock   Whiting   Saithe   Hake   Monk
Vessel A     47       109        12      370    116      1
Vessel B     21        89        25        9      9      7
Vessel C     19       229        23       70     31      2
Vessel D     12       258        42       21     16      -
Total        99       685       110      470    172     10

Table 4. Summary of species identification dataset from the research vessel.

               Cod   Haddock   Whiting   Saithe   Hake   Monk   Mackerel
No of fish    1451     12482     14068      861    304      -       1837
No of videos     3        18        13        2      2      -          1

               Horse mackerel   Norway pout   Plaice   Long rough dab   Common dab   Grey gurnard   Red gurnard
No of fish                496          5574     2402             1495         1601           1599            65
No of videos                1             2        3                1            2              3             1
The commercial training samples were drawn from commercial footage and their species was determined manually. This is a time-consuming and laborious process, hence the limited number of commercial samples shown in Table 3. With a view to addressing this, Marine Scotland staff placed large quantities of fish of known species on the research vessel conveyor belt and ran it past the camera. Applying the segmentation system allowed us to extract large numbers of training images of a known species class, resulting in the research training samples summarized in Table 4. This further illustrates the advantage of separating segmentation and species classification into separate steps, as mentioned in the “Separate species identification” section. The commercial training samples were extracted using manually prepared polygonal segmentation, as the annotators used the labelling tool to provide both polygonal segmentation and species identification annotations for commercial images at the same time. In contrast, the majority of the research samples were extracted using boundaries generated by the segmentation system, with test-time augmentation in use. We should note that a system deployed in the field would not use test-time augmentation, as segmenting each image multiple times under differing augmentation parameters incurs significant computational load.
As a consequence, a real-life species classifier would receive slightly lower quality segmentation labels than those used here; however, we believe that, with the increased size of the training set that we are continually growing, this should not be a significant problem in the final application. It should be noted that the complex and unstructured scenes in our CCTV footage frequently feature fish oriented such that useful discriminative features or parts are hidden from view, or fish that are only partially visible because they are occluded by overlapping fish or by personnel working at the belt. Operating under these conditions is one of the challenges posed by this project. Selected examples from each species are shown in Figure 5.

Figure 5. Examples from the species identification dataset. All fish were from the single species research vessel footage, apart from monk which were taken from commercial footage. Samples were chosen to illustrate that the classifier often receives only a partial fish or one whose orientation hides useful details.

Performance evaluation

To understand the performance of our classifier we evaluate it in four scenarios. In our first scenario we train and test the classifier on research samples. Given the large number of available training samples, the uniform lighting and appearance, and the fact that there are typically fewer occlusions than in the commercial footage, we expect this to provide an upper bound for the performance of our classifier. In our second scenario we train and test using commercial samples.
There are considerably fewer training samples available and the conditions are more challenging, so we expect our classifier to overfit the training data to a greater extent and exhibit worse performance. We also add the research samples to the training set to assess their effect. In our third scenario we use leave-one-belt-out cross validation to test on samples from one commercial belt and train on samples from the other commercial belts and the research samples. This scenario is more representative of a system deployed in the field that must operate on samples from a belt that was not in the training set. In our final scenario we train on research samples and test on commercial samples. This is by far the most challenging scenario for the classifier due to the domain gap between the research and commercial belts. It is also the ideal scenario from the perspective of preparing training data due to the reduced annotation effort. In scenarios in which samples from one or more belts are used for both training and testing, we split the samples between train and test using fourfold cross validation. As stated in the “Segmentation and species ID training set” section, individual fish may be seen in multiple successive frames extracted from video footage. We therefore split samples according to the video from which they were drawn (all the samples from a video are placed into either train or test), ensuring that a sample cannot appear in both the training and the test set.

We present the performance of our classifier using a confusion matrix. Each row of the matrix shows the distribution of how samples of that class were predicted and mis-predicted by the classifier. The values along the diagonal give the class accuracies, that is, the proportion of samples belonging to a class that are correctly identified by the classifier. Other entries in the same row show the proportion of samples mis-predicted as belonging to other classes.
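The video-level fold assignment and the row-normalized confusion matrix described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names (`video_folds`, `confusion_rows`, `mean_class_accuracy`) are hypothetical, and the random assignment of videos to folds is one plausible choice, not necessarily the one used in this work.

```python
import numpy as np


def video_folds(video_ids, n_folds=4, seed=0):
    """Assign every sample to a fold so that all samples from one video
    share the same fold (no video is split between train and test)."""
    rng = np.random.default_rng(seed)
    videos = sorted(set(video_ids))
    order = rng.permutation(len(videos))
    fold_of_video = {videos[j]: i % n_folds for i, j in enumerate(order)}
    return np.array([fold_of_video[v] for v in video_ids])


def confusion_rows(y_true, y_pred, n_classes):
    """Return raw counts and row-normalized proportions.
    Row = true class, column = predicted class."""
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1)
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        proportions = np.where(totals > 0, counts / totals, np.nan)
    return counts, proportions


def mean_class_accuracy(proportions, ignore=()):
    """Mean of the diagonal, skipping ignored or unrepresented classes."""
    diag = np.diag(proportions)
    keep = [c for c in range(len(diag))
            if c not in ignore and not np.isnan(diag[c])]
    return float(np.mean(diag[keep]))


# Usage example with three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]
counts, proportions = confusion_rows(y_true, y_pred, n_classes=3)
overall = mean_class_accuracy(proportions)
without_last = mean_class_accuracy(proportions, ignore=(2,))
folds = video_folds(["a", "a", "b", "c", "c", "d", "e", "e"])
```

The `ignore` argument mirrors the way the monk class is left out of some of the mean class accuracies reported later.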
Perfect performance is indicated by 100% along the diagonal and 0% everywhere else.

Train and test on research samples

The research footage covers 13 species out of the 14 considered in this project. We do not consider monk as there are no examples in the research footage. We also skip mackerel, horse mackerel, long rough dab, and red gurnard as these species are only featured in one video each, preventing us from splitting the videos between train and test. When training and testing on research samples we obtain the performance shown in Figure 6. While deep neural network classifiers are effective, problems can arise when attempting to distinguish classes that are broadly visually similar, hence saithe being mis-predicted as haddock and common dab mistaken for plaice. The distribution of the confidence predicted by the classifier does not differ sufficiently between correctly and incorrectly predicted samples to allow one to reliably estimate the correctness of a specific prediction; however, the difference suggests that confidence could be used as a signal to prioritize difficult unannotated samples for manual annotation (Wang and Shang, 2014).

Figure 6. Confusion matrix for research samples, fourfold cross validation.

Train and test on commercial samples

Figure 7a and b show the performance obtained on commercial samples when training using (a) commercial samples and (b) both commercial and research samples. Adding the research samples, of which there are ∼20 times as many as there are commercial samples, incurs the risk of the classifier being dominated by the research samples. Combining these datasets initially appears to degrade performance as the mean class accuracy drops from 59.16 to 56.71%. If we ignore the monk class due to lack of representation in the research samples the mean class accuracy increases from 59 to 62.05%.
Adding the research samples, with their large number of examples of whiting, increases class accuracy, partially compensating for the poor whiting class accuracy in (b) due to the scarcity of whiting in the commercial samples.

Figure 7. Confusion matrices for (a) train and test on commercial (fourfold cross validation), (b) train on research and commercial, test on commercial (fourfold cross validation), and (c) train on research, test on commercial. Without the monk class the mean class accuracies are (a) 59%, (b) 62.05%, and (c) 33.3%.

Leave-one-belt-out cross validation

In practice a system such as the one discussed here would need to be deployed for usage on vessels for which there is no annotated training data. To assess the potential impact on performance in practical scenarios we trained five classifiers, each one on samples from four out of five vessels, with samples from the remaining vessel held out for testing. The results are presented in Figure 8. The large variation in performance evident in (b) and (d) when evaluating on samples from Vessel B and Vessel D indicates per-belt bias in the training samples that needs to be explored further. The reduction in accuracy in comparison to that in Figure 7 illustrates the effect of the domain gap.

Figure 8. Performance when evaluating on samples from one vessel while training on others. Overall performance is the result of computing the sum of the other confusion matrices. Overall mean class accuracy without the under-represented monk class is 52.6%.
Train on research and test on commercial samples

The performance obtained from training with samples from research footage that contains only cod, haddock, whiting, saithe, and hake and testing on the commercial samples is shown in Figure 7c. Comparing the performance between (a) and (c) illustrates the effect of the domain gap: in spite of the fact that there are ∼20 times as many research samples as commercial samples, training using only research samples results in considerably worse accuracy, with significant numbers of samples from all classes being mis-predicted as whiting.

Inter-observer variability experiment

In this section, we describe the species identification inter-observer variability experiment that was designed to measure the accuracy of expert human observers, against which we compare the accuracy of our classifier. Two hundred and fifty images of fish were extracted from the mixed species footage. Their background was darkened and blurred to suppress irrelevant cues and they were oriented horizontally. These images were presented to expert observers in a web-based tool (see Figure 9) that asked them to assign a species and difficulty rating to each image. The species identification tool was integrated into the project web application. It allows users to pan and zoom to focus on fine details. The user may choose a more comfortable orientation using the controls along the top to flip the image or rotate it by 180°.

Figure 9. Inter-observer variability species identification tool as seen by the participants.
We selected fish from the mixed species data as these are representative of real-world conditions. We decided that we needed at least 50 instances of each species used in the experiment to ensure sufficient representation for the purpose of meaningful analysis. Given the class imbalance present in our data (see Table 3) we used the existing species annotations to select samples for the dataset. While these individual fish had been previously annotated by Marine Scotland staff who later participated in this experiment, the samples were originally annotated in the context of a complete image including other fish, the conveyor belt and surroundings, whereas in this experiment the fish were extracted from their surroundings. The requirement of 50 samples per class prevented us from using monk in our assessment due to insufficient availability of samples. Fifty samples were selected from each of the remaining five classes (cod, haddock, whiting, saithe, and hake), hence the dataset contains 250 samples. We should note that observers from Marine Scotland reported that several samples belonged to species that could not be chosen from the five species available. Because we did not anticipate this situation, no option indicating a different species was available, so the observers chose a combination of unidentifiable species with very easy difficulty. This issue persists in our data and would need to be corrected in future experiments.

Expert observer agreement

We present our results in the confusion matrices shown in Figure 10. Each confusion matrix compares the species choices of one observer with the majority choice of the other seven.

Figure 10. Inter-observer agreement confusion matrices. Each confusion matrix compares the species choice of an observer with the majority vote of the other seven observers.
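The comparison of one observer against the majority vote of the remaining observers can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, species are encoded as integer class indices, and ties in the majority vote are broken in favour of the lowest class index, which the text does not specify.

```python
import numpy as np


def majority_of_others(votes, observer, n_classes):
    """votes: (n_samples, n_observers) integer species choices.
    Returns, per sample, the majority choice of all observers except one."""
    others = np.delete(votes, observer, axis=1)
    # Count the votes for each class among the remaining observers.
    counts = np.stack([(others == c).sum(axis=1) for c in range(n_classes)],
                      axis=1)
    return counts.argmax(axis=1)  # assumption: ties go to the lowest index


def observer_confusion(votes, observer, n_classes):
    """Confusion counts: row = majority of the others, column = observer."""
    reference = majority_of_others(votes, observer, n_classes)
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(matrix, (reference, votes[:, observer]), 1)
    return matrix


# Usage example: 3 samples, 4 observers, 3 species.
votes = np.array([[0, 0, 0, 1],
                  [1, 1, 1, 1],
                  [2, 2, 0, 2]])
reference = majority_of_others(votes, observer=3, n_classes=3)
matrix = observer_confusion(votes, observer=3, n_classes=3)
```

Normalizing each row of `matrix` and averaging its diagonal would give a mean class accuracy of the kind reported for each observer.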
The expert observers are largely in agreement with one another, with mean class accuracy scores ranging from 74.4 to 86%, with the exception of observer 6 with a score of 51.4% due to low scores on whiting and hake.

Comparing the classifier with expert observers

We use the majority species choice for each sample in the inter-observer variability dataset as the ground truth for evaluating three classifiers: one trained on single species samples from the research vessel, one trained on the mixed species samples from the commercial vessels and one trained on a combination of both. In each case the samples in the inter-observer variability dataset are held out as test data with other samples used for training. The results are presented in Figure 11. Following the leave-one-belt-out strategy discussed in the “Leave-one-belt-out cross validation” section, we obtain the results in Figure 12.

Figure 11. Classifier predictions in comparison to those of the expert observers.

Figure 12. Classifier predictions in comparison to those of the expert observers; evaluate on samples from one vessel while training on samples from others.

Comparing the agreement between human observers shown in Figure 10 with the performance of the classifier shown in Figure 11 shows that there is a significant gap that must be closed before human accuracy is reached, especially when crossing the domain gap as in Figure 12.
Expert human observers typically score a mean class accuracy of between 74 and 86%, whereas the classifier reaches around 58%, slightly out-performing observer 6, the lowest scoring human observer.

Conclusions and future work

We have discussed the development of a system for analysing and quantifying fish discards from CCTV footage captured on fishing trawlers. It is designed to operate in the challenging real-world conditions present in these environments. The major components of the system are in place. The remaining challenges include length estimation, tracking fish between frames and re-identification to handle situations where fish go out of view temporarily due to occlusion. There is a significant body of work on the topic of person re-identification (Li et al., 2018), some of which could be adapted to this problem. The segmentation system is performing adequately and we believe that its performance will continue to improve as more training data is gathered. The main outstanding challenge is improving the performance of the species classifier. The performance obtained using footage from the research vessel (shown in the “Train and test on research samples” section and Figure 6) demonstrates that effective species classification is possible given sufficient training data. Good performance on commercial samples was achieved for some species provided that training data from all belts was used (see Figure 7a and b). We believe that growing the number of annotated commercial samples will further improve performance, bringing it closer to that obtained on the research footage. This would, however, involve considerable manual effort. This effort could be supported by improving the user interface of the annotation tools. We also note that active learning offers the possibility of estimating the difficulty of unannotated samples and using it to prioritize them for manual annotation, optimizing the use of the annotators’ time.
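As a hedged illustration of how predicted confidence might be used to prioritize unannotated samples, the sketch below ranks samples by the classifier's highest softmax probability, least confident first. Least-confidence ranking is an assumption made here for illustration; it is not necessarily the criterion proposed by Wang and Shang (2014), and the function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def rank_for_annotation(probabilities):
    """Return sample indices ordered from least to most confident,
    where confidence is the highest predicted class probability."""
    confidence = probabilities.max(axis=1)
    return np.argsort(confidence, kind="stable")


# Usage example: three unannotated samples, three species.
probabilities = np.array([[0.90, 0.05, 0.05],
                          [0.40, 0.35, 0.25],
                          [0.60, 0.30, 0.10]])
queue = rank_for_annotation(probabilities)
```

Samples at the front of `queue` are those the classifier is least sure about and would be shown to annotators first.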
The single species research footage proved to be a highly effective approach for gathering a large number of labelled training samples in an efficient manner, although it had the disadvantage of having relatively uniform lighting and visual characteristics. The effect of the domain gap can be seen by comparing the results presented in Figure 7a and c. An avenue we intend to explore with Marine Scotland staff involves the use of an on-shore conveyor belt that affords us the opportunity to change the belt material and appearance and modify the lighting to increase the diversity of visual characteristics expressed by the dataset. If this results in sufficient accuracy, it would support the efficient production of large quantities of annotated training samples.

Fine-grained classification is a field of on-going research aimed at developing classifiers that can distinguish between classes of objects whose overall appearance is very similar, with only subtle or small differences differentiating them. Effective fine-grained classifiers locate regions of an image, often bounding boxes, that are likely to be discriminative (Yang et al., 2018; Guo and Farrell, 2019). Such classifiers could be well suited to the problem of fish species identification.

We can conclude that the use of computer vision to quantify fish discards from surveillance footage is feasible with current state-of-the-art algorithms.

Acknowledgements

We would like to thank James Dooley, Charlotte Altass, Luisa Barros, Lauren Clayton, and Anastasia Moutaftsi from Marine Scotland and Rebecca Skirrow from CEFAS for participating in our species identification inter-observer variability experiment. We would like to thank nVidia corporation for their generous donation of a Titan X GPU.
Funding

This work was funded under the European Union Horizon 2020 SMARTFISH project, Grant Agreement No. 773521.

References

Alsmadi M. K. S., Omar K. B., Noah S. A., Almarashdah I. 2009. Fish recognition based on the combination between robust feature selection, image segmentation and geometrical parameter techniques using artificial neural network and decision tree. International Journal of Computer Science and Information Security, 2: 215–221.
Beucher S., Meyer F. 1993. The morphological approach to segmentation: the watershed transformation. Mathematical morphology in image processing. Optical Engineering, 34: 433–481.
Boom B. J., Huang P. X., He J., Fisher R. B. 2012. Supporting ground-truth annotation of image datasets using clustering. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp. 1542–1545. IEEE.
Bradski G. 2000. OpenCV. Dr. Dobb’s Journal of Software Tools. https://opencv.org/.
Chuang M.-C., Hwang J.-N., Williams K. 2016. A feature learning and object recognition framework for underwater fish images. IEEE Transactions on Image Processing, 25: 1862–1872.
Donahue J., Jia Y., Vinyals O., Hoffman J., Zhang N., Tzeng E., Darrell T. 2014. Decaf: a deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647–655.
Evangelidis G. D., Psarakis E. Z. 2008. Parametric image alignment using enhanced correlation coefficient maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30: 1858–1865.
French G., Fisher M., Mackiewicz M., Needle C. 2015. Convolutional neural networks for counting fish in fisheries surveillance video. In Proceedings of Machine Vision of Animals and Their Behaviour Workshop at the 26th British Machine Vision Conference.
French G., Mackiewicz M., Fisher M. 2018. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations. https://openreview.net/forum?id=rkpoTaxA-.
Guo P., Farrell R. 2019. Aligned to the object, not to the image: a unified pose-aligned representation for fine-grained recognition. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1876–1885. IEEE.
He K., Gkioxari G., Dollár P., Girshick R. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. IEEE.
He K., Zhang X., Ren S., Sun J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hu B. G., Gosine R., Cao L. X., de Silva C. 1998. Application of a fuzzy classification technique in computer grading of fish products. IEEE Transactions on Fuzzy Systems, 6: 144–152.
Hu J., Li D., Duan Q., Han Y., Chen G., Si X. 2012. Fish species classification by color, texture and multi-class support vector machine using computer vision. Computers and Electronics in Agriculture, 88: 133–140.
Kluyver T., Ragan-Kelley B., Pérez F., Granger B. E., Bussonnier M., Frederic J., Kelley K. et al. 2016. Jupyter Notebooks – a publishing format for reproducible computational workflows. In 20th International Conference on Electronic Publishing, pp. 87–90.
Krizhevsky A., Sutskever I., Hinton G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
Li M., Zhu X., Gong S. 2018. Unsupervised person re-identification by deep learning tracklet association. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 737–753.
Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C. L. 2014. Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer.
Long J., Shelhamer E., Darrell T. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
Mathiassen J. R., Misimi E., Bondø M., Veliyulin E., Østvik S. O. 2011. Trends in application of imaging technologies to inspection of fish and fish products. Trends in Food Science & Technology, 22: 257–275.
Needle C. L., Dinsdale R., Buch T. B., Catarino R. M. D., Drewery J., Butler N. 2014. Scottish science applications of remote electronic monitoring. ICES Journal of Marine Science, 72: 1214–1229.
Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z. et al. 2017. Automatic Differentiation in PyTorch. Neural Information Processing Systems Autodiff Workshop, Long Beach, CA, USA.
Qin H., Li X., Liang J., Peng Y., Zhang C. 2016. Deepfish: accurate underwater live fish recognition with a deep architecture. Neurocomputing, 187: 49–58.
Ren S., He K., Girshick R., Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28: 91–99.
Ronneberger O., Fischer P., Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 234–241. Springer.
Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z. et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115: 211–252.
Saenko K., Kulis B., Fritz M., Darrell T. 2010. Adapting visual category models to new domains. In European Conference on Computer Vision, pp. 213–226. Springer.
Simonyan K., Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
Storbeck F., Daan B. 2001. Fish species recognition using computer vision and a neural network. Fisheries Research, 51: 11–15.
Strachan N. J. C. 1993. Recognition of fish species by colour and shape. Image and Vision Computing, 11: 2–10.
Sun X., Shi J., Dong J., Wang X. 2016. Fish recognition from low-resolution underwater images. In 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pp. 471–476. IEEE.
Tayama I., Shimdate M., Kubuta N., Nomura Y. 1982. Application of optical sensor for fish sorting. Refrigeration, 57: 1146–1150.
van der Walt S., Schönberger J. L., Nunez-Iglesias J., Boulogne F., Warner J. D., Yager N., Gouillart E. et al. 2014. scikit-image: image processing in Python. PeerJ, 2: e453.
Wang D., Shang Y. 2014. A new active labeling method for deep learning. In 2014 International Joint Conference on Neural Networks (IJCNN), pp. 112–119. IEEE.
White D. J., White C. J., Svellingen C., Strachan N. C. J. 2006. Automated measurement of species and length of fish by computer vision. Fisheries Research, 80: 203–210.
Xie S., Tu Z. 2015. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403.
Yang Z., Luo T., Wang D., Hu Z., Gao J., Wang L. 2018. Learning to navigate for fine-grained classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 420–435.
Zheng Z., Guo C., Zheng X., Yu Z., Wang W., Zheng H., Fu M. et al. 2018. Fish recognition from a vessel camera using deep convolutional neural network and data augmentation. In 2018 OCEANS-MTS/IEEE Kobe Techno-Oceans (OTO), pp. 1–5. IEEE.
Zion B. 2012. The use of computer vision technologies in aquaculture – a review. Computers and Electronics in Agriculture, 88: 125–132.

© International Council for the Exploration of the Sea 2019. All rights reserved. For permissions, please email: [email protected]. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)