The recognition of patterns that have a time dependency is common in areas like speech recognition or natural language processing. The equivalent situation in image analysis is present in tasks like text or video recognition. Recently, Convolutional Recurrent Neural Networks (CRNN) have been broadly applied to solve these tasks in an end-to-end fashion with successful performance. However, their application to Optical Music Recognition (OMR) is not so straightforward, due to the presence of different elements sharing the same horizontal position, disrupting the linear flow of the timeline. In this paper, we study the ability of the state-of-the-art CRNN approach to learn codes that represent this disruption in homophonic scores. In our experiments, we study the lower bounds of the recognition task on real scores when the models are trained with synthetic data. Two relevant conclusions are drawn: (1) our serialized ways of encoding the music content are appropriate for CRNN-based OMR; (2) the learning process is possible with synthetic data, but there exists a glass ceiling when recognizing real sheet music.

Keywords Optical music recognition · Deep learning · End-to-end recognition · Music encoding

1 Introduction

Music amounts to a language used and understood worldwide. It is an art that has been crossing borders since its inception, being one of the main cultural manifestations of the human being. It is for this reason that, over the centuries, there has been a need to preserve its content in the best possible way, whether in cathedrals, libraries, or historical archives. However, access to these documents is often limited, since continued use may end up damaging them irretrievably.

There exist multiple projects and organizations whose purpose is to comprehensively document extant historical sources of music all over the world, such as the following:

• International Music Score Library Project (https://imslp.org)
• Répertoire International des Sources Musicales (RISM) (http://www.rism.info)
• Choral Public Domain Library (http://www.cpdl.org)
• Mutopia (https://www.mutopiaproject.org)
• Classical Archives Collection (https://www.classicalarchives.com)
• OpenScore (https://openscore.cc)
• Cantus Manuscript Database (http://cantus.uwaterloo.ca)

All of them are making a great effort to digitize musical scores into images, allowing their collections to be accessible as images through the Internet. But for those musical documents to be truly accessible, they must be transcribed into a digital format that enables tasks such as indexing, editing, or critical publication. This process is often done manually and is costly and tedious. Score editing tools are complex to use, which makes the process prone to introducing errors; thereby, several rounds of review are needed to approve a transcript as a good one.

Fig. 1 By using OMR techniques, the content in a digitized image can be encoded in a symbolic format

Fig. 2 Differences in reading between text and music: (a) the recognition process of text follows a linear left-to-right flow; (b) the recognition process of music does not follow a linear left-to-right flow
In certain scenarios, such as those related to ancient musical documents, there may not even be suitable tools for that. All of this entails a great deal of work that is not feasible on a large scale. That is why it is very important to find technologies capable of revaluing all the existing musical heritage.

A promising alternative that would overcome the previous challenge is the use of automatic recognition techniques. Here is where Optical Music Recognition (OMR) comes in. OMR is the field of research that investigates how to computationally read music notation in documents [9, 16].

As seen in Fig. 1, a digitized image can be converted into encoded content automatically by means of OMR. This encoded content is the digital transcription (in terms of music notation symbols) of the score. Thus, an effective OMR system enables the study of existing musical documents for the digital humanities. Not only that: it is the only alternative capable of doing so at reasonable time and cost.

OMR has been an active research field for decades [4, 32]. Traditional approaches to OMR are based on the usual pipeline of sub-tasks that characterizes many artificial vision systems, adapted to this particular task: document pre-processing [7, 29], including staff-line removal [10, 15]; symbol classification [28, 31]; reconstruction of the music notation [27, 30]; and output encoding in a suitable symbolic format.

In recent years, there has been a paradigm shift toward the use of Machine Learning (ML) techniques. These techniques make it possible to design flexible and versatile OMR systems capable of solving a wide variety of problems. This is due to the relationship between the purposes of both fields: ML studies how to make machines learn to perform certain tasks, which is exactly what OMR seeks, namely to teach machines to perform the task of reading musical scores.

Recent advances in ML, namely Deep Learning (DL), which have achieved great results in several visual challenges [24], allow us to be optimistic about developing more accurate and effective OMR systems. The current trend is the use of end-to-end (or holistic) systems that treat the process as a single step instead of explicitly performing the sub-tasks. Using this approach, training pairs only have to contain the input image and its complete transcription [5, 13], bypassing in particular the need to annotate the exact positions of individual symbols.

These approaches typically rely on Convolutional Recurrent Neural Networks (CRNN), which can only formulate the output as a one-dimensional sequence. This perfectly fits natural language tasks (text or speech recognition, or machine translation), since their outputs mostly consist of character (or word) sequences (see Fig. 2a). However, its application to music notation is not so straightforward, due to the presence of different elements sharing the same horizontal position and to long-term dependencies. The vertical distribution of these elements disrupts the linear flow of the timeline (see Fig. 2b). This fact is not trivial to encode and can cause significant difficulties for recognition systems that make use of the temporal relationships between the recognized elements.

The problem can be drastically simplified by considering that the process works with each staff independently from the others, analogously to text recognition systems that decompose the document into a series of independent lines [21]. This is not a strong assumption, as there are successful algorithms for identifying staves [17]. Even so, we still have to deal with elements that take place simultaneously on the "time" line, like the notes that make up a chord, irregular groups, or expression marks, to name a few.
Within the range of music score complexities, one possible simplification of the problem, applicable to much sheet music, is to assume a homophonic music context. In that case, there are multiple parts, but they move with the same rhythm. This way, multiple notes can occur simultaneously, but only as a single voice. Therefore, all the notes starting at the same time last the same, so the score can be segmented into vertical slices that may contain one or more music symbols (see Fig. 3).

Fig. 3 Homophonic music: all the notes starting at the same time have the same duration

Even in this simplified context, there is a need for a clear and structured output coding that avoids the ambiguities that a linear output representation can show in the presence of vertical structures in the data (see Fig. 4). This has also been stated in some previous works [6].

Fig. 4 Example of ambiguities that might appear when music symbols belong to a vertical distribution. The two notes that appear together must be played at the same time, but a linear symbol sequence without specific marks cannot be interpreted unambiguously

There already exist several structured formats for music representation and coding, like the XML-based music formats [18, 22], which focus on how the score has to be encoded in order to properly store all its content. This orientation makes them unsuitable as the output of an optical recognition system, because the code is full of markings that are irrelevant for a system that graphically analyzes the score: an OMR system is primarily interested in what symbols are present and where they are. Furthermore, these XML-based music languages do not represent sequential data but rather hierarchical structures, so they are not suitable output formats for DL-based OMR.

Due to that, we have designed a specific coding language to represent the output of end-to-end OMR, based on serializing the music symbols found in a staff of homophonic music. The sequential nature of music reading must be compatible and unambiguous with respect to the representation of vertical distributions. In addition, this representation has to be easy for the system to generate, as it analyzes the input sequentially and produces a linear series of symbols.

Preliminary research has been carried out to validate whether this represents a feasible research avenue [2]. However, these serialized ways of encoding the music content have never been tested with real data. This work aims to fill that gap and evaluate the problem properly. We also study the possible boundaries of learning with synthetic data when recognizing real music scores.

The rest of the paper is organized as follows: Sect. 2 overviews the state-of-the-art recognition framework based on DL, including the ad hoc serializations for homophonic music; Sect. 3 introduces the experimental setup; Sect. 4 shows the results obtained with real homophonic sheet music; finally, Sect. 5 concludes the present work, along with some ideas for future research.

2 Recognition framework

To carry out the OMR task in an end-to-end manner, we follow the state-of-the-art approach based on Convolutional Recurrent Neural Networks (CRNN). These neural architectures allow us to model the posterior probability of generating output symbols given an input image. Input images are assumed to be single-staff sections, analogously to text recognition systems that assume independent lines [21]. As mentioned before, staves can be easily isolated by means of existing methods [17].

A CRNN consists of a block of convolutional layers followed by a block of recurrent layers [33]. The convolutional block is responsible for learning how to process the input image, that is, for extracting the image features that are relevant to the task at hand, so that the recurrent layers can interpret these features in terms of sequences of musical symbols. In this work, the recurrent layers are implemented as Bidirectional Long Short-Term Memory (BLSTM) units [19].

The unit activations of the last convolutional layer can be seen as a sequence of feature vectors representing the input image x. These features are fed to the first BLSTM layer, and the unit activations of the last recurrent layer are considered estimates of the posterior probabilities for each vector:

$$P(\sigma \mid x, f), \quad 1 \le f \le F, \quad \sigma \in \Sigma \tag{1}$$

where F is the number of feature vectors of the input sequence and Σ is the set of considered symbols. Note that Σ must include a "non-character" symbol that acts as a separator when two or more instances of the same musical symbol appear consecutively [19].

Since both the convolutional and the recurrent blocks can be trained through gradient descent using the well-known backpropagation algorithm [34], a CRNN can be trained jointly. However, a conventional end-to-end OMR training set only provides, for each staff image, its corresponding transcription, without any explicit information about the location of the symbols in the image. It has been shown that the CRNN can be conveniently trained without this information by using the so-called Connectionist Temporal Classification (CTC) loss function [20]. The resulting CTC training procedure is a form of Expectation-Maximization: CTC provides a means to optimize the CRNN parameters so that the network is likely to give the correct sequence for a given input [20].

Once the CRNN has been trained, an input staff image can be decoded into a sequence of music symbols $\hat{s} \in \Sigma^{*}$. First, the most probable symbol per frame is computed:

$$\hat{\sigma}_i = \operatorname*{arg\,max}_{\sigma \in \Sigma} P(\sigma \mid x, i), \quad 1 \le i \le F \tag{2}$$

Then, a pseudo-optimal output sequence is obtained as

$$\hat{s} = \operatorname*{arg\,max}_{s \in \Sigma^{*}} P(s \mid x) \approx \mathcal{D}(\hat{\sigma}_1, \ldots, \hat{\sigma}_F) \tag{3}$$

where D is a function that first merges all the consecutive frames with the same symbol and then deletes the "non-character" symbols [19].
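The function D is simple to realize in code. The sketch below is illustrative only (the paper does not publish this routine); it assumes the posteriors are given as a NumPy array and that the "non-character" separator has index 0:

```python
import numpy as np

BLANK = 0  # assumed index of the CTC "non-character" separator symbol

def greedy_ctc_decode(posteriors: np.ndarray) -> list:
    """Greedy CTC decoding: take the argmax symbol per frame (Eq. 2),
    merge consecutive repeats, and drop separators (the function D, Eq. 3).
    posteriors has shape (F, |Sigma|): one row of P(sigma | x, f) per frame."""
    best_per_frame = posteriors.argmax(axis=1)
    decoded, prev = [], None
    for sym in best_per_frame:
        # Emit only when the symbol differs from the previous frame
        # (merging repeats) and is not the separator (deleting blanks).
        if sym != prev and sym != BLANK:
            decoded.append(int(sym))
        prev = sym
    return decoded
```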
A graphical scheme of the framework explained above is given in Fig. 5.

Fig. 5 Graphical scheme of the CRNN considered for the end-to-end approach. The network is trained by using the CTC loss function

This framework follows the architecture first applied in the work of Shi et al. [33] and later tuned by Calvo-Zaragoza et al. [11]. As stated in the latter work, its expressiveness can be sufficient when working with simple scores in which all the symbols follow a single left-to-right order. However, we want to extend these approaches so that they are able to model richer scores, such as those of homophonic sheet music. In such a case, issues like chords may appear, where several symbols share a horizontal position. As seen in Fig. 4, a one-dimensional sequence is not expressive enough for this situation. That is why, in the next section, we describe our representation proposals for performing end-to-end OMR on homophonic scores.

2.1 Serialization proposals

The research developed in this paper involves the study of four different deterministic, unambiguous, and serialized representations to encode the kind of scenarios that happen in homophonic music, so that the OMR system becomes more effective when recognizing complex music score images. For that, we propose four different types of music representation that differ not in the encoding of the musical symbols itself, but in the way horizontal and vertical distributions of the musical symbols are represented. The grammar for these musical codifications must be unambiguous, allowing us to analyze a given document in only one way.

Our representation does not make assumptions about the musical meaning of what is represented in the document being analyzed; that is, the elements are identified in a catalogue of musical symbols by their shape and by where they are placed on the staff. This has been referred to as an "agnostic representation," as opposed to a semantic representation, where music symbols are encoded according to what they represent in terms of music notation [11]. This difference is illustrated in Fig. 6.

Fig. 6 Semantic and agnostic representations, respectively, of the two eighth notes forming a beam group (highlighted in red). The agnostic representation only provides graphic information, that is, the line or space of the staff on which the note is placed, or the direction of its beam, as opposed to the musical information (pitch and type of note) of a semantic representation (colour figure online)

As mentioned before, the only difference between the four proposed musical codes is how the horizontal and vertical dimensions are represented. Each of the four codes has one or two characters that indicate whether, when transcribing the score, the system should move forward (left to right) or downward (top to bottom). These characters are referred to as separators. The four proposed codes are described as follows:

• Remain-at-position character code: when transcribing the score, the different musical symbols are assumed to be placed left to right, except when they are in the same horizontal position. In that case, they are separated by a slash, "/". This acts as a remain-at-position character, meaning that the system has to advance downward (see Fig. 7b). This behavior is similar to the backspace of typewriters: the carriage advances after typing and, if we want to align two symbols, we need to keep the carriage in a fixed position by moving it back one position.
• Advance-position character code: this type of coding uses a "+" sign to force the system to advance forward. This way, when that sign is missing, the output does not move forward and a vertical distribution is being coded (see Fig. 7c).
• Parenthesized code: when a vertical distribution appears in the score, the system outputs a parenthesized structure, like vertical.start musical_symbol ... musical_symbol vertical.end (see Fig. 7d).
• Verbose code: this last coding is a combination of the first two. It uses the "+" sign as the advance-position character to indicate that the system has to move forward, and the "/" sign as the remain-at-position character to indicate that the system has to advance downward (see Fig. 7e). Here, every two adjacent symbols are explicitly separated by a symbol indicating whether the system must remain in the same horizontal position or advance to the next one.

Fig. 7 Musical excerpt presenting a number of different situations where vertical distributions occur, and its transcription using the proposed codifications:
(a) Musical excerpt.
(b) Remain-at-position character code: clef.G:L2 accidental.flat:S3 digit.2:L4 / digit.4:L2 rest.eighth:L3 dot:S3 note.quarter:S2 / slur.start:S2 note.sixteenth:S2 / slur.end:S2 verticalLine:L1 note.quarter:L3 / note.quarter:L2 / note.quarter:L1 note.beamedRight:S2 note.beamedLeft:L2 verticalLine:L1
(c) Advance-position character code: clef.G:L2 + accidental.flat:S3 + digit.2:L4 digit.4:L2 + rest.eighth:L3 + dot:S3 + note.quarter:S2 slur.start:S2 + note.sixteenth:S2 slur.end:S2 + verticalLine:L1 + note.quarter:L3 note.quarter:L2 note.quarter:L1 + note.beamedRight:S2 + note.beamedLeft:L2 + verticalLine:L1
(d) Parenthesized code: clef.G:L2 accidental.flat:S3 vertical.start digit.2:L4 digit.4:L2 vertical.end rest.eighth:L3 dot:S3 vertical.start note.quarter:S2 slur.start:S2 vertical.end vertical.start note.sixteenth:S2 slur.end:S2 vertical.end verticalLine:L1 vertical.start note.quarter:L3 note.quarter:L2 note.quarter:L1 vertical.end note.beamedRight:S2 note.beamedLeft:L2 verticalLine:L1
(e) Verbose code: clef.G:L2 + accidental.flat:S3 + digit.2:L4 / digit.4:L2 + rest.eighth:L3 + dot:S3 + note.quarter:S2 / slur.start:S2 + note.sixteenth:S2 / slur.end:S2 + verticalLine:L1 + note.quarter:L3 / note.quarter:L2 / note.quarter:L1 + note.beamedRight:S2 + note.beamedLeft:L2 + verticalLine:L1

Note that the codes represent the data unambiguously; thus, it is possible to deterministically translate from any encoding to any other.
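As an illustration of this property, the following sketch converts an advance-position sequence into the parenthesized code. The helper is ours (hypothetical, not part of the published generator); token names follow Fig. 7:

```python
def advance_to_parenthesized(tokens: list) -> list:
    """Convert advance-position code into parenthesized code.

    In the advance code, "+" moves to the next horizontal position, so the
    symbols between two "+" signs form one vertical group. In the
    parenthesized code, groups of two or more symbols are wrapped in
    vertical.start ... vertical.end markers."""
    out, group = [], []

    def flush(g):
        if len(g) > 1:
            out.extend(["vertical.start", *g, "vertical.end"])
        else:
            out.extend(g)

    for tok in tokens:
        if tok == "+":
            flush(group)
            group = []
        else:
            group.append(tok)
    flush(group)
    return out

# The time-signature digits of Fig. 7 share a horizontal position:
print(advance_to_parenthesized(
    ["clef.G:L2", "+", "accidental.flat:S3", "+", "digit.2:L4", "digit.4:L2"]))
# ['clef.G:L2', 'accidental.flat:S3',
#  'vertical.start', 'digit.2:L4', 'digit.4:L2', 'vertical.end']
```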
3 Experimental setup

In this section, we present the synthetic training dataset, the real RISM test dataset, the neural model architecture, and the evaluation protocols used.

3.1 Synthetic data: corpus generation

As introduced above, the current trend for the development of OMR systems is to use DL techniques able to infer the transcription from correct examples of the task, namely a set of pairs (x, s). Given the complexity of music notation, for these techniques to produce satisfactory results it is necessary to use a training set of sufficient size. To achieve this, a system for the automatic generation of labeled data has been developed [1] using algorithmic composition techniques [26]. The developed system provides two outputs: the expected transcription of the generated score in any of the encodings described in Sect. 2.1, and the corresponding score image, artificially distorted as in [12]. With both outputs, the necessary pairs for the DL algorithm are obtained. The code developed in this work is publicly available for reproducible research at https://github.com/mariaalfaroc/ScoreGenerator.git.

For the generation system, three different methods of algorithmic composition are used to obtain compositions with diverse musical features. Furthermore, the range of pitches that can be coded for each clef is limited in order to achieve scores with as much musical coherence as possible; after all, in music scores there is, for each clef, a range of pitches that is more common than others. A range of 22 different pitches is chosen for each clef (see Fig. 8). The pitch series defined for each clef are major or minor diatonic scales in certain keys. The clefs that can be encoded by the system are: G1-clef, G2-clef, F4-clef, C1-clef, C2-clef, C3-clef, and C4-clef.

Fig. 8 Range of the 22 pitches coded for the 3 system clefs considered: (a) G2-clef, (b) F4-clef, (c) C3-clef

The score generator creates one musical event at a time. Such an event might be a sound (by default, 90% of the time) or a silence (10% of the time). The type of sound or silence event is chosen from a catalog of possible music symbols (from sixteenth to whole notes, as well as silences, beamed notes, chords comprising up to a maximum of 3 notes, triplets, etc.) following a random process. The pitch of the sound events is conditioned by one of the three possible algorithmic composition methods described below. Only a general outline of the score generator is given, as more specific details are beyond the scope of this work; for complete details about the implementation, please refer to [1].
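A minimal sketch of that event loop is given below. The catalog subset, the function name next_event, and the pitch_method callback are our own illustrative choices; only the 90%/10% sound/silence split comes from the description above:

```python
import random

# Illustrative subset of the catalog of possible music symbols.
SOUND_EVENTS = ["note.whole", "note.half", "note.quarter", "note.eighth",
                "note.sixteenth", "beamed.group", "chord.3notes", "triplet"]
SILENCE_EVENTS = ["rest.whole", "rest.half", "rest.quarter", "rest.eighth"]

def next_event(pitch_method, p_sound=0.9):
    """Draw one musical event: a sound with probability p_sound (90% by
    default), a silence otherwise. Sound events receive a pitch in [0, 21]
    from the chosen algorithmic composition method."""
    if random.random() < p_sound:
        return random.choice(SOUND_EVENTS), pitch_method()
    return random.choice(SILENCE_EVENTS), None
```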
3.1.1 Random generation according to the normal distribution

The normal distribution (also known as the Gaussian or Laplace-Gauss distribution) is a probability distribution of a continuous variable. It is of great application in the fields of engineering, physics, and the social sciences because it allows numerous natural, social, and psychological phenomena to be modeled. The normal distribution is defined in Eq. 4, where 0 and N represent the extreme values of the domain, being, in our case, 0 and 21, respectively: integers associated with specific pitches depending on the clef used. This way, the range of pitches associated with each clef links each pitch to an integer in [0, 21], so that 0 maps to the lowest pitch of the range and 21 to the highest pitch.

$$f_{[0,N]}(x) = \mathcal{N}(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{4}$$

We use the N(10.5, 6.5) distribution, which provides a symmetric distribution centered on the mean, corresponding to the space surrounding the central line of the staff. The mean defines the location of the peak for normal distributions but, in our case, since the mean is a decimal value, it is not associated with any pitch; therefore, the highest probabilities are given both to the value rounded up (third line of the staff) and to the value rounded down (third space of the staff). The probability is minimum at the extreme values, being f(0) = f(N) = 0.0166, i.e., a 1.66% probability of occurrence for those pitches. This is illustrated in Fig. 9: the graph of the density function has a bell shape and is symmetric with respect to the mean. This curve is known as the Gaussian bell and is the graph of a Gaussian function.

Fig. 9 Gaussian distribution for the automatic generation of music data

Fig. 10 shows an example of a musical excerpt created by the automatic generation system when the normal distribution composition method is used.

Fig. 10 Music snippet created by the automatic generation system using the normal distribution (N(10.5, 6.5)) composition method

3.1.2 Random walk

A random walk is a mathematical formalization of the trajectory that results from taking successive random steps. In this system, the random walk always starts at the central pitch of the pitch range determined for the system. There are three possible random steps (all equally likely) after emitting a pitch:

1. One step forward: the pitch that follows is the next higher in the defined pitch series.
2. One step backward: the pitch that follows is the next lower in the defined pitch series.
3. No step: the pitch that follows is the same as the current one.

There are situations in which moving forward or backward is not allowed, because the pitch to be coded would be outside the established range. In these situations, two solutions are given:

• Reflective limit: as its name indicates, it works like a mirror, making the step taken the reflection of what was initially intended. If the movement was intended to be forward, it will now be backward, and vice versa. That is, the pitch of the current note is the second of the range, starting either from the upper limit or from the lower one, as appropriate.
• Absorbing limit: the pitch of the current note is that corresponding to the upper or lower limit of the pitch range, as appropriate.

The solution is chosen at random, both being equally probable.

Fig. 11 shows an example of a musical excerpt created by the automatic generation system when the random walk composition method is used.

Fig. 11 Music snippet created by the automatic generation system using the random walk composition method
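Gathering the three steps and the two boundary policies, a sketch of this pitch generator could look as follows (our own reading of the description; the released generator [1] may differ in its details):

```python
import random

LOW, HIGH = 0, 21  # integer pitch range shared by all clefs

def random_walk_pitch(current: int) -> int:
    """One random-walk step over the 22-pitch range: forward, backward,
    or no step, all equally likely. Out-of-range moves are resolved by a
    reflective or an absorbing limit, chosen with equal probability."""
    step = random.choice([+1, -1, 0])
    nxt = current + step
    if LOW <= nxt <= HIGH:
        return nxt
    if random.random() < 0.5:            # reflective limit: mirror the step
        return current - step
    return HIGH if nxt > HIGH else LOW   # absorbing limit: clamp to the bound
```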
3.1.3 Sonification of the logistic equation

The logistic equation is defined by Eq. 5:

$$x_{n+1} = r\,x_n\,(1 - x_n), \quad n = 0, 1, 2, 3, \ldots \tag{5}$$

This equation defines an iteration starting from an initial value x0, where the parameter r takes a value between 0 and 4. The resulting values will always lie in [0, 1]. When this method is used, the value of r should be between 3.5 and 4, since this is the range of values for which the most interesting note sequences are generated. The sequences for 3 ≤ r ≤ 3.5 produce repetitions of 2 or 4 pitches, with no variability, and values of r < 3 generate constant-pitch (unison) sequences after a short transition period.

Fig. 12 shows an example of a musical excerpt created by the automatic generation system when the sonification of the logistic equation composition method is used.

Fig. 12 Music snippet created by the automatic generation system using the sonification of the logistic equation (r = 3.75) composition method
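A minimal sonification sketch follows. The paper specifies only the recurrence and the useful range of r; the linear quantization of each iterate onto the 22-pitch range is our assumption:

```python
def logistic_pitches(n: int, r: float = 3.75, x0: float = 0.5) -> list:
    """Iterate x_{n+1} = r * x_n * (1 - x_n) and quantize each value in
    [0, 1] onto the integer pitch range [0, 21]. For 3.5 <= r <= 4 the
    sequence is chaotic, yielding disjunct (skipwise) melodic motion."""
    pitches, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        pitches.append(round(x * 21))
    return pitches

print(logistic_pitches(8))  # deterministic for fixed r and x0
```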
The first and the last methods generate skipwise or disjunct melodic motion, characterized by frequent skips between notes, whereas the second method produces stepwise or conjunct melodic motion, where the intervals are never greater than four semitones. These differences lead us to consider whether the pitch generation method used to create the training set affects the way the DL model for the OMR task learns. Since we want to find the most favorable scenario for the OMR learning task, we consider this an issue that needs further investigation, and we decide to approach it together with the coding proposals in the next section.

3.2 Real data

We aim to find the most favorable scenario for the OMR learning task when training with homophonic synthetic scores and testing with homophonic real scores. For a proper evaluation, we consider a sufficiently large set of real data taken from the RISM repository that contains the same symbols that our score generator is able to produce. The selected corpus contains 1 954 real music staves of homophonic incipits, i.e., short sequences of notes, typically the first measures of a piece, used to index and identify a melody or musical work. For each incipit, an image of the rendered score with artificial distortions (the same as those used for the synthetic data) is provided, as well as the expected transcription in any of the encodings described in Sect. 2.1. Figure 13 depicts an example of a particular staff from this corpus. It must be noted that the Camera-based Printed Images of Music Staves (Camera-PrIMuS) database [12] is not suitable for the present work, since it contains only monophonic RISM-based music scores.

Fig. 13 Staff sample from the selected RISM corpus. Incipit RISM ID no. 000139189111

The selected RISM set amounts to a total of 63 011 music symbols, representing 341 different classes. Of the total number of symbols, 15 699 belong to a vertical distribution, i.e., two or more symbols that share the same horizontal position. There are 7 546 vertical distributions.

3.3 Neural network configuration

As mentioned in Sect. 2, the neural model considered in this work is based on the architecture by Calvo-Zaragoza et al. [11]. While the configuration is broadly described in Sect. 2, the actual composition of each layer is given in Table 1. The model is trained following the backpropagation method provided by CTC for 300 epochs, using the ADAM optimizer [23] with a fixed learning rate of 0.001 and a batch size of 16 samples.

Table 1 Layer-wise description of the CRNN architecture considered. Notation: Conv(f, w × h) stands for a convolution layer of f filters of size w × h pixels; BatchNorm performs the normalization of the batch; LeakyReLU(α) represents a Leaky Rectified Linear Unit activation with negative slope α; MaxPool(w × h) represents the max-pooling operator of w × h dimensions and striding factors; BLSTM(n, d) denotes a bidirectional Long Short-Term Memory unit with n neurons and dropout value d

Convolutional block
  Layer 1: Conv(64, 5 × 5), BatchNorm, LeakyReLU(0.20), MaxPool(2 × 2)
  Layer 2: Conv(64, 5 × 5), BatchNorm, LeakyReLU(0.20), MaxPool(1 × 2)
  Layer 3: Conv(128, 3 × 3), BatchNorm, LeakyReLU(0.20), MaxPool(1 × 2)
  Layer 4: Conv(128, 3 × 3), BatchNorm, LeakyReLU(0.20), MaxPool(1 × 2)
Recurrent block
  Layer 5: BLSTM(256), Dropout(0.50)
  Layer 6: BLSTM(256), Dropout(0.50)
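For illustration, Table 1 can be rendered in a few lines of TensorFlow/Keras. This is a plausible reading of the table rather than the authors' released code: the framework choice, the input height of 64 pixels, the interpretation of MaxPool(w × h) as pooling that preserves horizontal resolution, and the placement of dropout inside the LSTM layers are all assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(img_height: int = 64, num_symbols: int = 341) -> tf.keras.Model:
    """CRNN after Table 1: four Conv-BatchNorm-LeakyReLU-MaxPool stages,
    two BLSTM(256) layers with dropout 0.5, and a softmax over the symbol
    set plus one extra class for the CTC blank."""
    inputs = layers.Input(shape=(img_height, None, 1))  # grayscale staff, any width
    x = inputs
    # Keras pool sizes are (height, width); Table 1's pools are read as
    # reducing height aggressively so each image column remains a frame.
    for filters, kernel, pool in [(64, 5, (2, 2)), (64, 5, (2, 1)),
                                  (128, 3, (2, 1)), (128, 3, (2, 1))]:
        x = layers.Conv2D(filters, kernel, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.20)(x)
        x = layers.MaxPooling2D(pool_size=pool)(x)
    # Turn the feature map into a sequence: one feature vector per column.
    x = layers.Permute((2, 1, 3))(x)  # -> (width, height, channels)
    x = layers.Reshape((-1, (img_height // 16) * 128))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True, dropout=0.5))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True, dropout=0.5))(x)
    outputs = layers.Dense(num_symbols + 1, activation="softmax")(x)
    # Trained with the CTC loss and Adam at a fixed learning rate of 0.001.
    return tf.keras.Model(inputs, outputs)
```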
3.4 Evaluation protocol

Concerning evaluation metrics, there is an open debate on how to evaluate the capabilities of OMR systems [8, 25]. In this work, OMR is understood simply as a pattern recognition task, so we shall consider metrics that allow us to draw reasonable conclusions from the experimental results. Therefore, the performance of the presented recognition schemes is assessed by considering the symbol error rate (SER, %), as utilized in previous works addressing end-to-end transcription tasks [12]. This figure of merit is computed as the average number of elementary editing operations (insertions, deletions, or substitutions) necessary to match the sequence predicted by the model with the ground-truth sequence, normalized by the length of the latter.

The number of separators used in the transcription to deal with simultaneities in the horizontal dimension (the "time" line) differs among the four proposed encodings. This fact might have implications for learning. Hence, we also report results restricted to the symbols that are not separators; we refer to this variant of the SER metric as NonSep-SER. To obtain more insight into the system's performance on simultaneities as opposed to monophonic segments, we also show the proportion of NonSep-SER caused by the editing operations that happen in regions (i) that belong to a simultaneity in the ground truth, defined as Sim-SER, and (ii) that are monophonic in the ground truth, denoted as NonSim-SER.
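SER as defined above is a normalized Levenshtein distance over symbol sequences. A straightforward reference implementation (ours, for illustration) is:

```python
def symbol_error_rate(prediction: list, ground_truth: list) -> float:
    """SER: edit distance (insertions, deletions, substitutions) between
    the predicted and ground-truth symbol sequences, normalized by the
    length of the ground truth. Returned as a percentage."""
    m, n = len(prediction), len(ground_truth)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == ground_truth[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 100.0 * dp[m][n] / max(n, 1)
```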
4 Results

This work aims to provide insights into (i) which of the serialized ways of encoding the music content presented in Sect. 2.1 are more suitable for recognizing real homophonic music scores, and (ii) how the use of synthetic training data affects the transcription of such data. For that, we specifically consider two evaluation cases: a first one, denoted as the Best Encoding Experiment, devoted to finding the most suitable code out of the four proposed in Sect. 2.1 for the OMR output; and a second one, named the Best Algorithmic Composition Method Experiment, which studies the most appropriate composition method for the OMR learning task.

It must be noted that, for all the considered scenarios, the evaluation set refers to the real homophonic scores collected from RISM. Hence, the synthetic data are generated in such a way that they contain a similar number of measures per staff, symbols per staff, and vertical distributions per staff (all of them on average) as the selected RISM corpus. Moreover, the generated data are distorted in the same way as the RISM data.

4.1 Best encoding experiment

This experiment aims to determine which of the four serialization proposals works best as the output for the OMR task in a homophonic music scenario. To do so, a corpus of 1 500 labeled scores, each consisting of a single staff, is generated using the system for the automatic generation of labeled data explained in the previous section. Each sample is a pair composed of the image of a rendered staff and its corresponding representation in the format imposed by one of the four proposed musical encodings, as in the example shown in Fig. 7. As explained above, the operation of the three composition methods of the automatic generation system makes them produce heterogeneous music scores. That is why the three of them are used equally when generating this corpus, so that it is not biased in favor of any particular style. We refer to this procedure as the mix composition method.

We derive two non-overlapping partitions (train and validation) corresponding to 60% and 40% of the data, respectively, following a fivefold cross-validation scheme. Each fold is tested on the selected RISM corpus described in Sect. 3.2. The results obtained in terms of the SER metric are presented in Table 2; the figures provided represent the average values on the test partition at the point where the validation data achieve their best performance for each of the considered cases.

Table 2 Results obtained for each encoding when models are tested on the 1 954 selected RISM incipits. Remain = remain-at-position character coding; Advance = advance-position character coding; Parenthesized = parenthesized coding; Verbose = verbose coding

                 Remain   Advance   Parenthesized   Verbose
SER (%)           26.8      17.1        26.3          19.1
NonSep-SER (%)    26.8      26.0        26.9          28.3
Sim-SER (%)        9.5       9.2         9.2           9.9
NonSim-SER (%)    17.3      16.8        17.7          18.4

An initial remark is that the results depicted in Table 2 indicate that the neural network is indeed learning from synthetic data but, as seen in previous efforts [2], the encoding of the output for this OMR task plays an important role in the training and, consequently, in the recognition performance. The advance-position character coding achieves the best results for the NonSep-SER. The remain-at-position character and the parenthesized codes follow closely, both with similar results, while the verbose code comes last. This order is altered when the results are observed through the plain SER metric: wordy encodings are favored, leading to deceiving insights. In other words, a wordier code, e.g., the verbose encoding, could predict all the separator symbols correctly while missing the remaining symbols (the ones that really matter to the user) and still achieve a lower error rate than a less verbose encoding, e.g., the remain encoding.
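This deceiving effect is easy to reproduce with the SER sketch given in Sect. 3.4 above. In the hypothetical example below, both predictions make exactly one content error; the wordier ground truth simply has more separator tokens padding its length, so its plain SER looks better even though the NonSep-SER is identical:

```python
gt_verbose = ["clef.G:L2", "+", "digit.2:L4", "/", "digit.4:L2", "+", "note.quarter:L3"]
pr_verbose = ["clef.G:L2", "+", "digit.2:L4", "/", "digit.4:L2", "+", "note.half:L3"]

gt_remain = ["clef.G:L2", "digit.2:L4", "/", "digit.4:L2", "note.quarter:L3"]
pr_remain = ["clef.G:L2", "digit.2:L4", "/", "digit.4:L2", "note.half:L3"]

print(symbol_error_rate(pr_verbose, gt_verbose))  # ~14.3: separators pad the length
print(symbol_error_rate(pr_remain, gt_remain))    # 20.0: same single content error

strip = lambda seq: [t for t in seq if t not in {"+", "/"}]
print(symbol_error_rate(strip(pr_verbose), strip(gt_verbose)))  # 25.0 (NonSep-SER)
print(symbol_error_rate(strip(pr_remain), strip(gt_remain)))    # 25.0 (NonSep-SER)
```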
We believe that the advance encoding performs better because of how the CTC loss function works: the system reads vertical slices and outputs the symbols present in them. In the case of contiguous input frames containing only staff lines, the system can output either nothing or an "empty-output" symbol, such a symbol being the "+" advance separator. On the opposite side, we find that the remain and parenthesized codes overload the learning process, as we are forcing the system to output more symbols in the same situation. This idea is reinforced by the verbose encoding's results. Therefore, neither the remain-at-position character code, nor the parenthesized code, nor the verbose code is a suitable choice for transcribing the content of homophonic scores with the CTC objective function.

As a last remark, we would like to point out that the results suggest that simultaneities do not present a recognition problem by themselves. When decomposing the NonSep-SER into its two component fractions, Sim-SER and NonSim-SER, Table 2 reports that the highest proportion of errors occurs in monophonic zones.

To support the relevance of those statements, we shall now assess the results in terms of statistical significance. For that, we resort to the nonparametric Wilcoxon signed-rank test [14]. This analysis considers that each result obtained for each fold constitutes a sample of the distributions to compare. Considering this assessment scheme, the results obtained are reported in Table 3.

Table 3 Statistical significance analysis of the presented encoding schemes considering the Wilcoxon signed-rank test with a significance value of p < 0.05 for the symbol error rate metric when the corresponding separator symbols are excluded from the computation, i.e., NonSep-SER. The symbols <, >, and = indicate that the error of the method in the row is significantly lower than, greater than, or no different from that in the column, respectively

                 Remain   Advance   Parenthesized   Verbose
Remain             –         >           =             <
Advance            <         –           <             <
Parenthesized      =         >           –             <
Verbose            >         >           >             –

The results obtained with a significance value of p < 0.05 show that the advance-position character coding presents significant differences with respect to the other representations, and therefore it will be used in all further experiments.
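Such a pairwise comparison can be run directly with SciPy, treating the per-fold error rates of two encodings as paired samples. The fold values below are made up for illustration; with only five folds, a one-sided test is the natural choice:

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold NonSep-SER values (%) for two encodings (5 folds).
advance = [25.4, 26.1, 25.8, 26.3, 26.4]
verbose = [28.0, 28.5, 27.9, 28.6, 28.5]

# One-sided test: is the advance encoding's error significantly lower?
stat, p_value = wilcoxon(advance, verbose, alternative="less")
print(f"W = {stat}, p = {p_value:.4f}")  # p = 0.0312 < 0.05 for these numbers
```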
4.2 Best algorithmic composition method experiment

This experiment is addressed to identify which algorithm(s) for the synthetic generation of data work better for real data. For this purpose, we also generate a corpus of 1 500 labeled scores for each of the three remaining composition methods, given that the one for the mix algorithm is already generated. The results obtained in terms of the different considered metrics are presented in Table 4. Note that the figures provided represent the average values on the test partition at the point where the validation data achieve their best performance for each of the considered cases.

Table 4 Results obtained for each composition method when models are tested on the 1 954 selected RISM incipits. Normal = normal distribution method; Random = random walk method; Logistic = sonification of the logistic equation method; Mix = mix method (the equal use of the three previous methods in the data set)

                 Normal   Random   Logistic   Mix
SER (%)           21.1     16.8      39.8     17.1
NonSep-SER (%)    32.7     25.2      67.2     26.0
Sim-SER (%)       10.3      9.3      14.6      9.2
NonSim-SER (%)    22.4     15.9      52.6     16.8

The results reveal the following conclusions: (1) the sonification of the logistic equation method by itself does not work well, since about 40% of the predicted symbols are wrong; (2) the normal distribution method halves the SER compared to the previous method; and (3) the random walk method and mixing data from all methods (the mix method) further enhance that improvement, with both methods showing similar values. Such a trend is visible in all the figures of merit considered. To support the relevance of those statements, we also assess the results in terms of statistical significance following the nonparametric Wilcoxon signed-rank test introduced in Sect. 4.1. The analysis is reported in Table 5.

Table 5 Statistical significance analysis of the presented composition methods considering the Wilcoxon signed-rank test with a significance value of p < 0.05 for the symbol error rate metric. The symbols <, >, and = indicate that the error of the method in the row is significantly lower than, greater than, or no different from that in the column, respectively

           Normal   Random   Logistic   Mix
Normal       –        >         <        >
Random       <        –         <        =
Logistic     >        >         –        >
Mix          <        =         <        –

The results obtained with a significance value of p < 0.05 show that the random and mix methods significantly outperform the other composition strategies while showing no significant differences between each other. This implies that these two composition algorithms are the most suitable for generating synthetic training data.

We would like to reduce the error figures on the selected RISM corpus by exploiting the fact that we have an "infinite" generator; that is, thanks to the automatic generation system, we will always be able to generate new data that the neural network has probably not seen before. To gain insight into this issue, we first compute the greatest achievable performance with real training data: the lower bound that we want to surpass. For that, we derive three non-overlapping partitions (train, validation, and test) corresponding to 60%, 20%, and 20% of the 1 954 selected RISM incipits, respectively, following a fivefold cross-validation scheme. We train a model using those partitions. We compare it with models trained on 1 500, 3 000, 15 000, and 150 000 samples generated using the random walk method and the mix method, respectively, when evaluated over the same RISM test partition. Table 6 reports the results obtained.

Table 6 Results (%) obtained for each composition method when models are trained using different synthetic set sizes and tested on the same RISM samples as the real-only model (rightmost column)

                    1 500           3 000           15 000          150 000        RISM
                 Random   Mix    Random   Mix    Random   Mix    Random   Mix
SER (%)           16.9    17.3    15.3    15.3    12.7    12.4    12.1    11.5     3.3
NonSep-SER (%)    25.3    26.4    22.9    23.3    18.9    18.3    17.7    16.8     4.8
Sim-SER (%)        9.7     9.7     8.7     8.9     7.6     7.3     7.4     7.2     2.5
NonSim-SER (%)    15.6    16.7    14.2    14.4    11.3    11.0    10.3     9.6     2.3

The results reveal an exponential decay in the various figures of merit considered: while going from 1 500 to 15 000 samples improves the SER by 5 points, multiplying the size of the data by 10 a second time achieves an improvement of less than 1 point. This suggests that in the second zone we reach the plateau of the curve (see Fig. 14). In other words, there exists a glass ceiling when recognizing real scores with synthetic-trained models. The nonparametric Wilcoxon signed-rank test introduced in Sect. 4.1 reinforces this finding by stating that the error rates are not significantly different.

Fig. 14 Average results obtained on the RISM data for different sizes of the synthetic training corpus. The synthetic procedure follows the mix method, since it has achieved, on average, lower error rates (see Table 6); nevertheless, the conclusions can be extrapolated to the random walk method. Nonlinear least squares are used to fit the data
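The plateau can be made explicit by fitting a saturating curve to the figures in Table 6, in the spirit of the nonlinear least-squares fit of Fig. 14. The exponential-plus-offset form below is our assumption (the paper does not state the fitted function); its offset c estimates the glass ceiling:

```python
import numpy as np
from scipy.optimize import curve_fit

# Mix-method SER (%) from Table 6 versus synthetic training-set size.
sizes = np.array([1_500, 3_000, 15_000, 150_000], dtype=float)
ser = np.array([17.3, 15.3, 12.4, 11.5])

def decay(n, a, b, c):
    """Saturating exponential: c is the asymptote, i.e., the glass ceiling."""
    return a * np.exp(-b * n) + c

params, _ = curve_fit(decay, sizes, ser, p0=(10.0, 1e-4, 11.0), maxfev=10_000)
print(f"estimated plateau: {params[2]:.1f}% SER")
```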
The lower error bound found, of around 12%, might be due to the underlying (musical) language model of the composition algorithms. The generated synthetic corpora are created in such a way that they contain a similar number of measures per staff, symbols per staff, and vertical distributions per staff (all of them on average) as the selected RISM corpus, as well as the same graphical appearance. However, this might not be enough, as the synthetic data distribution fails to capture all the characteristics of the real data. Table 6 shows that training with real homophonic scores yields an error rate of 3.3%. This is less than one-third of the errors made by the best model trained on synthetic data only.

We would like to stress again how simultaneities do not represent a problem. When the training data correctly capture the distribution of the test data, the error rate in the simultaneities and in the monophonic segments is roughly the same (see the RISM column in Table 6). At the same time, this reinforces our idea about the glass ceiling. For the rest of the models, the error rate is bigger in monophonic regions, since the synthetic distribution does not properly model that of the RISM data. The underlying problem is one of out-of-distribution learning, for which no satisfactory solution is known at this time, at least for the CRNN-CTC framework.

To validate our intuition about the cause of the glass ceiling, we start from the premise that one possible solution would be to add scores from the test distribution to the training set. For that, we derive three non-overlapping partitions (train, validation, and test) corresponding to L, L, and 1 954 − 2L of the 1 954 selected RISM incipits, respectively, where L is the number of randomly selected samples. We add the train and validation sets to the corresponding synthetic train and validation partitions of the random and mix corpora of size 150 000, respectively, and use the test set to evaluate such synthetic-with-real models. We compare those models with a model trained only on the aforementioned RISM partitions. We want to see (i) whether the glass ceiling can be broken by using real samples, and (ii) how many of them are needed if (i) proves to be the case. It must be noted that a fivefold cross-validation scheme is followed in this experiment to ensure that the results are not conditioned on the randomly selected samples. The results obtained in terms of the SER metric for the contemplated scenarios are presented in Fig. 15 and Table 7.

Fig. 15 Average symbol error rate (%) attained for each scenario with respect to the number of randomly selected RISM training staves, L

Table 7 Results obtained in terms of the symbol error rate (SER, %) for each scenario with respect to the number of randomly selected RISM training staves, L

          L = 0     50     100    150    200    250    300    350    400
RISM        –      90.6   74.1   15.6   12.2    8.9    8.3    7.3    6.4
Random     16.8    10.6    8.4    7.4    6.3    5.9    5.6    5.2    4.6
Mix        17.1    10.2    7.9    7.1    6.1    5.9    5.0    5.1    4.6

First, it is necessary to state that the glass ceiling caused by the synthesis process can be broken by incorporating real scores into the synthetic train partition. This outcome supports the initial premise that the lower error bound attained with synthetic-only models was due to the synthesis process. When combined with synthetic data, we only need 50 real scores to decrease the 17% error rate to 10%, and 100 to halve it. If used on their own, the real-only model is not able to solve the transcription task; such a model only starts to do so after 150 samples, and even then, using either only synthetic data or a combination of synthetic and real data still yields better results. This implies that manually labeling some samples pays off, as combining synthetic and real data brings out synergies that help reduce the baseline error rate set by the real-only model. Moreover, as analyzed in [3], the posterior correction of errors of the synthetic-and-real models is more than offset by the time saved when compared with the real-only model, as the latter needs more manually transcribed training samples.
Regarding the synthesis method, it can be seen in both 13039/501100011033. The first author is supported by grant FPU19/049 57 from the Spanish Ministerio de Universidades. Fig. 15 and Table 7 that even though their performance is quite similar for all cases, the mix method tends to achieve Data availability The datasets generated during and/or analyzed dur- slightly lower error rates. ing the current study are available from the corresponding author on reasonable request. Declarations 5 Conclusions Conflict of interests The authors declare that they have no conflicts of In this work, we have studied the suitability of the state- interest. of-the-art end-to-end neural approach to recognize real homophonic music scores by presenting and analyzing four Open Access This article is licensed under a Creative Commons different encodings for the OMR output. Throughout the Attribution 4.0 International License, which permits use, sharing, adap- tation, distribution and reproduction in any medium or format, as research, we have trained the neural network with synthetic long as you give appropriate credit to the original author(s) and the scores, created by the system of automatic generation of source, provide a link to the Creative Commons licence, and indi- labeled data. This makes the use of an infinite music data gen- cate if changes were made. The images or other third party material erator helpful in dramatically reducing the costs of acquiring in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material scores for training OMR systems. is not included in the article’s Creative Commons licence and your As reported in the first part of the experiments, our seri- intended use is not permitted by statutory regulation or exceeds the alized ways of encoding the music content prove to be permitted use, you will need to obtain permission directly from the copy- appropriate for our DL-based OMR, as the learning process right holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/. is successful, and low SER figures are eventually attained. In addition, it is shown that the choice of the encoding has some impact on the lower bound of the error rates that can be achieved: the advance-character position code is the one that most benefits the learning process in the recognition of ver- References tical structures found in a homophonic music environment. 1. Alfaro Contreras M (2018) Construcción de un corpus de referen- These facts reinforce our initial claim that the encoding of cia para investigación en reconocimiento automático de partituras the output for OMR deserves further consideration within the musicales. Technical report, Universidad de Alicante. (In Spanish) end-to-end DL paradigm. 2. Alfaro-Contreras M, Calvo-Zaragoza J, Iñesta JM (2019) It has also been possible to demonstrate that the algo- Approaching end-to-end optical music recognition for homophonic scores. In: Iberian conference on pattern recognition and image rithmic composition method used in the creation of synthetic analysis, pp 147–158. Springer music has a strong influence on the recognition results, being 3. Alfaro-Contreras M, Rizo D, Iñesta JM, Calvo-Zaragoza J (2021) the random walk method and the mix method the most suit- OMR-assisted transcription: a case study with early prints. In: Pro- able algorithmic composition techniques. 
However, although ceedings of the 22nd international society for music information retrieval conference, pp 35–41, Online. ISMIR the learning process was successful, there exists a glass 4. Bainbridge D, Bell T (2001) The challenge of optical music recog- ceiling when recognizing real scores: the Sym-ER never nition. Comput Humanit 35(2):95–121 decreased below 11%, regardless of largely increasing the 5. Baró A, Badal C, Fornés A (2020) Handwritten historical music size of the training set. To break the glass ceiling, the use recognition by sequence-to-sequence with attention mechanism. In: 17th International conference on frontiers in handwriting recog- of real sheet music is necessary. It indicates that there is a nition, ICFHR 2020, Dortmund, Germany, 2020, pp 205–210 part of the learning process that is not related to the graph- 6. Baró A, Riba P, Fornés A (2018) A starting point for handwritten ical aspects of the scores but to the underlying (musical) music recognition. In: 1st International workshop on reading music language model. We believe this opens up new avenues for systems. France, Paris, pp 5–6 7. Burgoyne JA, Pugin L, Eustace G, Fujinaga I (2007) A comparative research. For example, modeling more intelligent systems of survey of image binarisation algorithms for optical recognition on automatic generation of labeled data. It might be convenient degraded musical sources. In: Proceedings of the 8th international to first learn some characteristics of the language model of conference on music information retrieval, ISMIR 2007, Vienna, the music that will be recognized at a later time in order Austria, 2007, pp 509–512 123 International Journal of Multimedia Information Retrieval (2023) 12:12 Page 13 of 13 12 8. Byrd D, Simonsen JG (2015) Towards a standard testbed for optical 27. Pacha A, Calvo-Zaragoza J, Jr JH (2019) Learning notation graph music recognition: Definitions, metrics, and page images. J New construction for full-pipeline optical music recognition. In: Flexer Music Res 44(3):169–195 A, Peeters G, Urbano J, Volk A, (eds) In: Proceedings of the 20th 9. Calvo-Zaragoza J, Jr JH, Pacha A (2020) Understanding optical international society for music information retrieval conference, music recognition. ACM Comput Surv, 53(4): 1–77 ISMIR 2019, Delft, The Netherlands, 2019, pp 75–82 10. Calvo-Zaragoza J, Micó L, Oncina J (2016) Music staff removal 28. Pacha A, Eidenberger H (2017) Towards a universal music symbol with supervised pixel classification. Int J Doc Anal Recognit classifier. In: 2017 14th IAPR International conference on docu- 19(3):211–219 ment analysis and recognition (ICDAR), 2, pp 35–36. IEEE 11. Calvo-Zaragoza J, Rizo D (2018) Camera-PrIMuS: neural end-to- 29. Pedersoli F, Tzanetakis G (2016) Document segmentation and clas- end optical music recognition on realistic monophonic scores. In: sification into musical scores and text. Int J Doc Anal Recognit Proceedings of the 19th international society for music information 19(4):289–304 retrieval conference, ISMIR 2018, Paris, France, 2018, pp 248–255 30. Raphael C, Wang J (2011) New approaches to optical music recog- 12. Calvo-Zaragoza J, Rizo D (2018) Camera-PrIMuS: neural end-to- nition. In: Klapuri A, Leider C, editors, In: Proceedings of the 12th end optical music recognition on realistic monophonic scores. 
Funding …13039/501100011033. The first author is supported by grant FPU19/04957 from the Spanish Ministerio de Universidades.

Data availability The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of interest The authors declare that they have no conflicts of interest.
International Journal of Multimedia Information Retrieval – Springer Journals
Published: Jun 1, 2023
Keywords: Optical music recognition; Deep learning; End-to-end recognition; Music encoding