Abstract
Enabling multiple-purpose robots to follow textual instructions is an important challenge on the path to automating skill acquisition. In order to contribute to this goal, we work with physical exercise instructions as an everyday activity domain in which textual descriptions are usually focused on body movements. Body movements are a common element across a broad range of activities that are of interest for robotic automation. Developing a text-to-animation system, as a first step towards language understanding for machines, is therefore an important task. The process requires natural language understanding (NLU), including non-declarative sentences, and the extraction of semantic information from complex syntactic structures with a large number of potential interpretations. Despite a comparatively high density of semantic references to body movements, exercise instructions still contain a large amount of underspecified information. Detecting and bridging or filling such underspecified elements is extremely challenging when relying on methods from NLU alone. Humans, however, can often add such implicit information with ease, due to its embodied nature. We present a process that combines a semantic parser and a Bayesian network. It explicates the information contained in textual movement instructions so that an animated execution of the motion sequences, performed by a virtual humanoid character, can be rendered. Human computation is then employed to determine best candidates and to further inform the models in order to increase performance adequacy.

1. INTRODUCTION
Pervasive automation, be it for industrial automation in general or for robots performing everyday activities in particular, constitutes a topical research challenge [1]. Current state-of-the-art robotics still faces substantial limitations [2]. Most crucially, robots are mainly designed and programmed for specific tasks and are, therefore, limited in their scope. The same robot, even given a multiple-purpose hardware design, cannot be used for different types of work unless completely reprogrammed. At the same time, multiple-purpose robots that can perform different tasks in households or in industry in a flexible and robust manner would be beneficial for the real world and make for a considerable potential market. While considerable developments are under way in this area, aiming at more efficient multiple-purpose repurposing and task execution instruction, they mostly rely on learning by demonstration with a very precise and limited scope [3]. This motivates our work, which ultimately aims at supporting the development of multiple-purpose robots that can perform different tasks using various sources of text-based instructions, where the main input is limited to natural language text.

The automatic extraction of movement plans from textual instructions for robot automation is a novel research area [4]. The approach is promising with regard to scaling and automating task ability acquisition, since a large number of instructions for a wide range of activities are readily available online for various domains (e.g. instructables,1 WikiHow,2 etc.) that could be of great use for robotic task execution [4]. As an intermediate step, we present a process and prototype system for automatically generating adequate digital avatar movement-sequence executions, which are a prerequisite for mapping to a physical robotic body.
Our exploration is based on a combination of a fully automated text-to-animation system with humans in the loop for improving results at acceptable cost. As a first proof-of-concept case study, we focus on creating a pipeline to extract action specifications from text-based instructions in the domain of exercise instructions. Instruction sheets as they are typically employed for exercise instruction in physiotherapy, rehabilitation or prevention (PRP) form the primary input. Accordingly, we developed a Text to Animation system for Physical Exercises (TAPE). Starting with textual representations of physical exercises, TAPE automatically generates corresponding animations. Exercise instructions were chosen as an initial domain, since they have a clear focus on body movements and on individual elements of such movements, which can function as building blocks for a wide variety of activities. Due to this explicit focus, physical exercise instructions can be expected to contain few elements that are not concerned with movement instructions and thus comparatively little noise in the form of semantic variability, which eases the process of initially establishing the TAPE system. As an additional benefit, the resulting movement-sequence executions can be employed to provide an alternative, and potentially more engaging and clearer, exercise instruction modality for the application area of PRP [5, 6].

Natural language understanding (NLU) is an important element of such an automatic text-to-animation system. Despite notable recent progress, such as WordsEye [7] and CarSim [8], general-purpose NLU is still a challenge for computer systems, e.g. with respect to contextualized disambiguation [9]. Different kinds of semantic parsers are available that extract the semantic information contained in sentences. While most work has focused on understanding declarative utterances and interrogatives, the text type of instructions, with its ensuing imperative forms, has received less attention. Consequently, typical semantic parsers usually give satisfying results only for declarative sentence structures. We aim to understand imperative structures, which contain a lot of implicit information, even under limiting pre-conditions such as a specific focus on exercising. As indicated above, implicit information is difficult to extract using a semantic parser, since it often relates to contextual or experiential knowledge. Humans, in contrast, are far better at filling the gaps and understanding these types of textual instructions. Human computation methods can thus be employed to fill gaps in automated NLU pipelines that cannot yet be filled by digital computation alone. If the human input is not only used to facilitate task-specific solutions, but also to improve the underlying models, this can further the development of more general approaches to scalable NLU.

We propose a system that consists of four parts: (1) a semantic parser, (2) a Bayesian network, (3) an animation creation system and (4) human computation for validation and feedback. In the first step, semantic information is extracted using embodied construction grammar (ECG) [10]. The second step attempts a best-guess explication of a complete semantic construct, filling implicit location information using a Bayesian belief network [11].
As a third step, the system generates an animation file using an appropriate XML-based movement markup language, which is then employed to generate a variable number of best-candidate animation videos as output relating to the original textual exercise instructions. As a final step, human computation serves to isolate the best candidate renderings. The resulting human computation ratings can then be used to update the Bayesian network in order to improve the quality of future best-candidate generation. Work in this direction makes important contributions: natural language understanding has been the subject of a large array of research efforts, and while domain-specific solutions exist, the range of domains is limited and the understanding within these domains is usually still limited to a predefined selection of constructions.

2. BACKGROUND
The multimodal text-to-animation system CONFUCIUS [12] generates animations of virtual humans from single declarative sentences containing an action verb. This system fully relies on the action verb and the subject of the sentence. However, CONFUCIUS is limited to a set of specific action verbs and cannot process instruction-based input, whereas our goal is to create animations from typical text-based instruction sheets, which mainly consist of imperative sentences. Work by Chang et al. [13] on semantic parsing of text for 3D scene generation incorporates spatial knowledge and parses natural text into a semantic representation. This system addresses the inference of implicit spatial constraints by learning priors on spatial knowledge (e.g. typical positions of objects and common spatial relations). The user can interactively manipulate the generated scene with textual commands, enabling the system to refine and expand the learned priors [14]. Åkerberg et al. [8] introduced CarSim, an automatic text-to-scene system for the purpose of supporting the analysis of road accidents. The system relies on a type of text in which the scene has been described in full detail and is limited to the road accident domain. The automatic text-to-scene conversion system WordsEye by AT&T Labs-Research [7] aims at supporting the creation of 3D scenes from a textual description. It is based upon pre-designed 3D models and poses that depict entities and actions; it is limited to static 3D scenes and cannot create animation videos.

As noted in the introduction, the application area of physical exercise instructions offers the benefit of a rather constrained focus on body movement, which is helpful for early explorations towards more generally capable text-to-animation systems aiming at broader purposes in robotics. At the same time, the area of motion-based digital applications and games for health has seen increasing attention in recent years due to the promise of offering motivation, guidance and objective analysis, hence forming a promising additional approach to facing the individual and societal burden caused by underlying developments such as aging societies and modern sedentary lifestyles [15]. The potential benefits and limitations can best be illustrated along the lines of exemplary projects from the application area.

To develop an animation system from textual instructions, natural language understanding is essential, and a semantic parser is the first and most important part of such a system. Some researchers have also explored novel ways of capturing the semantics of action verbs.
For example, [16] proposed a model that uses metaphoric projections of motion or action verbs to infer important features of abstract plans and events in real time. The model is essentially an active representation of motion verbs that can be used to control real-time inference in natural language understanding.

The scientific community has produced a number of different syntactic and semantic parsers. The Stanford parser [17] is a syntactic parser which is primarily employed for syntactic analysis and part-of-speech tagging. In general terms, the range of semantic parsers goes from shallow semantic parsers using support vector machines [18] via semantic role labellers [19] to fully symbolic semantic parsers, such as the SEMAFOR parser [20]. SEMAFOR is a well-known probabilistic frame-semantic parser that uses FrameNet [21] as a knowledge source. It analyzes the semantic information of a sentence using different FrameNet frames, e.g. Person, Role, Performer, People_by_age, etc. For our text-to-animation goal, we also need to map the textual instructions onto frames, but onto ones that SEMAFOR or other typical semantic parsers are not able to extract. Beyond semantic tagging, a semantic interpretation requires the application of a grammar that is also interpretable by machines and robots, so that it can serve our long-term goal of designing multiple-purpose robots. Construction grammars [22] are a primarily linguistic approach to the semantic and pragmatic analysis of a language. Construction grammars are mainly used in cognitive linguistics [23], cognitive grammar [24] and radical construction grammar [25]. Embodied construction grammar (ECG) [26] and fluid construction grammar (FCG) [27] are the two existing formal construction grammars that are used for semantic parsing. Both are open source and were designed with robotics work in mind; with our long-term goal in focus, we found that we could build our grammar with either of the two.

Animation has a considerable influence on modern society. Everyone from young children to older adults is exposed to animation for different purposes, e.g. education, entertainment, etc. In research, animation techniques also play a role in communication, as well as forming a research topic in and of themselves. Text-to-animation systems are of interest to further application areas beyond robotics or games for health, e.g. animation for film. Starting from traditional, typically 2D animation, many different types of animation techniques are available nowadays (e.g. 3D animation, stop-motion animation, keyframe animation, kinematic animation, mechanical animation, puppetry animation, clay animation, sand animation, etc.). Next to the aforementioned areas of robotics and games for health, a system like TAPE can also contribute to the general field of animation by producing first-draft motions (e.g. using screenplay or stageplay scripts as the textual basis) that could then be improved by animators. This is possible since game and film animation production uses very similar formalisms for defining and processing animations, which can also be translated to formats preferred by the robotics community. Nowadays, a number of markup languages exist for specifying animations.
Due to the unique mix of considerations that arise when addressing both the immediate application area of text-to-generic-virtual-animation and the later application target of text-to-robotic-enaction, we aimed to find a formalism that can serve the needs of both application targets. Therefore, a selection of markup languages that can create animations was analyzed, as shown in Table 1.

Table 1. List of markup languages.
S. no. | Name | Speciality | Reference
1 | Virtual Reality Modeling Language | Web, humanoid animation | [28]
2 | Humanoid animation | Humanoid animation | [29]
3 | Multimodal Utterance Representation Markup Language | Humanoid animation | [30]
4 | Behavior Markup Language | Humanoid animation | [31]
5 | Signing Gesture Markup Language | Sign language | [32]
6 | Character Markup Language | Figure-based animation, not humanoid | [33]
7 | Affective Presentation Markup Language | Facial animation | [34]
8 | Expressive MOTion Engine | Arm movement | [35]
9 | Hamburg Notation System for Sign Languages | Sign language | [36]
10 | Human Markup Language | Emotion | [37]
11 | Avatar Markup Language | Facial and body animation | [38]
12 | Multi-modal Presentation Markup Language | Facial and hand movements | [39]
13 | Multimodal Presentation Markup Language for Virtual Reality | Facial and hand movements | [40]
14 | Scripting Technology for Embodied Persona | Humanoid | [41]
15 | Behavior Expression Animation Toolkit | Behavior | [42]
16 | XML-based Markup Language for Embodied Agents | Humanoid | [41]
17 | Improv | Behavioral | [43]
18 | Virtual Human Markup Language | Facial animation, body animation | [44]
19 | Solid Agents in Motion | Exchanging messages | [45]
As the table shows, different markup languages are available for creating different types of animation, e.g. humanoid, facial, arm, sign or body animation. To generate animations for physical exercises, we mainly require humanoid animation. At the start of the project, we settled on the Behavior Markup Language, which facilitates humanoid animation. In the future, we aim to extend the system to support further humanoid animation formats, such as VRML or H-Anim, in order to create animations without dependence on a single markup language platform.

3. QUALITY ASSESSMENT UNDER DIFFERENT VISUALIZATIONS
In order to facilitate investigations into text-to-animation techniques for exercise instructions, we developed a Physical Exercise Instruction Sheet Corpus (PEISC) of around 1000 physical exercise instructions drawn from a number of publicly available databases, e.g. sparkpeople,3 bodbot,4 yogajournal,5 etc. On the basis of different primary body actions, we structured the full set along a number of categories, such as standing, seated and lying exercises.
We performed a case study in order to determine which visualization modality would lead to the best quality of motion execution judgments by human computation workers, which is an important prerequisite if such judgments are to be employed to decide between automatically generated motion execution candidates. To this end, we chose five exercises (cf. Table 2) which do not require any additional equipment and fall under the standing category, since exercises from this category are less likely to suffer from strong detection noise due to covered body parts when using optical sensing. The five exercises that we recorded are squats, lateral lunges, standing iliotibial (IT) band stretch, forward lunges and reverse lunges.

Table 2. Best performer per exercise and the respective inter-rater agreement on the ranking regarding exercise execution accuracy.
Exercise | Performer | Kappa
Squats | Performer 1 | 0.51
Lateral lunges | Performer 2 | 0.59
Standing IT band stretch | Performer 6 | 0.53
Forward lunges | Performer 1 | 0.74
Reverse lunges | Performer 6 | 0.57

Using a Kinect device, we recorded the five exercises, each executed by seven participants (three male and four female; 15-35 years of age, M = 25, SD = 5). Ten iterations of every exercise were recorded from each participant, with exercise sets following a random order. Prior to the recording of each exercise, we only provided the instruction sheets and asked the participants to perform their interpretation of the exercises, without any priming regarding how to perform them. This served to validate the assumption that some differences due to individual interpretations would arise, and to facilitate testing whether quality-of-motion-execution judgments could later be used to determine best candidate executions based on the different visualization modalities. We also collected basic demographic information and, following each exercise, responses regarding the comprehensibility of the instruction sheets, using questionnaire items about the understanding of the textual instructions and about the difficulty of performing the exercises. From the recording sessions and the questionnaire data, we found that the understandability of the same exercise can differ from person to person. These results support the claim that exercise instruction sheets are a suboptimal modality for delivering exercise instructions [6]. The recording sessions also showed that human actors could be employed to gather digital skeleton executions of the exercises that could potentially be mapped to instructions for movement executions on robotic platforms. However, this still requires considerable manual effort, does not flexibly map to other activity domains and requires special hardware. Therefore, an automated system for movement extraction from textual instructions is deemed helpful to facilitate scalability.
We developed four different visualization modalities from the collected Kinect data (namely RGB, depth, skeleton and virtual reality, shown in Fig. 1), together with a survey application, aiming to crowdsource the assessment of the quality of exercise executions and to determine which visualization modality allows for high inter-rater agreement. Following the quality assessment survey, we provided a questionnaire to gather comparative responses on items expressing preferences regarding the visualization type and the movement quality of different body parts during the performance of the exercises, and to acquire additional demographic data. The outcomes of this study were meant to inform the choice of modality with which to present motion candidate videos for the validation of the output of the automated text-to-animation pipeline.

Figure 1. Top: exercise instruction sheet; right: screenshot of the survey application with four different visualizations.

In the survey application, participants were asked to read each instruction sheet and then watch the videos of all seven performances from the exploratory study in all four visualization modalities. All exercises and categories were presented in a random order on the screen. During testing, we found that participants found it difficult to determine the best performance, whereas picking out the worst one was usually easier. Therefore, we chose a procedure in which the participants were tasked with deleting the exercise execution with the worst quality and repeating this until only the best one remained; this was done for each of the exercises under all four modalities.

3.1. Results
For the exploratory evaluation of the tool and of the quality-of-motion assessment approach, 20 participants completed the survey and questionnaires. The evaluation was carried out using a downloadable application (a link to the application was spread via snowball sampling). With the help of kappa statistics [46], we calculated the best performer for each of the five exercises (shown in Table 2) and the best visualization type (displayed in Fig. 2), based on the data that we acquired from the crowdsourcing tool evaluation study [47-49]; a minimal sketch of such an agreement computation is given at the end of this subsection. We found that RGB was the best performing and most preferred modality and that depth and skeleton were the worst under both measures. Overall, VR and RGB were the best-liked visualizations and also allowed for the highest inter-rater agreement. Since convincing RGB animations that appear like actual video recordings of a human actor would be very difficult to produce automatically, we settled on the VR visualization modality for the following work.

Figure 2. Agreement for different visualizations.
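Reference [46] is cited only for "kappa statistics" and the exact variant is not stated in the text. The following sketch therefore assumes Fleiss' kappa as a multi-rater agreement measure over ranking categories, with purely hypothetical rating counts, to illustrate how such an agreement score can be computed; it is not the authors' evaluation code.

```java
import java.util.Arrays;

/** Illustrative Fleiss' kappa computation for multi-rater agreement (assumption, see text). */
public class FleissKappa {

    /**
     * counts[i][j] = number of raters that assigned subject (video) i to category j,
     * e.g. rank positions resulting from the delete-the-worst procedure.
     */
    static double fleissKappa(int[][] counts) {
        int subjects = counts.length;
        int categories = counts[0].length;
        int raters = Arrays.stream(counts[0]).sum(); // assumes an equal number of raters per subject

        double[] categoryTotals = new double[categories];
        double meanAgreement = 0.0;

        for (int[] row : counts) {
            int sumSquares = 0;
            for (int j = 0; j < categories; j++) {
                sumSquares += row[j] * row[j];
                categoryTotals[j] += row[j];
            }
            // observed agreement for this subject
            meanAgreement += (sumSquares - raters) / (double) (raters * (raters - 1));
        }
        meanAgreement /= subjects;

        double expectedAgreement = 0.0;
        for (int j = 0; j < categories; j++) {
            double p = categoryTotals[j] / (subjects * raters);
            expectedAgreement += p * p;
        }
        return (meanAgreement - expectedAgreement) / (1.0 - expectedAgreement);
    }

    public static void main(String[] args) {
        // Toy example: 5 videos, 3 rank categories, 20 raters each (hypothetical data).
        int[][] counts = {
            {14, 4, 2},
            {3, 15, 2},
            {2, 3, 15},
            {12, 5, 3},
            {4, 14, 2}
        };
        System.out.printf("Fleiss' kappa = %.2f%n", fleissKappa(counts));
    }
}
```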
4. PROPOSED SYSTEM
Based on the current state of the art, we set out a pipeline for extracting validated movements from instruction sheets, as shown in Fig. 3. To generate animations from instructions, we propose a series of steps with iterative loops for improved results. The pipeline can be divided into four discrete steps, which can be summarized as follows:

Step 1: The system extracts the basic semantic information needed to generate an animation from the textual instructions. In its current state, it extracts three different types of information, namely: Actions, i.e. different types of actions, such as lift, tilt, etc.; Body parts, i.e. those prominently involved in the exercise, e.g. shoulders, legs, etc.; and Locations, i.e. where a body part is to be moved to (focusing primarily on destination locations).
Step 2: If any elements from the three types of information above are missing or underspecified, the system refers to a Bayesian network in an attempt to infer the implicit semantic information.
Step 3: After extracting all required information in Steps 1 and 2, the system automatically generates an animation file using the Behavior Markup Language and renders a video based on the animation execution.
Step 4: This step is not shown in Fig. 3 and is part of our future agenda: using a human computation approach, we will determine the best candidates or results as the output of the system.

Figure 3. Pipeline for generating virtual movement executions from textual instructions.

As mentioned above, language understanding is an important part of developing an animation system that takes textual exercise instruction sheets as input. The implicit information contained in these exercise instructions is the most difficult part to extract automatically. It is not possible to extract this implicit information using existing systems, because semantic parsers can only extract information that is present in the sentence; they do not have the contextual knowledge required to infer implicit information. Therefore, we augmented a natural language understanding system in order to extract the required implicit information from the given exercise instruction sheets. Currently, our system is limited to single-sentence exercise instructions, such as lift your left arm. If information on the action, body part or location is missing, the system moves to the next step, a Bayesian network that sets the extracted information as evidence in the model and extrapolates the implicit information that is missing from the semantic analysis. In our physical exercise instruction sheet corpus, we found that exercise instructions for single poses frequently contain body parts and actions, but locations are often left implicit or are explicated in accompanying images only.

4.1. Semantic parser
A semantic parser forms the first part of our system. Semantic information is extracted from the exercise instruction texts, isolating action, body part and location. Two different semantic parsers for extracting this information were adapted and tested. Both approaches were found to work adequately for our needs.

4.1.1. Using a constructional analyzer
Construction grammar is primarily a linguistic theory that is designed based on speakers' knowledge [22]. A construction basically consists of two parts: form and meaning. Form describes syntactic, morphological or prosodic patterns, and meaning describes lexical semantics, pragmatics and discourse structure [50]. In Fig. 4, we show examples of form and meaning as used for exercise instructions. As mentioned earlier, construction grammar is mainly used in cognitive linguistics.
To understand the natural language text of the exercise instruction sheets for our text-to-animation system, we employed embodied construction grammar (ECG), because, unlike other construction grammars, it is not limited to mappings between phonological forms and conceptual representations [26].

Figure 4. Examples of form and meaning; symbolic representations of specific manifestations of the meaning that relates to the lexical forms.

We wrote our own construction grammar to analyze text-based physical exercise instructions. In Fig. 5, we show a specific instance of a construction for BendShoulder. As shown in Fig. 5, BendShoulder contains three main elements: constituents, form and meaning. In BendShoulder, there are two constituents, BEND and SHOULDER, labeled b and s, respectively. The form part consists of the constraint bf before sf, which means that BEND always occurs before SHOULDER. The meaning part consists of the binding bm.bend ↔ sm, which connects the meaning poles of the two constituents.

Figure 5. Left: construction grammar for BendShoulder; right: results from the embodied construction grammar.

In our grammar, we merged similar constructions into single categories, e.g. for different body parts and different action verbs. As shown in Fig. 6, there are two constructions named BendShoulder and BendElbow with the constituents Bend, Shoulder and Bend, Elbow, respectively. Both shoulder and elbow belong to the human body parts. Therefore, we merged BendShoulder and BendElbow into one category named BendBodypart, which consists of the constituents Bend and Bodypart. For shoulder and elbow, we designed separate construction grammar elements as subcases of Bodypart, as shown in Fig. 6.

Figure 6. Merging BendShoulder and BendElbow.

In this approach, we used a constructional analyzer to extract the semantic information. There are different types of formal construction grammars, but here we used ECG to run our own constructional analyzer. We analyzed the sentence structures in our PEISC corpus and developed our own construction grammar, covering mainly exercise instructions. In Fig. 5, we show the result of our construction grammar using ECG for Example 1, 'lift your right arm'. The results show that we obtain all the information (action, body part, location) required to create an animation from a textual instruction sheet, provided that this information is available in the text. A minimal sketch of how such merged constructions could be represented as data is given below.
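The following is an illustrative-only sketch of how the merged BendBodypart construction of Fig. 6 might be represented programmatically. The actual system uses the ECG formalism and its constructional analyzer; the class, field and binding names here are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical data representation of a merged construction (cf. Fig. 6); not the ECG grammar itself. */
public class ConstructionSketch {

    /** A construction pairs constituents with form constraints and meaning bindings. */
    record Construction(String name,
                        List<String> constituents,       // e.g. [BEND, BODYPART]
                        List<String> formConstraints,    // ordering constraints over the form poles
                        Map<String, String> meaningBindings) { }

    public static void main(String[] args) {
        // General construction obtained by merging BendShoulder and BendElbow.
        Construction bendBodypart = new Construction(
                "BendBodypart",
                List.of("BEND", "BODYPART"),
                List.of("b.f before bp.f"),              // the action verb precedes the body part
                Map.of("b.m.bodypart", "bp.m"));         // bind the body-part role of BEND's meaning

        // Lexical subcases of BODYPART, analogous to the subcase links in Fig. 6.
        List<String> bodypartSubcases = List.of("Shoulder", "Elbow", "Knee", "Wrist");

        System.out.println(bendBodypart.name() + " covers: " + bodypartSubcases);
    }
}
```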
4.1.2. Using a rule-based parser
In this approach, we developed a semantic parser on top of the Stanford syntactic parser. We set up rules for the semantic parser based on an analysis of the sentence structures in the PEISC corpus. We then used different frames to analyze the meaning of the different phrases or words of the instructions. While we added some frames of our own for the analysis of exercise instructions (e.g. Action, Bodypart, Destination, Direction, etc.), most basic frames were adopted from FrameNet [21]. In our semantic parser, the instructions are first passed through the Stanford syntactic parser; the output is then matched against the rules of our semantic parser mentioned above. Finally, the system produces an output with the most probable matching frames, as shown in Fig. 7. For the two instructions 'lift your right arm' and 'bring your hands toward your shoulders', the resulting analyses are shown in Fig. 8. Action, Bodypart, Direction (direction of the location) and Destination (destination location) are the additional frames that were tested for these two instructions, as displayed in Fig. 8. As with the construction grammar, we found that our rule-based semantic parser is also able to extract all the information (action, body part, location) required to create an animation from a textual instruction sheet. A minimal sketch of this rule-and-frame matching is given below.

Figure 7. Framework of the rule-based semantic parser.

Figure 8. Results from our rule-based semantic parser; left: lift your right arm; right: bring your hands towards your shoulders.
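The sketch below illustrates the rule-and-frame matching step only. The actual system first runs the Stanford syntactic parser and applies rules to its output; this simplified version matches a small hypothetical lexicon directly against the tokenized instruction and treats a body part that follows a direction word as a Destination, which is one plausible reading of Fig. 8 rather than the authors' implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified frame tagger for exercise instructions; lexicon and rules are illustrative. */
public class RuleBasedFrameTagger {

    private static final Set<String> ACTIONS =
            Set.of("lift", "bring", "bend", "push", "tilt", "stretch", "raise", "lower");
    private static final Set<String> BODYPARTS =
            Set.of("arm", "arms", "hand", "hands", "shoulder", "shoulders",
                   "leg", "legs", "knee", "elbow", "wrist", "ankle", "head");
    private static final Set<String> DIRECTIONS = Set.of("toward", "towards", "up", "down");
    private static final Set<String> SIDES = Set.of("left", "right");

    /** Assigns a coarse frame label to each matching token; unmatched tokens are skipped. */
    static Map<String, List<String>> tagFrames(String instruction) {
        Map<String, List<String>> frames = new LinkedHashMap<>();
        String lastFrame = null;
        for (String token : instruction.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+")) {
            String frame = null;
            if (ACTIONS.contains(token)) frame = "Action";
            else if (SIDES.contains(token)) frame = "Bodypart";
            else if (BODYPARTS.contains(token)) {
                // A body part following a direction word is treated as the destination location.
                frame = "Direction".equals(lastFrame) ? "Destination" : "Bodypart";
            }
            else if (DIRECTIONS.contains(token)) frame = "Direction";
            if (frame != null) {
                frames.computeIfAbsent(frame, k -> new ArrayList<>()).add(token);
                lastFrame = frame;
            }
        }
        return frames;
    }

    public static void main(String[] args) {
        System.out.println(tagFrames("lift your right arm"));
        // {Action=[lift], Bodypart=[right, arm]}
        System.out.println(tagFrames("bring your hands toward your shoulders"));
        // {Action=[bring], Bodypart=[hands], Direction=[toward], Destination=[shoulders]}
    }
}
```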
4.2. Bayesian network
After acquiring the semantic structure of the instruction, the system must ensure that adequate actionable information, allowing for a mapping to body movements, is present. This frequently encompasses information that is only implicitly contained in the exercise description and cannot be recovered with the semantic parsing methods discussed above. We explored the application of Bayesian networks for explicating the implicit or hidden information in the given exercise instructions. Our Bayesian network consists of three different variables with 50 different values, as displayed in Fig. 9 and listed in Table 3.

Figure 9. Left: Bayesian model for the hidden information, i.e. 'location'; right: locations of the body.

Table 3. The different possible values of the variables.
Action (14 values): lift, bring, bend, push, keep, bent, tilt, rest, lower, stretch, pull, sit, reach, raise.
Bodypart (14 values): l_hip, r_hip, l_knee, r_knee, l_ankle, r_ankle, l_shoulder, r_shoulder, l_elbow, r_elbow, l_wrist, r_wrist, HumanoidRoot, skullbase (head).
Location (22 values): location 1 to location 22.

Variables, values and the conditional probability table (CPT) compose the parts of the Bayesian network. To build a reliable Bayesian network, an appropriate and reliable CPT is the core challenge. Initially, the model was informed by word frequencies from our PEISC corpus [47] to build the CPT, as shown in Fig. 10.

Figure 10. Bayesian network for lift your left arm.

In this form, if the system is employed to analyze an exercise instruction such as lift your left arm, the text first goes through the semantic parser, which extracts the following information: Action: lift; Bodypart: left arm. The extracted information lift and left arm is set as evidence in the Bayesian network, as shown in Fig. 10, leading to the extraction of the probable implicit destination location. Here, however, the result contains three different probable locations, i.e. three different coordinates: location 2, location 3 and location 17, as shown in Table 4. Accordingly, we updated the CPT of the Bayesian network using crowdsourcing, which we analyze in Section 4.3. A minimal sketch of this evidence-and-query step is given below, after Table 4.

Table 4. Results before and after the survey for lift your left arm.
Location | Before survey (%) | After survey (%)
location 2 | 32.7 | 22.7
location 3 | 32.7 | 44.7
location 17 | 32.7 | 30.7
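The sketch below illustrates setting action and body part as evidence and querying the most probable destination location. The actual system uses the SamIam toolkit (see Section 4.3); here the network P(Location | Action, Bodypart) is reduced to a single hand-filled CPT slice using the after-survey values of Table 4, so the numbers and the lookup structure are illustrative only.

```java
import java.util.Map;

/** Toy evidence-and-query step over one CPT slice; not the SamIam-based network itself. */
public class LocationInference {

    // CPT slice: P(Location | Action = lift, Bodypart = l_shoulder), after-survey values of Table 4.
    private static final Map<String, Map<String, Double>> CPT = Map.of(
            "lift|l_shoulder", Map.of("location 2", 0.227,
                                      "location 3", 0.447,
                                      "location 17", 0.307)
    );

    /** Sets action and body part as evidence and returns the most probable location. */
    static String mostProbableLocation(String action, String bodypart) {
        Map<String, Double> posterior = CPT.get(action + "|" + bodypart);
        if (posterior == null) {
            return "unknown"; // action/bodypart combination not covered by the network
        }
        return posterior.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();
    }

    public static void main(String[] args) {
        // 'lift your left arm' -> evidence: action = lift, bodypart = l_shoulder (left arm)
        System.out.println(mostProbableLocation("lift", "l_shoulder")); // prints: location 3
    }
}
```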
4.2.1. Automatic updates of the Bayesian network
As mentioned above, the proposed Bayesian network features three variables. The action variable contains 14 possible values. This means that the system will only work automatically if the exercise instruction contains one of these 14 values; if the instruction contains some other action verb, the system would fail. For example, for the exercise instruction turn your head, it would fail because turn is not a value of the action variable. Therefore, we designed the network in such a manner that the action variable is automatically extended if the system encounters an exercise description with an action verb that is not yet a value of the action variable. Hence, when facing the expression turn your head, the system automatically adds turn as a value of the action variable and updates the Bayesian network, as shown in Fig. 11. A minimal sketch of this automatic extension is given below.

Figure 11. Adding turn as a value of the action variable.
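The following sketch illustrates the automatic extension of the action variable. The real system updates a SamIam Bayesian network; this version only extends the value set and initializes a uniform CPT row for the new verb, which is an assumption about how an unknown action is handled before any crowd feedback arrives.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative extension of the action variable with a previously unseen verb. */
public class ActionVariableUpdater {

    private final Set<String> actionValues = new LinkedHashSet<>(Set.of(
            "lift", "bring", "bend", "push", "keep", "bent", "tilt",
            "rest", "lower", "stretch", "pull", "sit", "reach", "raise"));

    // CPT rows keyed by "action|bodypart", each mapping candidate locations to probabilities.
    private final Map<String, Map<String, Double>> cpt = new HashMap<>();

    /** Ensures the action verb is known; unknown verbs are added with a uniform CPT row. */
    void ensureAction(String action, String bodypart, Set<String> candidateLocations) {
        if (actionValues.add(action)) {
            Map<String, Double> uniform = new HashMap<>();
            for (String location : candidateLocations) {
                uniform.put(location, 1.0 / candidateLocations.size());
            }
            cpt.put(action + "|" + bodypart, uniform);
            System.out.println("Added new action value: " + action);
        }
    }

    public static void main(String[] args) {
        ActionVariableUpdater updater = new ActionVariableUpdater();
        // 'turn your head' -> 'turn' is not among the 14 original action values.
        updater.ensureAction("turn", "skullbase", Set.of("location 1", "location 2"));
    }
}
```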
4.3. Automatic CPT update using crowds
To design a reliable and scalable conditional probability table (CPT), we update our CPT using crowdsourcing [51]. To this end, we developed a system implemented in Java, using the SamIam system6 for the Bayesian network. In the future, we will integrate this sub-system into the fourth stage of our framework, the human computation step. In this study, we asked people to rate 13 different exercises presented through 44 different exercise videos, where every exercise was represented by three to four candidate videos containing different enactions of potential target destinations. The destination locations correspond to different possible coordinates of the respective body part. The participants were asked to rate the videos on a scale of 1-5, where 1 stands for the best and 5 for the worst execution with respect to the instructions of the exercise. The original written instructions of all 13 exercises do not explicitly mention the destination location of the body part. Every exercise also corresponds to precisely one body part or human joint (i.e. one value of the Bodypart variable in the Bayesian network). The list of the 13 exercises, together with the corresponding body parts, is provided in Table 5.

Table 5. List of exercises with corresponding body parts contained in the Bayesian network.
Num. | Exercise | Corresponding body part | No. of videos
1 | Lift your left arm | l_shoulder | 3
2 | Lift your right arm | r_shoulder | 3
3 | Bend your left ankle | l_ankle | 3
4 | Bend your right ankle | r_ankle | 3
5 | Bend your left knee | l_knee | 4
6 | Bend your right knee | r_knee | 4
7 | Bend your left wrist | l_wrist | 3
8 | Bend your right wrist | r_wrist | 3
9 | Bend your left leg | l_hip | 4
10 | Bend your right leg | r_hip | 4
11 | Bend your left elbow | l_elbow | 3
12 | Bend your right elbow | r_elbow | 3
13 | Tilt your head | skullbase | 4

The system was updated using the following heuristic: if a participant rated a video with quality level 1, the CPT entries of the Location variable for the locations shown in the remaining, unrated candidate videos (two or three videos) were each decreased by 25%, and the removed probability mass was added to the location corresponding to the rated video. For ratings of 2, 3 and 4, the decrease is 20%, 15% and 10%, respectively, and for a rating of 5 the CPT remains unchanged. A minimal sketch of this update heuristic is given at the end of this subsection. Figure 12 shows how the system looks for three and four candidate videos.

Figure 12. Automatic CPT update application; left: bend your right leg (r_hip); right: lift your left arm (l_shoulder).

Considering, for example, the expression 'lift your left arm', the model originally contained three different possible destination locations with the same probability, as shown in Fig. 10. In Fig. 12, every video is presented together with a rating scale (1-5). If a participant rates a video with 1, 2 or 3, the probabilities of the other location values represented by the candidate videos decrease by 25%, 20% or 15%, respectively, and the removed amount is added to the rated video's location; a rating of 4 shifts 10%, and for a rating of 5 the CPT is not adjusted. In cases with four possible locations, the system produces four candidate videos, as shown in Fig. 12. Using this approach, the probabilities encoded in the network change based on the crowd's input. For example, the probability of location 3 for 'lift your left arm' increased from 32.7% to 44.7% after updating the CPT using the crowd, as shown in Table 4. A total of 32 participants took part in our study (mainly students who stated that they regularly perform physical exercises). The study was performed offline in order to keep track of the participants' records. For two exemplary exercises, the results are shown in Fig. 13.

Figure 13. Study responses for the Bayesian network for lift your left arm and bend your left knee.
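The sketch below applies one crowd rating to the location distribution of one exercise, following the heuristic described above: a rating of 1, 2, 3 or 4 removes 25%, 20%, 15% or 10% of the probability of every non-rated candidate location (our reading of the description) and adds the removed mass to the rated location, while a rating of 5 leaves the CPT unchanged. It is an illustration of the heuristic, not the authors' Java/SamIam implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative crowd-based CPT update for the destination-location distribution of one exercise. */
public class CptCrowdUpdate {

    private static final double[] SHIFT = {0.25, 0.20, 0.15, 0.10, 0.0}; // ratings 1..5

    /** Applies one crowd rating to the location distribution. */
    static void applyRating(Map<String, Double> locationProbs, String ratedLocation, int rating) {
        double share = SHIFT[rating - 1];
        double moved = 0.0;
        for (Map.Entry<String, Double> entry : locationProbs.entrySet()) {
            if (!entry.getKey().equals(ratedLocation)) {
                double delta = entry.getValue() * share;
                entry.setValue(entry.getValue() - delta); // decrease the non-rated locations
                moved += delta;
            }
        }
        locationProbs.merge(ratedLocation, moved, Double::sum); // add the removed mass
    }

    public static void main(String[] args) {
        // 'lift your left arm': three equally probable destination locations (cf. Fig. 10).
        Map<String, Double> probs = new HashMap<>(Map.of(
                "location 2", 0.327, "location 3", 0.327, "location 17", 0.327));
        applyRating(probs, "location 3", 1); // a participant rates the location-3 video as best
        System.out.println(probs);
    }
}
```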
4.3.1. Outcomes
In the following, we evaluate how the whole system works for an exercise instruction with implicit information. Taking the example of lift your right arm, the system first extracts the semantic information of the instruction using the semantic parser, as shown in Figs 5 and 8. From the first part of our system, we obtain the action (lift) and the body part (right arm). However, the remaining type of information (location) is still missing after the semantic analysis. Therefore, the system moves to the second part, the Bayesian network, to extract the implicit destination location of the exercise. The system automatically sets lift for the action variable and r_shoulder (right arm) for the body part variable as evidence in the Bayesian network, as shown in Fig. 10. The most probable value of the implicit location (destination) variable is then determined automatically; in this example, the system returns location 3 (overhead), as shown in Fig. 10. The overall result for the instruction lift your right arm is: Action verb: lift; Bodypart: your right arm; Location: overhead. The results of the system for a list of exercises in which locations are not mentioned are shown in Table 6. For the limited set of fourteen body parts, we obtain largely accurate values for the locations that are left implicit in the exercise instructions. In cases where the sentence structure of the exercise instruction is very complex, we obtain some unsatisfying results, which we also aim to correct in the future using the human computation approach.

Table 6. List of exercises with their corresponding results.
Exercise | Action | Bodypart | Location
Bend your left ankle | Bend | your left ankle | location 10
Tilt your head | Tilt | your head | location 1, location 2
Stretch your leg | Stretch | your leg | location 10
Bend your right elbow | Bend | your right elbow | location 1
Raise your left shoulder | Raise | your left shoulder | location 3
Lower your head | Lower | your head | location 11
Bring your left arm toward your shoulder | Bring | your left arm | location 1
Bring your right arm toward your shoulder | Bring | your right arm | location 2
Push your right leg toward opposite | Push | your right leg | location 10
Push your left leg toward opposite | Push | your left leg | location 9
4.4. Animation generation
Animation creation is the third step of our model, as shown in Fig. 3. Using the information on action, body part and location retrieved in the first two steps, the system automatically generates an animation file. Currently, this step employs the Behavior Markup Language (BML) [31], which runs on the Artificial Social Agent Platform (ASAP) framework [52], to generate the targeted exercise animation.

4.4.1. ASAP and BML
ASAP was mainly designed to facilitate human-like fluid conversation with embodied conversational agents. The ASAP architecture consists of two central parts, as shown in Fig. 14: a behavior generation sub-system on the left side and a behavior processing sub-system on the right side.

Figure 14. ASAP framework [52].

BML is an XML language for controlling the verbal and nonverbal behavior of embodied conversational agents (ECAs). BML documents are written inside a <bml> block, as shown in Fig. 15. Posture-related elements are used for keyframe animations with humanoid characters and correspond to human joints or body parts, such as the left arm (l_shoulder), as indicated in Fig. 15. A timing value specifies the time required to complete one pose mentioned in the posture element; e.g. a value of 5 indicates that completing the first pose takes 5 s. The system automatically generates this type of XML file, as shown in Fig. 15; a minimal sketch of such generation is given below.

Figure 15. A BML example for lift your left arm.
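The sketch below builds a small BML-like document for one extracted action/body part/location triple. Only the outer <bml> element and its namespace are standard BML; the inner element and attribute names used for the keyframe pose are placeholders, since the exact markup produced by the system is shown only in Fig. 15.

```java
/** Illustrative generation of a BML-like animation file; inner elements are placeholders. */
public class BmlGenerator {

    /** Builds a minimal BML-like document for one extracted action/joint/location triple. */
    static String generateBml(String action, String joint, String location, int durationSeconds) {
        StringBuilder bml = new StringBuilder();
        bml.append("<bml id=\"bml1\" xmlns=\"http://www.bml-initiative.org/bml/bml-1.0\">\n");
        bml.append("  <!-- ").append(action).append(" -> joint ").append(joint)
           .append(", target ").append(location).append(" -->\n");
        bml.append("  <posture id=\"pose1\" start=\"0\" end=\"").append(durationSeconds).append("\">\n");
        bml.append("    <pose joint=\"").append(joint).append("\" target=\"").append(location).append("\"/>\n");
        bml.append("  </posture>\n");
        bml.append("</bml>\n");
        return bml.toString();
    }

    public static void main(String[] args) {
        // 'lift your left arm': joint l_shoulder, inferred destination 'location 3', 5 s per pose.
        System.out.println(generateBml("lift", "l_shoulder", "location 3", 5));
    }
}
```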
4.5. Animation validation
After creating the animation, the system validates it with the help of the crowd, as part of the fourth step of our system. If the crowd finds that the animation has not been generated properly according to the text-based exercise instruction, crowd workers are able to give feedback about the source and destination locations (or coordinates) and about the related body part. Based on the crowd's input, the system automatically updates the animation and returns to the validation step, until the crowd rates the animation as a technically correct output.

4.6. System output
Currently, our text-to-animation system for physical exercises produces results for single-sentence exercise instructions consisting of a single action or motion. The output animations are satisfying with respect to the movement of the related body part and the coordinates of the destination location. The step-by-step results of the system, following our proposed framework, for the exercise instruction lift your right arm are shown in Fig. 16. The accuracy of the output animation is satisfying as long as the previous step does not return an unsatisfying location; cases where it does are also to be corrected in the future using the human computation loop.

Figure 16. Result of the text-to-animation system for the exercise instruction lift your right arm.

5. DISCUSSION AND FUTURE WORK
Analyzing instructions is not an easy task for computers. In this paper, we presented a system called TAPE (text to animation system for physical exercises). The model consists of a semantic parser for determining explicit expressions, a Bayesian model for filling gaps that result from implicit information, and an animation system based on a markup language to generate animation candidates. A human computation step is included as a fourth and final step for the validation of the animation; this step is part of the future work on our system. Currently, the TAPE system is limited to single-sentence instructions. The first and second steps of the TAPE system can extract implicit information in any expression containing the three types discussed above (Action, Bodypart, Location). In future work, we will aim at extending our system to multi-action and multi-sentence instructions and at facilitating the filling of a broader array of types of potentially implicit information. We are also working on a further prototype that is able to extend the CPT with new variable values when actions are encountered that are not yet supported by the network.

6. CONCLUSION
While multiple-purpose robotic technology promises a bright future, the realization of such systems still faces considerable challenges. Among these challenges, knowledge acquisition and action validation are major concerns that can arguably be addressed with a system as discussed in this paper. We are developing a pipeline for multiple-purpose robotics that can allow robots to enact exercises or other activities that are parsed from textual instructions, such as instruction sheets or online how-to repositories. As a step toward that goal, we are developing a text-to-animation system using simple physical exercise instructions as an initial domain. To develop this kind of animation system from exercise instruction sheets, language understanding by machines is an essential element, which is not achievable with existing systems alone due to the implicit information contained in these instructions. Therefore, we developed a system to extract all required information from exercise instructions, including Bayesian networks to fill in implicit information. Finally, an animation system is employed for generating virtual enactions that can be validated, before the adequacy of the course of action is assured, without the complications entailed by physical executions on an actual robotic system.
In future work, we will also aim at extending the set of variables to facilitate more complex expressions, and at including further values for existing variables, using Bayesian models that extend automatically by introducing new categories whenever a corresponding need is expressed during the human computation step. This would also move the system further towards more complex, multi-sentence instructions, with the possibility of combining movements of different body parts at the same time. Hence, human computation will be explored for making the Bayesian models more expressive. Lastly, the text to animation system will be upgraded to link with established robotic animation platforms, allowing for more accurate physical outcome generation and candidate validation, and for eventually transferring the outcomes to physical robots in real-world situations.

ACKNOWLEDGEMENTS

The research reported in this paper has been (partially) supported by the German Research Foundation DFG, as part of Collaborative Research Center (Sonderforschungsbereich) 1320 ‘EASE—Everyday Activity Science and Engineering’, University of Bremen (http://www.ease-crc.org/). The research was conducted in sub-projects H2 ‘Mining and explicating instructions for everyday activities’ and P1 ‘Embodied semantics for the language of action and change’.

Footnotes

1 http://www.instructables.com/; Access date: 18 January 2018.
2 http://www.wikihow.com/Main-Page; Access date: 11 January 2018.
3 http://www.sparkpeople.com/
4 http://www.bodbot.com/
5 http://www.yogajournal.com/
6 http://reasoning.cs.ucla.edu/samiam; Access date: 15 October 2017.

REFERENCES

1 Tenorth, M., Bartels, G. and Beetz, M. (2014) Knowledge-Based Specification of Robot Motions. Proc. Twenty-First European Conf. Artificial Intelligence (ECAI'14), Amsterdam, The Netherlands, pp. 873–878. IOS Press.
2 Kostavelis, I. and Gasteratos, A. (2015) Semantic mapping for mobile robotics tasks. Robot. Auton. Syst., 66, 86–103.
3 Ju, Z., Yang, C. and Ma, H. (2014) Kinematics Modeling and Experimental Verification of Baxter Robot. Proc. 33rd Chinese Control Conference, Nanjing, China, July, pp. 8518–8523.
4 Beetz, M., Klank, U., Kresse, I., Maldonado, A., Mosenlechner, L., Pangercic, D., Ruhr, T. and Tenorth, M. (2011) Robotic Roommates Making Pancakes. 2011 11th IEEE-RAS Int. Conf. Humanoid Robots, Bled, Slovenia, October, pp. 529–536. IEEE.
5 Smeddinck, J.D., Voges, J., Herrlich, M. and Malaka, R. (2014) Comparing Modalities for Kinesiatric Exercise Instruction. CHI '14 Extended Abstracts on Human Factors in Computing Systems (CHI EA '14), New York, NY, USA, pp. 2377–2382. ACM.
6 Uzor, S. and Baillie, L. (2013) Exploring & Designing Tools to Enhance Falls Rehabilitation in the Home. Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI '13), New York, NY, USA, pp. 1233–1242. ACM.
7 Coyne, B. and Sproat, R. (2001) WordsEye: An Automatic Text-to-scene Conversion System. Proc. 28th Annual Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '01), New York, NY, USA, pp. 487–496. ACM.
8 Åkerberg, O., Svensson, H., Schulz, B. and Nugues, P. (2003) CarSim: An Automatic 3D Text-to-scene Conversion System Applied to Road Accident Reports. Proc. Tenth Conf. European Chapter of the Association for Computational Linguistics, Volume 2, pp. 191–194. Association for Computational Linguistics.
9 Porzel, R. (2010) Contextual Computing: Models and Applications.
Springer Science & Business Media.
10 Chang, N., Feldman, J., Porzel, R. and Sanders, K. (2002) Scaling Cognitive Linguistics: Formalisms for Language Understanding. Proc. 1st Int. Workshop on Scalable Natural Language Understanding.
11 Friedman, N., Geiger, D. and Goldszmidt, M. (1997) Bayesian Network Classifiers. Mach. Learn., 29, 131–163.
12 Ma, M. (2006) Automatic Conversion of Natural Language to 3D Animation. PhD thesis, University of Ulster.
13 Chang, A., Savva, M. and Manning, C. (2014) Semantic Parsing for Text to 3D Scene Generation. Proc. ACL 2014 Workshop on Semantic Parsing, Baltimore, MD, June, pp. 17–21. Association for Computational Linguistics.
14 Chang, A., Monroe, W., Savva, M., Potts, C. and Manning, C.D. (2015) Text to 3D Scene Generation with Rich Lexical Grounding. Proc. 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Int. Joint Conf. Natural Language Processing (Volume 1: Long Papers), Beijing, China, July, pp. 53–62. Association for Computational Linguistics.
15 Smeddinck, J.D. (2016) Games for Health. In Dörner, R., Göbel, S., Kickmeier-Rust, M., Masuch, M. and Zweig, K. (eds.), Entertainment Computing and Serious Games: International GI-Dagstuhl Seminar 15283, Dagstuhl Castle, Germany, July 5–10, 2015, Revised Selected Papers. Springer International Publishing, Cham.
16 Narayanan, S.S. (1997) Knowledge-Based Action Representations for Metaphor and Aspect (KARMA). PhD thesis, University of California at Berkeley.
17 Marneffe, M., Maccartney, B. and Manning, C. (2006) Generating Typed Dependency Parses from Phrase Structure Parses. Proc. Fifth Int. Conf. Language Resources and Evaluation (LREC-2006), Genoa, Italy, May. European Language Resources Association (ELRA). ACL Anthology Identifier: L06-1260.
18 Pradhan, S.S., Ward, W.H., Hacioglu, K., Martin, J.H. and Jurafsky, D. (2004) Shallow Semantic Parsing Using Support Vector Machines. In Dumais, S., Marcu, D. and Roukos, S. (eds.), HLT-NAACL 2004: Main Proceedings, Boston, Massachusetts, USA, May 2–May 7, pp. 233–240. Association for Computational Linguistics.
19 Björkelund, A., Bohnet, B., Hafdell, L. and Nugues, P. (2010) A High-Performance Syntactic and Semantic Dependency Parser. Proc. 23rd Int. Conf. Computational Linguistics: Demonstrations (COLING '10), Stroudsburg, PA, USA, pp. 33–36. Association for Computational Linguistics.
20 Chen, D., Schneider, N., Das, D. and Smith, N.A. (2010) SEMAFOR: Frame Argument Resolution with Log-Linear Models. Proc. 5th Int. Workshop on Semantic Evaluation (SemEval '10), Stroudsburg, PA, USA, pp. 264–267. Association for Computational Linguistics.
21 Baker, C.F., Fillmore, C.J. and Lowe, J.B. (1998) The Berkeley FrameNet Project. Proc. 36th Annual Meeting of the Association for Computational Linguistics and 17th Int. Conf. Computational Linguistics, Volume 1 (ACL '98), Stroudsburg, PA, USA, pp. 86–90. Association for Computational Linguistics.
22 Goldberg, A.E. (1995) Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago.
23 Evans, V. and Green, M. (2006) Cognitive Linguistics: An Introduction. Lawrence Erlbaum Associates Publishers.
24 Langacker, R.W. (2008) Cognitive Grammar: A Basic Introduction. OUP, USA.
25 Croft, W. (2001) Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford University Press.
26 Bergen, B.
and Chang, N. (2005) Embodied construction grammar in simulation-based language understanding. Construction Grammars, 3, 147–190.
27 Steels, L. (2011) Design Patterns in Fluid Construction Grammar. John Benjamins Publishing.
28 Brutzman, D. (1998) The virtual reality modeling language and Java. Commun. ACM, 41, 57–64.
29 Cobo, M. and Bieri, H. (2002) A Web3D Toolbox for Creating H-Anim Compatible Actors. Proc. Computer Animation 2002 (CA2002), pp. 120–125. IEEE.
30 Kranstedt, A., Kopp, S. and Wachsmuth, I. (2002) MURML: A Multimodal Utterance Representation Markup Language for Conversational Agents. AAMAS'02 Workshop Embodied Conversational Agents—Let's Specify and Evaluate Them!
31 Kopp, S., Krenn, B., Marsella, S., Marshall, A.N., Pelachaud, C., Pirker, H., Thórisson, K.R. and Vilhjálmsson, H. (2006) Towards a Common Framework for Multimodal Generation: The Behavior Markup Language. Intelligent Virtual Agents: 6th Int. Conf. (IVA 2006), Marina Del Rey, CA, USA, August 21–23, pp. 205–217.
32 Elliott, R., Glauert, J.R.W., Jennings, V. and Kennaway, J.R. (2004) An Overview of the SiGML Notation and SiGML Signing Software System. Sign Language Processing Satellite Workshop of the Fourth Int. Conf. Language Resources and Evaluation (LREC 2004), May, pp. 98–104.
33 Arafa, Y. and Mamdani, A. (2003) Scripting Embodied Agents Behaviour with CML: Character Markup Language. Proc. 8th Int. Conf. Intelligent User Interfaces (IUI '03), New York, NY, USA, pp. 313–316. ACM.
34 De Carolis, B., Pelachaud, C., Poggi, I. and Steedman, M. (2004) APML, A Markup Language for Believable Behavior Generation. In Prendinger, H. and Ishizuka, M. (eds.), Life-Like Characters: Tools, Affective Functions, and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg.
35 Chi, D., Costa, M., Zhao, L. and Badler, N. (2000) The EMOTE Model for Effort and Shape. Proc. 27th Annual Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '00), New York, NY, USA, pp. 173–182. ACM Press/Addison-Wesley Publishing Co.
36 Kaur, K. and Kumar, P. (2016) HamNoSys to SiGML conversion system for sign language automation. Procedia Comput. Sci., 89, 794–803.
37 Brooks, R. and Cagle, K. (2002) The Web Services Component Model and HumanML. Technical report, OASIS/HumanML Technical Committee.
38 Kshirsagar, S., Magnenat-Thalmann, N., Guye-Vuillème, A., Thalmann, D., Kamyab, K. and Mamdani, E. (2002) Avatar Markup Language. Proc. Workshop on Virtual Environments 2002 (EGVE '02), Aire-la-Ville, Switzerland, pp. 169–177. Eurographics Association.
39 Prendinger, H., Descamps, S. and Ishizuka, M. (2004) MPML: a markup language for controlling the behavior of life-like characters. J. Vis. Lang. Comput., 15, 183–203.
40 Okazaki, N., Aya, S., Saeyor, S. and Ishizuka, M. (2002) A Multimodal Presentation Markup Language MPML-VR for a 3D Virtual Space. Proc. (CD-ROM) Workshop on Virtual Conversational Characters: Applications, Methods, and Research Challenges, Melbourne, Australia.
41 Huang, Z., Eliëns, A. and Visser, C. (2003) Implementation of a Scripting Language for VRML/X3D-Based Embodied Agents. Proc. Eighth Int. Conf. 3D Web Technology (Web3D '03), New York, NY, USA, pp. 91–100. ACM.
42 Cassell, J., Vilhjálmsson, H.H. and Bickmore, T.
(2001) BEAT: The Behavior Expression Animation Toolkit. Proc. 28th Annual Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '01), New York, NY, USA, pp. 477–486. ACM.
43 Perlin, K. and Goldberg, A. (1996) Improv: A System for Scripting Interactive Actors in Virtual Worlds. Proc. 23rd Annual Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '96), New York, NY, USA, pp. 205–216. ACM.
44 Marriott, A. (2001) VHML—Virtual Human Markup Language. Talking Head Technology Workshop, at OzCHI Conference, pp. 252–264.
45 Geiger, C., Müller, W. and Rosenbach, W. (1998) SAM—An Animated 3D Programming Language. IEEE Symposium on Visual Languages, Halifax, Canada.
46 Carletta, J. (1996) Assessing agreement on classification tasks: the Kappa statistic. Comput. Linguist., 22, 249–254.
47 Sarma, H., Porzel, R., Smeddinck, J. and Malaka, R. (2015) Towards Generating Virtual Movement from Textual Instructions: A Case Study in Quality Assessment. Proc. Third AAAI Conf. Human Computation and Crowdsourcing (HCOMP-2015), San Diego, USA. AAAI.
48 Sarma, H., Porzel, R. and Malaka, R. (2017) A Step Toward Automated Simulation in Industry. Dynamics in Logistics: Proc. 5th Int. Conf. LDIC 2016, Bremen, Germany. Springer International Publishing.
49 Sarma, H., Porzel, R., Malaka, R. and Samaddar, A.B. (2017) A Step towards Textual Instructions to Virtual Actions. 2017 IEEE 7th Int. Advance Computing Conf. (IACC), Hyderabad, India, January, pp. 239–243. IEEE.
50 Fried, M. (2014) Construction Grammar. In Alexiadou, A. and Kiss, T. (eds.), Handbook of Syntax (2nd edn). de Gruyter, Berlin.
51 Sarma, H., Samaddar, A.B., Porzel, R., Smeddinck, J.D. and Malaka, R. (2017) Updating Bayesian networks using crowds. Neural Netw. World, 529–540.
52 Kopp, S., van Welbergen, H., Yaghoubzadeh, R. and Buschmeier, H. (2014) An architecture for fluid real-time conversational agents: integrating incremental output generation and input processing. J. Multimodal User Interfaces, 8, 97–108.
© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) TI - A Text to Animation System for Physical Exercises JF - The Computer Journal DO - 10.1093/comjnl/bxy014 DA - 2018-11-01 UR - https://www.deepdyve.com/lp/oxford-university-press/a-text-to-animation-system-for-physical-exercises-kQUDsYDRNU SP - 1589 VL - 61 IS - 11 DP - DeepDyve ER -