Huang, Lihe

Abstract

Human interaction is multimodal in nature, and meaning in discourse is created through the interplay of an array of modalities. Inspired by the integration of multimodality and corpus pragmatics, we are concerned with a general multimodal framework that facilitates the exploration of pragmatic questions using corpus methods. By demonstrating how the study of speech acts in situated discourse benefits from the multimodal corpus approach, we argue that the scope and methods of pragmatic studies will be enriched and that classic pragmatic theories can be further developed toward multimodal corpus pragmatics. This article, therefore, hopes to inspire further theoretical discussion and case studies in this domain.

1 Introduction

The last 60 years have witnessed corpus linguistics contribute much to many fields of linguistic study. Meanwhile, corpus linguistics itself has developed prosperously. Born in the ‘pre-electronic’ era, the corpus has rapidly developed into an indispensable paradigm, owing to the sophistication of computer technology and researchers’ more comprehensive outlook on language. In the last decade or two, besides large-scale studies of lexico-grammar on the basis of written corpora, ‘there has been a consistent effort in the exploration of spoken discourse’ (Knight and Adolphs, 2008, p. 175). Linguists began to pay increasing attention to the fact that verbal and nonverbal cues fit together into integrated messages in daily interaction. Against this background, the multimodal corpus came into being as ‘Corpus 4.0’ (Knight, 2009, pp. 16–28), providing a richer representation of different aspects of discourse, including participants’ utterance content, prosody, and gestures, as well as the context or co-text in which interaction takes place. Multimodal corpora can therefore provide rich and valuable information for both quantitative and qualitative investigation of language use in different situations.
Pragmatics, for its part, is expected to pay more attention to data from authentic daily conversation. Levinson’s foundational 1983 text on pragmatics devotes a whole chapter to Conversation Analysis, which blurs the borders between the two traditions. Representative journals in this field, such as ‘Pragmatics’ and ‘The Journal of Pragmatics’, have published several papers by conversation analysts. In Conversation Analysis, researchers recognize the importance of gestures1 in interpreting interlocutors’ meaning and hearers’ feedback. The analysis of ‘nonverbal cues in authentic conversation’ has therefore received due attention from analysts. Corpus pragmatics, combining the key methodologies of corpus linguistics and pragmatics, is a relative newcomer. Inquiries adopting this approach ground themselves in data from natural conversation instead of self-constructed examples. However, analyses of pragmatic functions in spoken corpora have not included quantitative treatment of the interplay between gesture and speech in human communication, largely due to the lack of adequately annotated corpora. Fortunately, the multimodal corpus approach offers a better channel for representing information and contextual cues in natural interaction, and makes quantitative studies possible. Therefore, this article attempts to provide a general introduction to how pragmatic studies can benefit from adopting the multimodal corpus approach, and what specific methods are exploited in such studies. We adopt the name ‘multimodal corpus pragmatics’ to refer to this inquiry. By analyzing speech acts in situated discourse, it aims to present the practice in which pragmatic functions can be further investigated with gestures and prosody taken into account. All these elements, fundamentally speaking, should be included in studies of authentic communication.
2 Preliminary Remarks

From the disciplinary perspective, multimodal corpus pragmatics is a branch of the multimodal corpus linguistics advocated by Baldry and Thibault (2006). As discussed in the previous paragraphs, multimodal studies of communication can be traced back to the video recording methods adopted in Discourse Analysis as a pioneering attempt. Though not using the word ‘multimodality’, some pioneering researchers at a very early stage already made good use of audio and video technologies to record speakers’ and hearers’ face-to-face interaction (Kendon, 1967, 1972; Scheflen et al., 1970), drawing particular attention to gaze, facial expression, and the movement of body parts. As an integrated and increasingly enriched paradigm, multimodal study consists of several different approaches and fields, including: (1) multimodal discourse analysis rooted in semiotics, (2) multimodal corpus-based study, and (3) multimodal study in neuroscience, human–computer interaction (HCI), and learning science (Huang and Zhang, 2019). Today’s researchers may vary in their interpretation of the word ‘multimodality’,2 but all can agree on the following: verbalization is not the only source of meaning in face-to-face interaction. In other words, human interaction is multimodal in nature. Speakers usually make the most of all available resources to convey their intentions. Hearers likewise comprehend meaning from a wide range of accessible cues or resources. Meanwhile, the scope of ‘corpus’ has been greatly extended. Print-based texts combined with different layouts of pictures constitute a ‘multimodal text corpus’. Some researchers claim that a multimodal corpus is a collection of audio and video data that are multimodally annotated and ready for retrieval in a multimodal way.
In this article, we adopt the latter view, that is, a multimodal corpus is ‘an annotated collection of coordinated content on communication channels, including speech, gaze, hand gesture and body language, and is generally based on recorded human behavior’ (Foster and Oberlander, 2007, pp. 307–8). Given the rich information multimodal corpora can provide, an increasing number of researchers believe that ‘multimodal corpora are an important resource for studying and analyzing the principles of human communication’ (Fanelli et al., 2010). Though multimodal corpora are still in their infancy, and few large-scale corpora have been published or made commercially available due to the high cost of construction and copyright restrictions, a range of multimodal corpora of different scales, registers, and languages have been developed (Knight, 2011, pp. 5–8).3 These corpora have been used for a variety of research purposes, including language perception and production and HCI. Examples include: the UCLA Library Broadcast NewsScape, for news programs from the USA and around the world; the AMI Meeting Corpus, for developing meeting browsing technology; the Multimodal Instruction Based Learning Corpus, for teaching and learning discourse; and the SmartKom Multimodal Corpus, which is HCI based. Currently, the largest-scale Chinese multimodal corpus is the one affiliated to the Spoken Chinese Corpus of Situated Discourse in Beijing Area (SCCSD BJ-500) (Gu, 2002), which now contains several subordinate branch corpora, including the Children Language Development Corpus, the Language Aging Corpus, and the Court and Criminal Investigation Discourse Corpus, among others.

3 Rationale for Multimodal Corpus Pragmatics

Multimodal corpus pragmatics is an integrated field of inquiry that draws on both corpus linguistics and multimodal studies.
This section discusses what a multimodal corpus can provide for pragmatic studies, how context is handled, and some issues in building such a corpus.

3.1 What can a multimodal corpus provide?

By using the technologies facilitated by the multimodal corpus, researchers can explore many issues of language use with resources far richer than traditional monomodal corpora can provide. This approach can be characterized as applying multimodal corpus linguistics methods to pragmatic problems. In this sense, multimodal corpus pragmatics investigates, by means of multimodal corpora, how different symbols are integrated by language users in a certain context to make meaning. Information provided by a corpus is inevitably ‘partial’, as it is impossible to include everything in one single dataset. The methodological and practical processes of recording and documenting natural language are selective; therefore the data are always ‘incomplete’ (Knight, 2011, p. 17). This is especially true for multimodal corpora. We should note that even though a multimodal corpus provides linguists with rich information in a specific time and place for analysis, including speakers’ utterance content, prosodic features, gestures, spatial layout, and background sound, the corpus can only reinstate partial elements of the reality of natural interaction. Gu (2006) describes this as a partial reinstatement of ‘Total Saturated Signification’, which refers to ‘the total of meanings constructed out of the face-to-face interaction with naked senses and embodied messages in Goffman's sense by the acting co-present individuals’ (Gu, 2009, p. 436). Realistically, the current technique for studying total saturated signification in real-life face-to-face interaction is audio and video recording together with the relevant annotated multimodal corpus.
But we should always bear in mind that audio and video recording, however realistic, is a compromise in the study of Gu’s ‘Total Saturated Signification’. Information from other human modalities, e.g. olfactory and haptic, cannot be captured by today’s recording devices (Gu, 2009, p. 436). But with advancing techniques, it is reasonable to believe that more types of data can be collected in the future. Understandably, the benefit of analyzing audio and video data for pragmatic functions in conversation is obvious and substantial: not only in terms of discovering new patterns between the different channels of conveying meaning, but also in terms of adding to the description and identification of patterns that have been derived from textual analysis of transcripts (Adolphs, 2008, p. 117).

3.2 Context as an indispensable term in multimodal corpus

Pragmatics by nature ‘does not assume a one-to-one relationship between language form and message function. Rather, it attempts to account for the reason and processes behind the phenomenon that certain linguistic forms might be interpreted as carrying a particular function in a particular context’ (Knight and Adolphs, 2008, p. 176). Context is therefore an indispensable issue in pragmatic studies and in the corpus approach. The development of the multimodal corpus gives us better access to contextual information in human communication. In studies based on text corpora, even if linguists take discourse context into consideration, much information (such as spatial layout, activity types, background sound, and the relationship between speakers and hearers) is unavailable for pragmatic studies unless the metadata of the corpus are richly provided. Meanwhile, some pragmatic studies have relied mainly on invented contexts to illustrate the interdependency between speech act function and its place in the wider context of use (Adolphs, 2008, p. 31).
This is problematic, or at least inadequate, because a better understanding of discourse context is vital for interpreting participants’ utterance content, prosodic features, gestures, and the whole integrated meaning. If the relative inaccessibility of multimodal corpora is the main reason for this insufficiency, then the now widespread availability of multimodal datasets provides an ideal approach to further exploration of discourse context. Various approaches and theories have been developed to account for the patterns of context in discourse. Hymes (1972), for example, proposes a distinction between speech situation, speech event, and speech act. The speech act encodes social norms in linguistic form. The interpretation of speech acts is thus greatly dependent on an analysis of the sequential organization of discourse and the social roles of the speakers in the context (Adolphs, 2008, p. 32). Similarly, Gu (2006) argues that multimodal analysis in pragmatics should adopt a top-down strategy, rather than take a sentence or utterance as the entry point to the illocutionary act. In Gu’s model, if we regard a social situation per se as the unit of analysis, then a social situation is a configuration of activity types, which in turn are configurations of tasks or episodes, and the latter are configurations of participants’ individual behavior. Furthermore, individual behavior can be segmented into a configuration of talking and doing, whose lowest unit is the speech act, regarded as the ‘minimum unit in human communication’ (Searle, 2001[1969], p. 21). In the author’s study, therefore, an illocutionary act is defined as a property exemplified by the speaker in producing an illocutionary act-token in a given social situation. This segmentation of speech structure can be presented as an analytic scheme of a multimodal text (audio and video recording), as shown in Fig. 1.

Fig. 1. An analytic scheme of a multimodal text (Gu, 2006)

3.3 Multimodal corpus construction for pragmatic studies

After years of development, many issues in multimodal corpus design and building have been discussed. Kipp et al. (2009), Knight (2011), and Huang (2015) have introduced general issues concerning multimodal corpora of natural discourse, including data recording, processing, segmentation, and annotation, from both theoretical and practical perspectives. They also introduce existing problems and a potential research agenda. Apart from some common properties shared by text corpora and multimodal corpora, the mark-up schemes and processes in a multimodal corpus are relatively customized. This is largely because annotating a corpus depends on the purpose of the corpus construction, which is claimed to be ‘hypothesis-driven’ (Rayson, 2003, p. 1). This is especially true for the study of pragmatics. For instance, the annotation scheme in a study of speech acts is quite different from that in a study of backchanneling phenomena. Apparently, the segmentation, annotation, and representation of multimodal data differ considerably from those of the traditional text corpus. This is partly because today’s more comprehensive understanding of ‘meaning-making’ requires us to work beyond linguistic forms. We largely agree that ‘automatic parsing of textual syntagmas on the basis of formal criteria alone is not a panacea for solving all of the problems’ (Baldry and Thibault, 2006, p. 181). In this sense, more consideration of discourse infrastructure and corresponding novel segmentation and annotation schemes should be developed. For an orthographic corpus, conventional concordance software, such as ‘Wordsmith’ and ‘Textsmith’, can clearly represent the frequency counts of given linguistic forms in a selected corpus.
But such software is not suited to the representation of multimodal data, let alone pragmatic meaning. As a matter of fact, few corpora are semantically or pragmatically marked up, and many pragmatic meanings come without any formal ‘hook’. This ‘form-function mismatch’ in most pragmatic phenomena means that automatic assignment of tags will often lack precision, and manual implementation is therefore unavoidable (Rühlemann and Aijmer, 2014, p. 11). In speech act studies, this problem is predominant because there are no constant and explicit lexico-grammatical forms associated with speech act types. Researchers therefore cannot annotate, represent, or search for a certain type of speech act by means of one or a few surface forms. In most cases, speech acts have to be tagged manually through line-by-line analysis, and the corpus has to be tailored. On the other hand, searching a multimodal corpus also differs from searching a text corpus. In a text corpus, a researcher might simply type a certain word or expression into the search bar, whereas in a multimodal corpus she or he has to search manually for all the possible forms of multimodal information. Such concordances are also limited to presenting transcripts and text files, rather than multimodal datasets (Knight, 2011, p. 48). In sum, unlike corpus-based lexico-grammatical studies, which have a relatively universal data representation format, pragmatic studies usually customize their representation. Orthographic corpora are often equipped with similar representation formats, whereas multimodal corpora do not yet have a widely accepted format or criterion for representing different multimodal datasets. In this sense, a set of represented multimodal data is not necessarily suitable for other multimodal corpus-based pragmatic studies, since segmentation and annotation should be tailored to different research purposes. Therefore, just as Knight (2011, p. 49) observes, the requirements for representing multimodal corpora for accurate and appropriate analysis and re-use are still being worked out by researchers in this field. Some attempts, such as the Digital Replay System (Knight et al., 2010), have been made to provide a possible solution (Knight, 2011, pp. 187–93). In this sense, although many multimodal corpus tools exist, e.g. ‘Anvil’, ‘Elan’, and ‘MMAV’, their integrated search, concordance, and representation functions still need enhancing. The general pattern in corpus linguistics is ‘highly quantitative’ treatment of the data represented in a corpus. By using corpora, researchers are concerned with ‘the patterns of language, determining what is typical and unusual in given circumstances’ (Conrad, 2002, p. 77). With the use of statistics, researchers gain clear exposure to regular, frequent, and more generalizable patterns of meaning in use. This is also true for the multimodal corpus approach. The only difference in statistical and quantitative terms between a traditional text or audio corpus and a multimodal corpus is that more information enters the calculation. Sometimes the algorithm is genuinely complicated and the presentation pattern goes beyond the scope of the traditional text corpus. Sinclair (1996) identifies corpus-driven and corpus-based as the two basic approaches in corpus linguistics. The corpus-based approach in particular is now commonly used across a wide range of linguistic fields as well as the humanities and social sciences. For multimodal corpus studies, most research is corpus based. This is not only due to the relative shortage of large-scale multimodal corpora for deep quantitative calculation, but also because in a ‘corpus-driven’ inquiry linguists are informed by the corpus itself, which can lead them in all sorts of directions (Rayson, 2003, p. 1).
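The ‘form-function mismatch’ discussed above can be made concrete with a minimal keyword-in-context (KWIC) sketch. This is our own illustration, not a feature of any tool mentioned in this article: a form-based search finds every occurrence of a surface form, but it cannot retrieve a speech act unless the act happens to contain a fixed lexical hook.

```python
# Minimal keyword-in-context (KWIC) concordance over a plain transcript.
# A form-based search like this retrieves every token of a word, but a
# promise performed without the verb 'promise' would be invisible to it --
# the form-function mismatch that forces manual speech-act tagging.
# The transcript is invented for illustration.

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

transcript = "I promise I will send the report but I cannot promise a date".split()
for left, kw, right in kwic(transcript, "promise"):
    print(f"{left:>25} | {kw} | {right}")
```

Both hits above are found only because the speech act verb happens to be present; an implicit promise such as ‘You will have it by Friday’ would return no hit at all.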
While aiming to promote the general idea of ‘multimodal corpus pragmatics’, this article must be modest in introducing feasible multimodal datasets for the study of live speech acts in natural discourse.

4 A Case Study in Multimodal Corpus Pragmatics: Speech Act

Pragmaticians may draw distinctions between micropragmatics and macropragmatics (Mey, 2001). Today’s pragmatics has developed into a rather broad inquiry into language use. Since most researchers agree on the multimodal nature of human interaction, the multimodal approach could greatly benefit the study of pragmatic issues. To illustrate how multimodal corpus pragmatics operates, this section introduces a study of illocutionary force as a typical sample of the multimodal corpus approach to traditional pragmatics topics.

4.1 Design, method, and data

Since human interaction is multimodal in nature, speech act expressions and the corresponding speech act episodes rely as much on prosody and gesture in creating particular functions as on the actual words and discourse structures being used (Adolphs, 2008, p. 117). Therefore, a probe into the prosodic and gestural analysis of speech act expressions in naturally occurring context adds an important dimension to their functional profiles (Adolphs, 2008, p. 118). On this basis, Huang (2014, 2017a, 2018c) launched a multimodal corpus-based study of live speech acts in situated discourse. Clearly, live speech acts in situated discourse are also multimodal, and much of the information derived from speakers can serve as ‘illocutionary force indicating devices’ (IFIDs). Additionally, Huang’s study brings emotion into the analysis of live illocutionary acts, because emotion is usually related to the self-interest of discourse participants, and emotion further influences the performance of prosodic features and gestures.
Therefore, it is quite necessary to investigate how multimodal information, including but not limited to prosodic features and non-verbal acts, is affected by speakers’ occurrent emotions, and how the triadic interaction of emotion, prosody, and non-verbal acts produces a variety of live illocutionary forces. The key issue for a corpus-based multimodal study of pragmatic questions is the formation of appropriate units and the various intermediate levels of analysis relevant to the analyst’s purposes (Baldry and Thibault, 2006, p. 168). Such pragmatic studies characteristically go beyond, or at least are not limited to, traditional syntactic units, e.g. phrases or clauses. This means that multimodal corpus-based studies require researchers to determine the basic analytic units and to have a clear picture of the framework structure in mind ahead of time. This analytic scheme includes consideration of the influence of context upon speech acts, and it integrates the analysis of multiple resources of meaning with the global context. In the aforementioned project, for instance, the research investigates how speakers use diverse devices to perform speech acts and thereby express live illocutionary forces in unprepared natural discourse. Austin (1962, pp. 73–6) once mentioned that ‘tone of voice, cadence, emphasis and accompanying gestures’ are all IFIDs. But Austin and the researchers who followed did not systematically explore how these devices (including various gestures: body pose, facial expression, and head movement) interact with the speakers’ discourse content and prosody, or how these devices jointly convey illocutionary forces.
Although text corpus-based studies of speech acts have received a great deal of attention from linguists, many agree that traditional text can provide only very limited cues for exploring the performance of live speech acts, because the multimodal cues in instances of speech acts go unrecognized even though Searle clearly noted them as IFIDs. The collection, processing, and compiling of multimodal data are fundamental to further investigation of live illocutionary forces. Huang (2018a) reports how he built a multimodal corpus of speech acts in Chinese situated discourse. The scheme design, working definitions, annotation evaluation, and data representation of such a multimodal corpus are elaborated there, in the hope of sparking further investigation of speech acts using such a corpus. The mini multimodal corpus constructed by the author contains 134 tokens of illocutionary force (corresponding to 20 different speech act types) classified into four groups (neutral, beneficial, harmful, and counterproductive).4 All 134 tokens of illocutionary force in the four groups are annotated in 13 tiers. The annotation tags and their connotations are as follows5: ‘Performance Unit of Illocutionary Force’ refers to a specific instance of a certain speech act type performed by a speaker in a specific situation. ‘Activity Type’6 refers to any culturally recognized activity, e.g. teaching, a job interview, or a party. ‘Turn-taking’ is one of the important units for analyzing utterances in this study. ‘Background Emotion’ refers to emotional states closely related to the physical and mental condition of the speaker. ‘Primary Emotion’ is the most fundamental type of human emotion, shared by all peoples. In this study, seven types of primary emotion are listed, i.e. joy, anger, sadness, fear, disgust, surprise, and anxiety.
‘Social Emotion’ is socially learned emotion, and it can be subcategorized into: (1) positive social emotions, (2) negative social emotions, and (3) neutral social emotions (Gu, 2013). ‘Intonation Group’ in this study is segmented by ‘external criteria’, e.g. phonetic cues such as pauses. ‘Prosodic Pattern’ refers to the pitch, length, and loudness of the sound in a speech act. ‘Other Prosodic Features’ includes information on stress, pause, voice quality, and other para-linguistic cues. ‘Tasking-Performance’ refers to the specific task undertaken by the speaker when performing a speech act. ‘Non-verbal Act’ refers to gestures, eye contact, facial expression, head and hand movement, body posture, etc. ‘Intentional State’ refers to the speaker’s attitude, belief, hope, or requirement when performing a speech act. ‘Interdependency’ covers three aspects: ‘forward-and-backward interdependency’, ‘illocution-and-reality interdependency’, and ‘doing-and-talking interdependency’. A pilot annotation with the software ‘Elan’ and ‘Praat’, and data validation by laymen, were then implemented. The pilot annotation provides the model for the later large-scale annotation of the whole corpus. All data and annotations in this corpus were evaluated by linguistic experts for reliability and validity; the results show the data are reliable and valid for further analysis (Huang, 2018a). Figure 2 shows an example of a file annotated with ‘Elan’ and ‘Praat’, an instance of the speech act ‘complain’ (bàoyuàn) in Chinese (Huang, 2018a).

Fig. 2. Screenshot of ‘Elan’ and ‘Praat’ annotation

The tokens of the same speech act type are grouped into an independent folder named after that speech act, e.g. ‘Complaining’, and further grouped into the different classifications.
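The 13-tier scheme above can be pictured as one record per illocutionary token. The following sketch is purely illustrative: the keys follow the tier names of the scheme, but every value is invented by us, not taken from Huang’s corpus.

```python
# One illocutionary-force token as a record under the 13-tier scheme
# described above. Tier names follow the scheme; all values are invented
# for illustration and are NOT data from the actual corpus.

PRIMARY_EMOTIONS = {"joy", "anger", "sadness", "fear", "disgust", "surprise", "anxiety"}

token = {
    "performance_unit": "complain-007",      # instance of a speech act type
    "speech_act_type": "Complain",
    "group": "harmful",                      # neutral/beneficial/harmful/counterproductive
    "activity_type": "family dinner",
    "turn_taking": "turn 12, speaker A",
    "background_emotion": "tired",
    "primary_emotion": "anger",              # one of the seven primary emotions
    "social_emotion": "negative",
    "intonation_group": ["IG1", "IG2"],
    "prosodic_pattern": {"pitch": "high", "length": "short", "loudness": "loud"},
    "other_prosodic_features": ["stress", "pause"],
    "tasking_performance": "recounting the day",
    "non_verbal_act": ["head-shaking", "frown"],
    "intentional_state": "wants hearer to acknowledge the grievance",
    "interdependency": {
        "forward_and_backward": "follows hearer's question",
        "illocution_and_reality": "refers to the here-and-now setting",
        "doing_and_talking": "talking while clearing the table",
    },
}

# A token is well-formed only if its primary emotion is one of the seven listed.
assert token["primary_emotion"] in PRIMARY_EMOTIONS
```

Representing each token this uniformly is what later makes frequency counts and cross-group comparison straightforward.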
Table 1 shows the distribution of the 134 instances of 20 speech act types in this mini-corpus (Huang, 2018a).

Table 1. Grouping of 134 instances of speech acts

Neutral          Beneficial          Harmful          Counterproductive
Explain 20       Be satisfied 10     Grouse 14        Be helpless 7
Comment 16       Praise 7            Complain 11      Be disappointed 4
Judge 8          Urge 5              Criticize 4      Feel aggrieved 4
Request 6        Assume 3            Worry 3          Regret 2
Promise 5        Invite 1            Fear 3           Be surprised 1

After the multimodal cues are annotated in the ‘Elan’ file, researchers can export the annotated data to a ‘Microsoft Excel’ table by using the retrieval function of ‘Elan’, in preparation for further statistical analysis. Using the data annotated with ‘Elan’ and ‘Praat’, we can analyze the sample tokens of the 20 types of illocutionary force in depth. Meanwhile, the retrieval function in ‘Elan’ enables us to concordance the transcribed speech act verbs or other words, synchronized with the audio and video information.
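The Elan-to-spreadsheet step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard ELAN .eaf XML layout (TIME_ORDER, TIER, ALIGNABLE_ANNOTATION); the tiny embedded document is a hand-made toy, not material from the corpus, and a real export would read an .eaf file from disk.

```python
# Sketch of flattening ELAN annotations into tabular rows for statistics,
# in the spirit of the Elan-to-Excel export described above. Assumes the
# standard .eaf XML layout; the toy document below is invented.
import csv
import io
import xml.etree.ElementTree as ET

TOY_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
  </TIME_ORDER>
  <TIER TIER_ID="Primary Emotion">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>anger</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="Non-verbal Act">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>head-shaking</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

def eaf_to_rows(eaf_xml):
    """Flatten every tier's annotations into (tier, start_ms, end_ms, value)."""
    root = ET.fromstring(eaf_xml)
    times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((tier.get("TIER_ID"),
                         times[ann.get("TIME_SLOT_REF1")],
                         times[ann.get("TIME_SLOT_REF2")],
                         ann.findtext("ANNOTATION_VALUE")))
    return rows

# Write the flattened rows as CSV, a stand-in for the Excel export step.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["tier", "start_ms", "end_ms", "value"])
writer.writerows(eaf_to_rows(TOY_EAF))
```

Once in this tabular form, the annotations from all tiers can be cross-tabulated and fed to statistical software.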
In ‘Elan’, the corresponding annotation information in all tiers can be presented in concordance lines. These search results enable us to further explain the interaction among speech act verbs, syntactic structures, prosodic features, and gestures when a speech act is performed. Meanwhile, ‘Elan’ also supports operations over multiple files. This is very important because comparison between different speech acts is necessary for meaningful research.

4.2 Discussion and conclusion

Through the concordance and data representation by ‘Elan’ and ‘Praat’ in the annotated tiers of performance units, we can analyze the forms and functions of the different multimodal cues embodied in these speech acts along the dimensions of language structure, prosodic features, and gestures.7 First, based on both Austin’s observation on the accompaniments of the utterance when a speaker performs an illocutionary act and the cues reported by the multimodal corpus, we can conclude that IFIDs range in scope from language structures and prosodic features to gestures. More specifically, explicit speech act verbs/phrases and certain other syntactic structures are frequently used devices at the level of language structure (though speech act verbs occur at a very low frequency according to the statistics from our multimodal corpus); prosodic pattern (including pitch, speech rate, and intonation contour), occurrence of pause, and other prosodic features (such as voice quality, interjections, and other para-linguistic features) also play an important role in the performance of illocutionary force; and some gestures are accessible to all classifications of illocutionary force while others are exclusive to certain groups. Figure 3 presents the resource system of potential IFIDs in Chinese speech acts.

Fig. 3. Potential IFIDs in Chinese speech acts

In real-life discourse, speakers select certain devices from this system to perform an illocutionary act, depending on the social situation, pragmatic intention, and other influencing factors. Listeners then work out the illocutionary force by analyzing the different IFIDs in the related social situation. Secondly, by retrieving the data annotated in the different tiers of the ‘Praat’ and ‘Elan’ files for each token of illocutionary force, we obtained the frequencies of all recorded multimodal cues and the different types of emotion, and then subjected them to Factor Analysis. This step is one of the core procedures in multimodal corpus linguistics, since the researcher has to present the actual data from which rules can be derived. The statistics of this study indicate that the regular patterns of emergence and concurrence of emotions and multimodal cues vary across the four groups of illocutionary force. They also show distinctive properties in the interaction of emotional states, prosodic features, and gestures in the neutral, beneficial, harmful, and counterproductive groups. Certain emotions are highly correlated with some types of illocutionary force, and these emotions determine the configuration of speech content and of prosodic and gestural features. More specifically, in the neutral group, prosodic features do not show any distinctive evidence, although various pitch and intonation patterns do appear, and soft sounds and silence are correlated with negative social emotion.
Frequency of hand movement in this group is higher than those in the other three groups: in the beneficial group, high pitch, stress sound, and laughter/smile more frequently appear together with positive emotion; in the harmful group, pitch, speech rate, intonation, and stress in prosody and eye-rolling, head-shaking, and weeping in gestures are the most accessible cues to reflect the negative emotion; in the counterproductive group, in addition to speech rate, intonation, pitch, and stress, other cues such as soft sounds, sighs, and head-movements frequently appear with the negative emotion. These statistics to a large extent show how emotional states, prosodic features, and non-verbal acts interact and correlate with each other to produce a variety of live illocutionary forces, and discover the mechanism of the emergence of different IFIDs. Thirdly, the study deduces the possible influence factors for the performance of illocutionary acts in situated discourse through the observation on the live data in the multimodal corpus. The possible influence factors of live speech acts and IFIDs include: personal style, expression capability, pragmatic intention, interdependency relation, social situation, cultural conventions, etc. Personal style refers to the individuals’ introverted or extroverted, simple or sophisticated personalities, which can influence different expression. Expression capability refers to the individual competence of utilizing diverse resources in language structures, rhetoric skills, and other non-verbal expression. Pragmatic intention means whether and how the speaker would like the listener to understand the real intention or emotion of the interaction. 
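The frequency step that feeds the Factor Analysis can be sketched in a few lines. This is an illustrative simplification, not the study's actual data or code: the token records, group names, and cue labels below are invented, and only the raw concurrence counting is shown.

```python
from collections import Counter

# Invented, simplified annotation records: one per illocutionary-force
# token, giving its group, its emotion label, and the cues observed.
tokens = [
    {"group": "beneficial", "emotion": "positive", "cues": ["high-pitch", "laughter"]},
    {"group": "beneficial", "emotion": "positive", "cues": ["laughter", "stress"]},
    {"group": "harmful",    "emotion": "negative", "cues": ["eye-rolling", "stress"]},
    {"group": "harmful",    "emotion": "negative", "cues": ["head-shaking"]},
]

# Count each (group, emotion, cue) concurrence -- the raw frequency
# table over which a Factor Analysis could subsequently be run.
concurrence = Counter()
for t in tokens:
    for cue in t["cues"]:
        concurrence[(t["group"], t["emotion"], cue)] += 1

print(concurrence[("beneficial", "positive", "laughter")])  # → 2
```

With real corpus data, each row of such a table (cues per token) becomes an observation vector, and the factor model then looks for latent dimensions, such as a shared emotional state, behind the co-varying cues.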
Interdependency relations include three aspects: ‘forward-and-backward interdependency’ (the relation between the speech act and the former/later utterance), ‘illocution-and-reality interdependency’ (the relation between the speech act and what is happening at the here-and-now behavior setting, beyond the here-and-now, or both), and ‘doing-and-talking interdependency’. The social situation refers to the different situations, activity types, or surroundings in which live speech acts are located. Cultural conventions refer to the manner of expression: in certain cultures, for example, direct and explicit expression is more appreciated than indirect and implicit expression, and vice versa in other cultures. The performance of an illocutionary act in situated discourse is thus influenced by many different factors in a collaborative way. From the previous discussion, we can conclude that the live illocutionary act in situated discourse is rooted in the whole multimodal interaction of speakers, and thus that illocutionary force is multimodal in nature. IFIDs range in scope from language structures and prosodic features to non-verbal gestures. All these resources work collaboratively to help speakers perform illocutionary acts. Meanwhile, the concurrence pattern of IFIDs is very complicated, but generally speaking it can be reduced to two patterns: (1) one or several IFIDs dominate in the performance and determine the type of a certain illocutionary force; and (2) several IFIDs collaborate in the performance without a dominant device. The major findings discussed above further extend the scope of Speech Act Theory and develop the concept of the IFID by using a novel methodology and new linguistic data, though some limitations remain unresolved, including the limited amount and scope of speech acts, the neglect of perlocutionary acts, etc.
Generally speaking, the previous discussion offers an example of how a multimodal corpus can provide a brand-new insight into the nature of language use. Previously, due to methodological constraints, most pragmatic studies were conducted on the basis of transcriptions of vocal-auditory speech. This kind of orthographic transcription undoubtedly filters out much information that is very important in real-life face-to-face conversation. Without a multimodal approach, some cues for speech act expression beyond speech act verbs and discourse content cannot be systematically found, and the interaction rules rooted in situated discourse cannot be completely presented. In this sense, multimodal corpus pragmatics leads us to rethink the existing conceptualization of pragmatics: the dominant idea of language use as mere verbal communication should be reconsidered. In multimodal corpus-based studies on speech acts, the speaker/hearer is no longer just a rational meaning communicator, as in the traditional conceptualization of pragmatics, but a ‘whole man’ in Firth’s sense (1957, p. 19), with emotion that can be embodied in speech, prosody, and gesture. 5 Future Agenda on Multimodal Corpus Pragmatics The integration of the multimodal corpus approach into pragmatics has been increasingly recognized today. This trend is reflected in the growing number of multimodal corpus-based studies presented at pragmatics conferences; recent panels and keynote speeches at the ‘International Pragmatics Conference’, for example, have featured a growing number of multimodal corpus-based studies of pragmatic issues. Despite this array of relevant studies, there remain many challenges and opportunities in today’s multimodal corpus pragmatics: most corpora are relatively small in size and not accessible to other researchers, and there are still no publicly available large-scale multimodal corpora.
This is partly because building multimodal corpora is painstaking and costly in both time and money. It takes a very long time to mark up all the gestural instances and validate the annotation across multiple annotators before a final version of the annotated corpus is completed. Besides, there are no widely accepted or agreed coding schemes, and the segmentation and annotation of video data with prosodic and gestural information for pragmatic study purposes vary from one project to another, which makes each multimodal corpus unique and exclusive to its own study. In pragmatics, the annotation tends to be research-specific rather than generally shared, and corpus design, analytic frameworks, and many other aspects differ. Ideally, a coding scheme should be developed in such a way that it can be shared across different communities, all likely to have different analytic needs. As things stand, however, a multimodal corpus for pragmatic study is balanced to achieve the aims of a particular linguistic enquiry, and might be quite inadequate for other users with different research goals. Nevertheless, we can ‘use corpora in full awareness of their possible shortcomings’ (Sinclair, 2008, p. 30) ‘because there exists no better alternative resource for the analysis of real-life language-in-use than a corpus offers’ (Knight, 2011, p. 18). The need for an integrated approach and the tools necessary for the representation of data are still a work in progress. Though many multimodal processing tools have been developed, including multimodal annotators, concordancers, automatic parsing, etc., a more integrated tool remains elusive. Based on the assumption that multimodal concordancers can reveal recurrent patterns that constitute important building blocks in the construction of meaning (Sinclair, 1991), Baldry and Michele (2005) developed the Multimodal Corpus Authoring (MCA) system to advance multimodal concordancing.
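Validating annotation across multiple annotators, as mentioned above, is usually quantified with a chance-corrected agreement coefficient; Cohen's kappa is a common choice for two annotators. The following sketch uses invented gesture labels purely for illustration and is not tied to any particular corpus project.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels over the
    same items: (observed agreement - chance agreement) / (1 - chance)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same
    # category if each labeled independently at their own base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented gesture labels from two annotators over the same ten clips.
a = ["nod", "nod", "shake", "nod", "point", "nod", "shake", "nod", "nod", "point"]
b = ["nod", "shake", "shake", "nod", "point", "nod", "shake", "point", "nod", "point"]
print(round(cohens_kappa(a, b), 2))  # → 0.69
```

Reporting such a coefficient alongside the coding scheme would make the resulting annotations more comparable across projects, which speaks directly to the sharing problem raised above.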
Similarly, Kay O'Halloran’s group developed the Multimodal Video Analysis (MMVA) software for Multimodal Discourse Analysis in the framework of Systemic Functional Linguistics. These attempts enhance the explanatory power of and analytic access to multimodal corpora, but more data-processing and statistical tools are still needed. More traditional research topics in pragmatics could also be reinvestigated within the framework of the multimodal corpus approach; only through rich and large-scale studies can this new inquiry flourish. The novel data and observation methods can prompt the extension, if not revision, of classic theories in pragmatics, including conversational implicature, presupposition, (im)politeness, identity construction, etc. Taking Brown and Levinson's verbal politeness strategies as a starting point, for example, Pennock-Speck and de Saz-Rubio (2013) found that ‘politeness’ strategies in charity ads on British television are constructed through both paralinguistic and extralinguistic modes of communication. Forceville (2014), in turn, sketches how discussions of visual and multimodal discourse can be embedded in Sperber and Wilson's Relevance Theory, which extends the applicability of RT: although the theory claims to be ready for all forms of communication, it has traditionally been applied to spoken verbal varieties. Meanwhile, discourse markers (DMs) have drawn much attention from pragmaticians. DMs are believed to be multifunctional and to play communicative roles at different dimensions simultaneously. However, most studies on DMs predominantly investigate their use and function in text-based frameworks, and therefore neglect the coexistence of DMs and speakers' gestures (Hata, 2016).
Similarly, Knight (2011) provides a systematic analysis of the pragmatic functions of backchanneling phenomena in English conversation based on the Nottingham multimodal corpus (NMMC), which extends traditional studies on the relationship between verbalization and gesture in people’s active listening. Multimodal corpus pragmatics can also be applied to many different fields, e.g. geronto-linguistics. Bolly and Boutet (2018) initiated a project, based on the multimodal CorpAGEst corpus, to study the pragmatic competence of old people by exploring their use of verbal and non-verbal pragmatic markers in real-life conversations, which is both methodologically innovative and socially significant for pragmatic studies. Similarly, Huang (2017b) initiated a multimodal corpus-based study on the speech acts of elderly patients with Alzheimer's disease (AD). The preliminary observation is that the frequency of certain types of speech acts changes significantly; for example, emotion-expressive speech acts drop in moderate-to-severe AD patients. With the decline of linguistic competence, AD seniors are inclined to ‘compensate’ in their pragmatic communication with other resources, including gestures, gaze, body pose, etc. This observation further justifies the adoption of multimodal data and of Perkins' emergentist model for clinical pragmatics (Perkins, 2007). With multimodal data, we can even conduct a multimodal investigation of emotion expression in the field of intercultural pragmatics, and provide more convincing evidence for investigating issues in both pragmatics and rhetoric (Huang, 2018b). 6 Conclusion Today's neuropsychology informs us that humans are born multimodal: we combine multiple modalities in input and output when experiencing the world. That is to say, all human interaction is multimodal in nature. In this sense, more and more scholars agree that linguistic research should take multimodality as its benchmark.
Conversely, using a purely textual corpus and traditional analysis can only lead to a very incomplete description of how meaning is conveyed in natural conversation. Many linguistic theories and interaction rules could be enriched or even revised if fresh methodology and new kinds of data are provided. This is how today’s pragmatics reflects on itself with the introduction of both corpus methods and multimodal data. In this sense, the traditional dichotomy between verbal and nonverbal information in pragmatics and other linguistic branches seems inadequate if the goal of linguistic theory is to account for genuine human communication. The relationship between verbal and nonverbal cues in relaying pragmatic meaning is still very intractable, but the multimodal corpus approach can provide a novel perspective on this intriguing issue. Although we have to admit that multimodal corpus pragmatics, or the multimodal approach, is still in its infancy, and that more theoretical and practical investigation is needed, the multimodal corpus approach bridges the linguistic gap between the verbal and the visual, which had been neglected for far too long. Although the illustration in this article is made through a pilot study on speech acts, we should bear in mind that multimodal corpus pragmatics is far more than speech act study. Pragmatics is to be seen not as a mere component of language, but rather as ‘a general cognitive, social, and cultural perspective’ (Verschueren, 1999, p. 7). In this sense, it is unwarranted to restrict ‘pragmatics’ to specific phenomena (e.g. speech acts). The benefit of the multimodal corpus approach to pragmatics is far more than a better or more comprehensive understanding of meaning in use in speech; it also promotes a re-evaluation of ‘claims and concepts that originate in more philosophical traditions where the conceptualization of pragmatic functions has arguably received most attention’ (Knight and Adolphs, 2008, p. 175).
We also believe that such studies are of great significance for Artificial Intelligence and HCI systems, since speech acts (or dialogue acts) are believed to be the minimal units of human communication and, as argued here, are multimodal in nature. Meanwhile, we also claim that the multimodal corpus, as a novel and promising approach in linguistics, can be applied to many sub-disciplines of language study, such as Conversation Analysis, Interactional Linguistics, etc. In summary, by including nonverbal resources in the observational data, the scope and methods of pragmatic studies will be enriched, and classic pragmatic theories will be enhanced and developed further toward authentic face-to-face interaction. This exploratory study hopes to inspire further theoretical discussion and case studies on pragmatic questions with the multimodal corpus approach, which will carry forward a novel domain of inquiry, that is, multimodal corpus pragmatics. Declaration The research use of the multimodal data is permitted by the speakers. Funding This paper is partially funded by the Tongji University World-class Discipline Development Project (No. 1100141406). References
Adolphs, S. (2008). Corpus and Context: Investigating Pragmatic Functions in Spoken Discourse. Amsterdam: John Benjamins Publishing Company.
Austin, J. L. (1962). How to Do Things with Words. London: Oxford University Press.
Baldry, A., Michele, B. et al. (2005). The MCA project: concepts and tools in multimodal corpus linguistics. In Carlsson, A. M., Løvland, L. and Malmgren, G. (eds), Multimodality: Text, Culture and Use. Proceedings of the Second International Conference on Multimodality. Kristiansand: Agder University College/Norwegian Academic Press, pp. 79–108.
Baldry, A. and Thibault, P. (2006). Multimodal corpus linguistics. In Thompson, G. and Hunston, S.
(eds), System and Corpus: Exploring Connections. Britain: Equinox Publishing Ltd., pp. 164–83.
Bolly, C. and Boutet, D. (2018). The multimodal CorpAGEst corpus: keeping an eye on pragmatic competence in later life. Corpora, 13(3): 279–317.
Conrad, S. (2002). Corpus linguistic approaches for discourse analysis. Annual Review of Applied Linguistics, 22: 75–95.
Fanelli, G., Gall, J., Romsdorfer, H., Weise, T. and Van Gool, L. (2010). 3D vision technology for capturing multimodal corpora: chances and challenges. In Proceedings of the LREC Workshop on Multimodal Corpora, Mediterranean Conference Centre, Malta, 18 May 2010, pp. 70–3.
Firth, J. R. (1957). Papers in Linguistics 1934–1951. London: Oxford University Press.
Forceville, C. J. (2014). Relevance theory as model for analyzing visual and multimodal communication. In Machin, D. (ed.), Visual Communication. Berlin: Mouton de Gruyter, pp. 51–70.
Foster, M. E. and Oberlander, J. (2007). Corpus-based generation of head and eyebrow motion for an embodied conversational agent. Language Resources and Evaluation, 41(3/4): 305–23.
Gu, Y. (2002). Compiling Chinese spoken corpus: some theoretical issues. In Chinese Academy of Social Sciences (ed.), Globalization and the 21st Century. Beijing: The Social Sciences Publisher, pp. 484–500.
Gu, Y. (2006). Multimodal text analysis: a corpus linguistic approach to situated discourse. Text and Talk, 26(2): 127–67.
Gu, Y. (2009).
From real-life situated discourse to video-stream data-mining. International Journal of Corpus Linguistics, 14(4): 433–66.
Gu, Y. (2013). A conceptual model of Chinese illocution, emotion and prosody. In Tseng, C. Y. (ed.), Human Language Resources and Linguistic Typology. Taibei: Academia Sinica, pp. 309–62.
Hata, K. (2016). On the importance of the multimodal approach to discourse markers: a pragmatic view. International Review of Pragmatics, 8(1): 36–54.
Huang, L. (2014). The study on speech acts from the perspective of situated discourse. Journal of Zhejiang International Studies University, 3: 45–53.
Huang, L. (2015). Corpus 4.0: multimodal corpus building and related research agenda. Journal of PLA University of Foreign Languages, 38(3): 1–7, 48.
Huang, L. (2017a). Speech act theory and multimodal study on language. Journal of Beijing International Studies University, in press.
Huang, L. (2017b). The design and agenda of multimodal corpus-based study of language ageing and related issues: from the perspective of speech act. Speech at The 4th Sino-UK Symposium of Corpus Linguistics, Qingdao, 5 August 2017.
Huang, L. (2018a). Issues on multimodal corpus of Chinese speech acts: a case in multimodal pragmatics. Digital Scholarship in the Humanities, 33(2): 316–26.
Huang, L. (2018b). Establishing multimodal rhetoric: also looking for a link between rhetoric and pragmatics. Contemporary Linguistics, 20(1): 117–32.
Huang, L. (2018c). A Multimodal Corpus-based Study of Illocutionary Force: An Exploration of Multimodal Pragmatics from a New Perspective.
Shanghai: Shanghai Foreign Language Education Press.
Huang, L. and Zhang, D. (2019). A multi-core parallel system: paradigm, approaches and fields in multimodal study. Foreign Language Education, 40(1): 21–26.
Hymes, D. (1972). On communicative competence. In Pride, J. B. and Holmes, J. (eds), Sociolinguistics. Harmondsworth: Penguin, pp. 269–93.
Kendon, A. (1967). Some functions of gaze direction in social interaction. Acta Psychologica, 26: 22–63.
Kendon, A. (1972). Some relationships between body motion and speech. In Seigman, A. and Pope, B. (eds), Studies in Dyadic Communication. Elmsford, NY: Pergamon Press, pp. 177–216.
Kendon, A. (2004). Gesture: Visible Action as Utterance. Cambridge: Cambridge University Press.
Kipp, M., Martin, J.-C., Paggio, P. and Heylen, D. (eds) (2009). Multimodal Corpora: From Models of Natural Interaction to Systems and Applications. Berlin: Springer.
Knight, D. (2009). A Multi-modal Corpus Approach to the Analysis of Backchanneling Behaviour. Ph.D. dissertation, The University of Nottingham.
Knight, D. (2011). Multimodality and Active Listenership: A Corpus Approach. London: Bloomsbury.
Knight, D. and Adolphs, S. (2008). Multi-modal corpus pragmatics: the case of active listenership. In Romero-Trillo, J. (ed.), Pragmatics and Corpus Linguistics: A Mutualistic Entente. Berlin: Mouton de Gruyter, pp. 175–90.
Knight, D.
, Tennent, P., Adolphs, S. and Carter, R. (2010). Developing ubiquitous corpora using the digital replay system (DRS). In Proceedings of the LREC 2010 (Language Resources Evaluation Conference) Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, Giessen, Germany, May 2010, pp. 16–21.
Mey, J. L. (2001). Pragmatics: An Introduction, 2nd edn. Beijing: Foreign Language Teaching and Research Press.
Pennock-Speck, B. and de Saz-Rubio, M. (2013). A multimodal analysis of facework strategies in a corpus of charity ads on British television. Journal of Pragmatics, 49: 38–56.
Perkins, M. R. (2007). Pragmatic Impairment. Cambridge: Cambridge University Press.
Rayson, P. (2003). Matrix: A Statistical Method and Software Tool for Linguistic Analysis Through Corpus Comparison. Ph.D. thesis, Lancaster University.
Rühlemann, C. and Aijmer, K. (2014). Corpus pragmatics: laying the foundations. In Aijmer, K. and Rühlemann, C. (eds), Corpus Pragmatics: A Handbook. Cambridge: Cambridge University Press.
Scheflen, A. E., Kendon, A. and Schaeffer, J. A. (1970). A comparison of videotape and moving picture film in research in human communication. In Berger, M. M. (ed.), Videotape Techniques in Psychiatric Training and Treatment. New York: R. Brunner, Inc.
Searle, J. (2001 [1969]). Speech Acts: An Essay in the Philosophy of Language. Beijing: Foreign Language Teaching and Research Press.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, J.
(1996). The search for units of meaning. TEXTUS, IX: 75–106.
Sinclair, J. (2008). Borrowed ideas. In Gerbig, A. and Mason, O. (eds), Language, People, Numbers - Corpus Linguistics and Society. Amsterdam: Rodopi, pp. 21–42.
Verschueren, J. (1999). Understanding Pragmatics. London: Arnold.
Footnotes
1 In this article, gesture is an umbrella term for all movements of different parts of one’s body, including hand gestures, head movements, facial expressions, body pose, etc. Kendon defines a gesture as a ‘visible action … used as an utterance or as part of an utterance’, an action that has ‘the features of manifest deliberate expressiveness’ (Kendon, 2004, p. 7).
2 The concept of ‘modality’ has three main definitions: (1) the sense organs and the related nervous systems, (2) the semiotic resources for meaning construction, and (3) the way of representing information through some physical media (Huang and Zhang, 2019).
3 An index of multimodal corpora and their specific compositional characteristics, from natural conversation to elicited data, can be found in Knight (2011, pp. 6–7).
4 The four groups of neutral, beneficial, harmful, and counterproductive illocutionary acts are established according to the relation between what speakers talk about and the interests of the discourse participants. This grouping aims to bring emotion into the analysis of live illocutionary acts, since emotion is usually related to the self-interest of discourse participants and further influences the performance of prosodic features and gestures (Gu, 2013).
5 The working definitions and connotations of all 13 tiers are elaborated in detail in Huang (2018a, pp. 320–22).
6 The tier of Activity Type is a solution to the context issue in speech act analysis.
Since a social situation is actually a configuration of different activity types, each containing a series of speech acts, the analysis of activity types provides, in a broad sense, an equivalent ‘context’ review for speech act studies.
7 More detailed discussion relevant to the following content can be found in Huang (2018c), which provides a systematic study of illocutionary force adopting the multimodal corpus method.
© The Author(s) 2020. Published by Oxford University Press on behalf of EADH. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).
Huang, L. (2021). Toward multimodal corpus pragmatics: rationale, case, and agenda. Digital Scholarship in the Humanities, 36(1): 101–14. doi:10.1093/llc/fqz080.