Abstract

A crucial part of the design of mobile user interfaces is their evaluation with users. Although user studies are widely used for measuring the usability of a software system, they are time-consuming and expensive. In this study, we propose a novel approach based on the computational analysis of an app’s visual appearance. By calculating metrics which measure the complexity of the user interface, it should be possible to draw conclusions about the perceived usability. Automating this practice would benefit developers and designers, as it would help to find usability issues early and reliably in the software development life cycle. However, no tool is yet available that automatically quantifies the complexity of a mobile application’s user interface with metrics. We analyze the complexity based on screenshots of the user interface as well as on interaction data, without the need to access the source code of the application. We introduce a set of complexity metrics, which are partially based on existing metrics for the visual appearance of a user interface. In order to evaluate the quality of our metrics, we conducted a user study with several mobile apps. The results show significant correlations between some of the proposed metrics and the perceived usability and workload.

RESEARCH HIGHLIGHTS

- Automatic user interface complexity calculation for mobile applications
- No need to access the source code of the application under test
- Data is collected automatically by a background service
- A multitude of complexity metrics is analyzed
- Metrics: layout, color, typography and consistency
- A computer vision algorithm performs visual analysis of the captured screenshots
- User study using the NASA-TLX and SUS questionnaires
- Result: we found correlations between the calculated complexity metrics and the users’ subjective assessments
1. INTRODUCTION

User interfaces are becoming increasingly important with the rising number of users and applications. At the same time, users are becoming less willing to interact with difficult or uncomfortable interfaces. Therefore, high usability is desired; however, it does not just ‘magically’ appear. To make sure that applications are usable, usability concerns should be actively addressed in the software development process (Nielsen, 1992). Gong and Tarasewich (2004) point out that mobile user interfaces differ from desktop user interfaces and therefore need to be evaluated differently. Nowadays, user interface developers and designers are confronted with the complexity of human–computer interaction (HCI) in many ways. A number of different mobile operating systems, each with its own user interface design principles, recommendations and guidelines, can lead to inconsistent and therefore less user-friendly applications in the vast app universe. Bevan and Curson (1997) argue that usability measurements are beneficial for the following reasons: to predict, ensure and increase the quality of the product; to control and improve the development process; to decide whether a software product is acceptable; and to choose a product from alternatives. Several different ways of measuring the usability of a visual user interface exist, including formal, heuristic and manual testing. However, unlike typical software metrics, some of these evaluation methodologies depend entirely on users and may never be fully automated or quantified. The challenge of calculating such metrics lies in the fact that they are intertwined with user-related factors such as thinking, response time, the speed of typing or of moving the mouse, etc. (Alsmadi and AlKaabi, 2009). According to Henry and Selig (1990), metric analysis should be conducted as early as possible in order to reduce the duration of the software development process and thus increase its efficiency.
Traditionally, user studies are needed to determine the usability of a product. However, user studies for evaluating usability are costly and time consuming. We observe a lack of possibilities to quantify the complexity of mobile user interfaces with metrics without conducting studies with human participants. A system for automatically calculating complexity metrics would make user interfaces quantifiable and thus allow for easily comparing different versions of an application, or different apps among each other (Riegler and Holzmann, 2015). Our goal was to find metrics that measure the complexity of a mobile user interface. This is similar to software complexity metrics, which are widely used in software development. Desktop and web applications have been analyzed regarding their lines of code, number of user interface widgets, etc. Complexity metrics such as the cyclomatic complexity by McCabe (1976), lines of code and the Halstead complexity measures (Halstead, 1977), as well as structure metrics such as the information-flow metric by Henry and Kafura (1981), have been used for computing the complexity of a software project. Like software complexity metrics, UI complexity metrics are mostly concerned with structural/layout aspects (Alemerien and Magel, 2014). UI complexity metrics are of particular interest for mobile devices, as these pose a number of challenges for app developers. According to Zhang and Adipat (2005), such challenges include the mobile context, connectivity, screen size, different screen resolutions, limited processing power and data entry methods. Mobile device manufacturers and operating system providers have been enforcing their own usability rules. Thus, the user interface has a bigger impact on the usability of mobile apps, and design guidelines for desktop applications cannot be directly applied to mobile applications.
It is important to find new, or adapt existing, research methodologies that can measure the usability of mobile applications. According to Tullis and Albert (2013), quantifiable metrics show whether a software product has improved from one version to the next. This helps developers and designers to spot potential issues early in the software development life cycle and to improve their product. Another advantage of measuring the user interface without requiring access to the application’s source code is that any apps can be analyzed and compared. As a result, qualitative statements about apps can be made, such as:

- The user interface of app X leads to longer task completion times than the user interface of app Y.
- App Z has an inconsistent user interface with regard to background and foreground colors.
- App X uses an action bar for important tasks, but app Y displays the same tasks in a list.

Mobile application developers, already facing worldwide competition in the crowded app markets, should know the needs of their customers. Knowing how their app is used, developers can determine the best way to satisfy their customers (Vannieuwenborg, 2012). Especially in a saturated app market, building better apps by quantifying and subsequently improving the user interface can result in a competitive advantage. It is necessary to find metrics of the user interface that users regard as influential on the subjective usability. The metrics must be both relevant and meaningful for measuring the user interface complexity. We developed a prototype for Android with the purpose of calculating the user interface complexity of any mobile application, be it a native, web or hybrid application. We evaluated our prototype with a user study that led us to interesting conclusions. In the following sections, we first present related projects and describe our overall approach.
Afterwards, the proposed metrics for quantifying the user interface complexity of mobile applications are described in detail. Finally, the setup of a user study for evaluating our approach is described, and its results are discussed in detail.

2. RELATED WORK

This section presents related projects concerning the relation between calculated UI complexity and perceived usability, as well as the quantification of visual UI complexity with metrics. Taba et al. (2014) analyze the correlation between UI complexity and user-perceived quality in Android applications, eventually aiming to propose guidelines for UI complexity by mining available Android applications. For this approach, the layout must be declared in an XML file; if the layout is instantiated programmatically, the approach is not applicable. Taba et al. use apktool to extract the content from the APKs, which contain the layout files, the AndroidManifest.xml and further contents packaged together into an Android executable file. The decoded APK files are subsequently explored. The metric calculation is based on the number of inputs, outputs and elements for a given Activity. The perceived usability is obtained from Google Play Store ratings and reviews. Taba et al. observe a relation between UI complexity and user-perceived quality: users tend to grade screens with high user-perceived quality better, which relates to the lower UI complexity of those screens. The apktool used for gathering GUI information is, on the one hand, very useful, especially for automating the process. On the other hand, apktool only considers the static user interface layouts, not UI elements, alert dialogs, etc. that are added to the user interface dynamically. In such cases, apktool fails to collect the relevant data, which distorts the results.
Alemerien and Magel (2014) present a complexity evaluation tool that aims to help developers analyze interface designs for desktop applications at the early stages of the development process using metrics. The authors consider structural complexity metrics important, as user interface quality and project controllability can be enhanced by controlling the interface complexity through measuring the related aspects. Additionally, as the user interface is an essential component of software applications, focusing on making the interaction between human and software as seamless and effortless as possible seems logical. The metrics model consists of five structural interface complexity measures:

- Alignment, which measures the vertical and horizontal alignment of interface objects
- Grouping, which measures the number of objects that have a clear boundary
- Balance, which measures the number and size of objects in each quarter of the screen
- Density, which measures the screen occupation by objects
- Size, which measures the objects’ sizes

The tool is used to determine the layout complexity in order to evaluate the quality of the interface design. GUIEvaluator is written in Visual Basic 2012 and extracts the interface layout information using reflection techniques supported by Visual Basic 2012. Based on a strong positive correlation between GUIEvaluator’s metrics and a study in which participants rated the interface complexity on a Likert scale, Alemerien and Magel conclude that automating the evaluation of user interfaces in early stages is critical. Similarly, Fu et al. (2007) measure the screen complexity of web pages for usability analysis. Size complexity, local density, grouping and alignment are calculated for this purpose. However, the web pages are first translated manually into model screens that only contain the structure of the web page without any content.
The complexity values from the model screens are then compared with the viewers’ judgment of the real screens. Zen and Vanderdonckt (2014) consider more metrics: balance, sequence, equilibrium, unity, symmetry, proportion, economy, density, homogeneity, regularity, rhythm and order. Zen et al. claim that while the contents are important, the ‘look and feel’ is an equally essential aspect of GUI quality, which is impacted by several factors such as, but not limited to, aesthetics, pleasurability and fun. Thus, aesthetics, which we see as part of the visual user interface, is a possible aspect to focus on in order to encourage user–device interactions. Zen et al. introduce a simplifying model of GUI aesthetics that considers these aspects and region-related metrics. Using a systematic approach to GUI aesthetics makes it possible to define a set of metrics to generate GUI recommendations. QUESTIM, the tool proposed by Zen et al., allows the user to load a webpage and draw regions of interest, which are then used as the basis for the user interface complexity calculation. A study concluded that 4 of the 12 metrics were ranked largely similarly by users, indicating that the metric formulas were representative of the users’ perception. These metrics were balance, equilibrium, density and economy. Zen et al. do not take into consideration other potential aesthetic aspects such as color combinations and typographic complexity. Sears (2001) recognizes that automating the design and evaluation process, or parts thereof, can tackle some usability issues early, thereby reducing development costs. In addition, designers can concentrate more on other parts of the design process. Sears proposes the semi-Automated Interface Designer and Evaluator (AIDE), a metric-based tool that helps designers create and evaluate different user interface layouts. Regarding the visual UI analysis, analogous to Alemerien and Magel (2014) and Fu et al.
(2007), alignment, balance and density metrics are calculated. Comber and Maltby (1995) present an approach that aims to evaluate layout complexity metrics by measuring the usability of different screen designs. A Visual Basic application called Launcher was developed that measures complexity and obtains usability data. A layout has minimum complexity when all objects have the same size and are fully aligned to a grid, whereas maximum complexity is present when every object has a different size and the objects are misaligned. Tullis (1981, 1983) and Bonsiepe (1968) argue that decreasing complexity increases usability. However, previous studies show that users do not prefer ‘simple’ screens. A screen with minimal complexity is considered boring to look at and has difficulties in using size and position properties to express the function of objects. Conversely, a screen with maximum complexity is not good either, as it can be visually confusing and less productive to use (Comber and Maltby, 1995). The results of the study conducted by Comber and Maltby showed differences in usability between screens with different layout complexity. Screens with mid-range complexity scored better than screens at either the low or the high end of the complexity scale. Comber and Maltby conclude that a complexity metric has the potential to indicate screen design problems early in the software development life cycle and to provide directed help to the inexperienced or casual programmer. Ma et al. (2013) present a toolkit for usability testing of mobile applications. The toolkit embeds the ability to automatically collect user interface events as the user interacts with applications. The source code of the app under test must be modified in order to call the toolkit’s event listening methods for logging ongoing user interaction events, as well as their timestamps and the properties of the relevant windows. However, the toolkit proposed by Ma et al.
(2013) does not evaluate the visual appearance complexity of the user interface, but rather navigational issues (e.g. number of backtracks, incorrect flows) encountered by the user. As shown in Table 1, some of the discussed papers aim to analyze the visual structure of the interface, such as the number of controls and their position, size and alignment. However, none of them takes into account consistency and non-structural aspects of the UI, such as color and typography. Other toolkits focus on capturing user interactions such as the number of clicks or scrolls, whereas we focus on the visual aspects of the user interface. Additionally, many of the proposed tools need access to the application’s source code, either referencing the system or even requiring a deep integration of the tool into the application’s source code, which our proposed tool does not. Furthermore, our approach automatically calculates the UI complexity to facilitate continuous improvements in the software design process, which is not the case for all related projects. The automation of the user interface complexity analysis makes it easier and faster for app developers and designers to iterate over different designs and improve them with each iteration. Our approach is also ready to be used with mobile devices, whereas only one of the related projects focused on the analysis of mobile UIs. Whether the mobile application is a native, web or hybrid application does not influence the outcome of our proposed user interface analysis tool, since the analysis is based on the visual appearance and not on the underlying source code. For instance, whether user interface elements such as buttons or text fields are created using layout files or programmatically does not matter with our approach.

Table 1. Comparison of different approaches to determining user interface complexity.

Criteria                  | Taba et al. | Alemerien et al. | Fu et al. | Zen et al. | Sears | Comber et al. | Ma et al.
Calculates visual metrics |      o      |        o         |     o     |     o      |   o   |       o       |    x
Requires no source code   |      x      |        ✓         |     ✓     |     ✓      |   x   |       x       |    x
Provides automation       |      ✓      |        ✓         |     x     |     x      |   ✓   |       ✓       |    ✓
For mobile apps           |      ✓      |        x         |     x     |     x      |   x   |       x       |    ✓

Metrics other than those we use for the visual user interface complexity calculation can also be found. Examples of such metrics are sequence, equilibrium, unity, proportion, rhythm and order (Zen and Vanderdonckt, 2014). However, we chose to use metrics that have been used multiple times and were thoroughly tested. Our research focuses on the visual aspects of the user interface. We do not cover behavioral modeling of user interactions, which is mainly concerned with the economy or efficiency of the user interface. Logging and evaluating interaction data for this purpose has been covered before, e.g. by automate, an Android toolkit for supporting field studies on mobile devices proposed by Holzmann et al. (2017), by StateWebCharts proposed by Winckler and Palanque (2013) or by the automatic detection of bad usability smells by Paternò et al. (2017).

3. CONCEPT

3.1. Overview

In this article, we present a system that can calculate user interface complexity metrics for Android apps. The data needed for calculating the metrics, i.e.
the screenshots of visited app screens as well as the dwell time on each screen, is collected by a background service running on the mobile device. For the calculation of the complexity metrics from the captured screenshots, computer vision techniques are used. Additionally, we describe how the required data is collected and how the calculated metrics are visualized for a convenient evaluation of applications. A main feature of our approach is that it uses an Android background service, the AccessibilityService, to monitor and capture user interaction events. In contrast, an aspect-oriented programming approach would require the developer to have the bytecode of the app, as additional code is compiled into the final APK package file. However, this bytecode is usually not available to third parties. Therefore, an approach that needs no access to the source code or binary file at all gives the most flexibility for usability evaluations. This saves considerable time, as developers do not have to deal with an additional compilation process, which moreover is error-prone. The recorded data is sent to a web server for permanent storage and complexity analysis. Other UI complexity calculation approaches either use libraries that monitor user interactions (Ma et al., 2013) or analyze the layout files by decompiling the Android APK file (Taba et al., 2014). As shown in Fig. 1, the proposed system is split into three major parts:

(i) the Capture component, an Android background service which collects interaction data and sends it to a server;
(ii) the Storage component, a server backend based on Ruby on Rails which represents the API endpoint and stores the collected data in an SQLite database, together with the Visual Analysis component, an OpenCV Java application running on the server backend which is responsible for calculating the UI complexity metrics; and
(iii) the Presentation component, a server frontend based on Node.js which visualizes the metrics.
Transmission of data between the mobile client and the backend is accomplished using HTTP POST requests.

Figure 1. Architecture of the proposed system.

The workflow from capturing the relevant data to calculating the metric values and presenting them is as follows:

- Capturing user interaction: The user interacts with the mobile application while the background service captures screenshots. As no source code changes are made in the target application, the UI complexity analysis is solely based on the captured screenshots.
- Calculating complexity metrics from the visual appearance: The screenshots are used for measuring UI complexity aspects such as the alignment and grouping of objects and the consistent use of colors and fonts throughout the application.
- Displaying complexity metrics: A Kiviat graph is used as the visual representation of the complexity metrics, which provides a convenient means for comparing a user interface with that of competitors’ apps or for analyzing the effect of changes made in the interface.

3.2. Components

The workflow described above is realized with four functional components, which are described in more detail in the following.

3.2.1. Capture component

The Capture component records the user’s interactions by using Android’s AccessibilityService and captures screenshots when interactions occur. The core mechanism is the AccessibilityService, which the user must manually enable in the device settings of the target smartphone. Therefore, the smartphone does not have to be rooted or flashed with a different Android image. Furthermore, any privacy concerns the background service might raise are mitigated, because the user has to activate the service explicitly and can disable it at any time.
Using the onAccessibilityEvent method of the Accessibility API, each interaction triggers a screenshot, which is taken after a short delay to avoid capturing animations or screen transitions. AccessibilityEvents express a state transition in the user interface, such as a button click or a change of the window content. We make use of the following AccessibilityEvents:

- TYPE_TOUCH_INTERACTION_START
- TYPE_TOUCH_INTERACTION_END
- TYPE_VIEW_CLICKED
- TYPE_VIEW_FOCUSED
- TYPE_VIEW_LONG_CLICKED
- TYPE_VIEW_SCROLLED
- TYPE_VIEW_SELECTED
- TYPE_WINDOWS_CHANGED
- TYPE_WINDOW_CONTENT_CHANGED
- TYPE_WINDOW_STATE_CHANGED

We chose these AccessibilityEvents because they cover all required interactions that lead to UI changes (e.g. scrolling a list, activating a checkbox, appearance of a dialog). Every screenshot is assigned an identifier for the subsequent analysis. Throughout this process, the data is temporarily stored on the user’s mobile device. After the user finishes navigating through the app, the collected data is sent to the Storage component on the web server using HTTP POST requests. As there is potentially a large number of screenshots, the collected data is compressed before being uploaded to the web server.

3.2.2. Storage component

The Storage component is part of the web server and is responsible for storing the collected data, i.e. the screenshots, on the file system for further analysis by the Visual Analysis component. The received data, which is packed as a zip file, is unpacked and put into a file directory on the web server.

3.2.3. Visual Analysis component

This component analyzes the user interface based on the captured screenshots of the app. It accesses the relevant data from the Storage component. The computer vision library OpenCV is used for the visual analysis of the captured screenshots.
OpenCV is written in C++, but a Java interface exists, which was used for the implementation. The computational efficiency and the large number of available algorithms made OpenCV the first choice for our prototype. In order to calculate the visual complexity metrics defined below, visual information must be extracted from the screenshots; this process is subsequently described in detail. The analysis is performed on all screenshots automatically. Region detection is used to find relevant parts of the captured screenshots that can be analyzed further. Detected regions can be interaction elements, such as buttons or text fields, but also labels, table cell separators, etc. Each detected region is represented as a rectangle. Figure 2 shows the process of retrieving regions from a screenshot. The first picture shows the original screenshot, the second and third display the regions detected with the vertical and horizontal Sobel filter, respectively, and the final picture shows the combination of the vertical and horizontal Sobel filters, which connects intersecting regions.

Figure 2. Detecting regions using Sobel operations. Regions are displayed as green rectangles.

For region detection purposes, it is sufficient to convert the color image to a grayscale image, because the color information is not necessary for the subsequent edge detection. The edge detection is performed for both vertical and horizontal edges using the Sobel edge detection algorithm. Using convolution operations, the horizontal and vertical edges are recognized, resulting in a new image consisting of the detected edges for further analysis. Subsequently, image thresholding using the Otsu (1979) algorithm is applied, resulting in a black and white image.
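To illustrate the thresholding step, the following is a minimal pure-Python sketch of Otsu’s method, which picks the threshold that maximizes the between-class variance of a grayscale histogram. The prototype itself relies on OpenCV’s built-in implementation; this sketch, including the example image values, is purely illustrative.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    sum_all = sum(i * hist[i] for i in range(levels))
    sum_bg = 0.0      # weighted sum of intensities in the background class
    weight_bg = 0     # number of background pixels so far
    best_t, best_var = 0, -1.0

    for t in range(levels):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # between-class variance for threshold t
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Hypothetical bimodal "image": dark pixels near 10, bright pixels near 200.
image = [10] * 50 + [12] * 50 + [200] * 40 + [205] * 60
t = otsu_threshold(image)
binary = [255 if p > t else 0 for p in image]
```

For a clearly bimodal distribution like this one, the chosen threshold falls between the two modes, so the binarization cleanly separates dark regions from bright ones, which is exactly what the subsequent contour detection needs.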
Afterwards, a morphological operation, dilation, is used to combine edges that are close to each other, so as to avoid having to deal with a large number of small edges. The final step is to find the contours in the black and white image and save each region in a list. Each region has an x/y coordinate on the image as well as a width and height. All user interface complexity calculations, described below in detail, are based on these regions (Fig. 3). The OpenCV operations we most commonly used are grayscale conversion, image thresholding, edge detection, morphological operations (especially dilation) and color detection.

Figure 3. Schematic representation of the layout metrics Element Smallness, Misalignment, Density and Imbalance, from left to right.

A variety of visual aspects is taken into consideration, as explained below in detail. The metrics are calculated from the screenshots captured by the Capture component; whenever the screen content changes, a screenshot is taken. A simple metric is the number of user interface elements. As this metric can be different for each screen, we had to find a way to calculate an overall metric for the respective mobile application. A simple solution would be to determine this metric for each screen and calculate the arithmetic mean, which would be the mean number of UI elements across all screens. However, this approach would not take into account the user’s dwell time on each screen. We consider this an important aspect, as a screen on which the user spends most of the time should have a higher impact on the perceived complexity of the whole application than the other screens. Therefore, we weight each metric x_s (which stands for, e.g.
the number of UI elements U_s or the element smallness E_s), which can be computed separately for each screen s, with the time t_{s,k} user k spent on screen s during a user test, and calculate the weighted arithmetic mean over all S screens. The resulting per-user value is then averaged over all K users with the arithmetic mean to obtain the metric \bar{x}. The formula is shown in Equation (1), where K denotes the number of users, T_k the overall time user k spent in the application and S the number of screens.

\bar{x} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{T_k} \sum_{s=1}^{S} x_s \cdot t_{s,k} \quad (1)

Please note that Equation (1) is used for all of the following metrics except for inconsistency, as that metric is calculated from the variation of all other metric values over all screens of an application.

Layout

The layout metrics are concerned with the structural arrangement of the user interface elements, such as buttons and labels. Based on the works of Galitz (2007), Vanderdonckt and Gillo (1994), Fu et al. (2007), Alemerien and Magel (2014), Zen and Vanderdonckt (2014) and Taba et al. (2014), we have chosen five layout-specific metrics x which are calculated for a certain screen s. However, we found that there was a lack of formulas for calculating these complexity metrics based on the visual appearance. Therefore, we define the visual user interface complexity formulas as follows.

Number of UI elements: This metric, denoted by U_s, sums up all recognized user interface elements, such as buttons or text fields, on a certain screen s. Recognizing the elements is the first and most important step of the visual analysis, as the other metrics described in the following rely on the identification of these UI elements. Each detected UI element is defined by its x/y coordinate on the screen as well as its width and height.

Element smallness: This metric, denoted by E_s, is a measure for the size of the UI elements on a certain screen s. For calculating this metric, the screen is divided into groups.
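Before turning to the details of this grouping, the dwell-time weighting of Equation (1) can be illustrated with a short sketch; all numbers are hypothetical, and the per-screen metric values stand for any of the metrics defined in this section.

```python
# Sketch of Equation (1): a dwell-time-weighted mean over screens,
# averaged over users. metric[s] is the per-screen value x_s (e.g. the
# number of UI elements U_s); dwell[k][s] is the time t_{s,k} user k
# spent on screen s.

def weighted_metric(metric, dwell):
    """Dwell-time-weighted arithmetic mean over screens, averaged over users."""
    total = 0.0
    for user_times in dwell:           # one entry per user k
        t_k = sum(user_times)          # T_k: overall time in the application
        total += sum(x * t for x, t in zip(metric, user_times)) / t_k
    return total / len(dwell)

metric = [20, 5]          # e.g. 20 elements on screen 1, 5 on screen 2
dwell = [
    [90, 10],             # user 1 spends most time on the complex screen
    [10, 90],             # user 2 spends most time on the simple screen
]
print(weighted_metric(metric, dwell))  # 12.5: (18.5 + 6.5) / 2
```

The example shows the intended effect of the weighting: the complex screen dominates the aggregate for user 1 (18.5) and the simple screen dominates it for user 2 (6.5), whereas an unweighted mean would assign both users the same value of 12.5 regardless of dwell time.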
A group g is a defined region of interest (Brinkmann, 1999) on the screen which contains one or more UI elements that belong together. For example, the items in an Android action bar can be considered as belonging to the same group. There are various criteria for the formation of groups, such as a common background color, the position on the screen or functionality. In the first version of our prototype, grouping is done solely based on differing background colors. First, the mean element width and height of each group g of a screen s is calculated from the width_{u,g,s} and height_{u,g,s} of each element u in this group by taking the arithmetic mean over all U_{g,s} elements. For calculating the mean element width \overline{width_s} and height \overline{height_s} for a screen s, we weighted the mean width and height of each group g with the group’s area A_{g,s} and calculated the weighted arithmetic mean over all G_s groups. We used the area as weight because larger groups should have a larger visual impact on the user than smaller groups of elements. The formulas are shown in the following equations.

\overline{width_s} = \frac{1}{A_s} \sum_{g=1}^{G_s} A_{g,s} \frac{1}{U_{g,s}} \sum_{u=1}^{U_{g,s}} width_{u,g,s} \quad (2)

\overline{height_s} = \frac{1}{A_s} \sum_{g=1}^{G_s} A_{g,s} \frac{1}{U_{g,s}} \sum_{u=1}^{U_{g,s}} height_{u,g,s} \quad (3)

Second, \overline{width_s} and \overline{height_s} have to be combined into a single metric E_s. As smaller objects tend to be harder to identify and interact with, it is important that this metric yields a higher score for smaller elements and vice versa. Therefore, we normalized \overline{width_s} and \overline{height_s} between 0 and 1 using the width width_s and height height_s of the screen s, as shown in the following equation.

E_s = \frac{\left(1 - \frac{\overline{width_s}}{width_s}\right) + \left(1 - \frac{\overline{height_s}}{height_s}\right)}{2} \quad (4)

Misalignment: Alignment is concerned with lining up the borders or the centers of the UI elements on a screen. A layout with aligned elements is considered more aesthetic than one with less-aligned elements (Zen and Vanderdonckt, 2014). The possible number of alignments num_{g,s} of elements in a group g on a screen s is calculated by summing up the number of neighboring elements that are either horizontally (i.e.
with their top and/or bottom border) or vertically (i.e. with their left and/or right border) aligned, and those that are centrally aligned. The misalignment metric calculation is done by comparing the positions of the bounding boxes of two neighboring UI elements at a time. The bounding boxes are calculated using edge detection algorithms on the captured screenshots. An alignment is detected by comparing the x/y coordinates, the (x+width) or (y+height) coordinates and/or the center coordinates of the bounding boxes of two neighboring UI elements. For example, $num_{g,s}$ becomes 6 for four equally sized and centrally aligned elements in a row, as three pairs of neighboring elements are both horizontally (with their top and bottom border) and centrally aligned. For differently sized elements, however, this example would result in a value of 3, as the borders of neighboring elements are no longer aligned.

In order to calculate the misalignment metric $M_s$, the mean number of horizontal ($\overline{numh_s}$), vertical ($\overline{numv_s}$) and central alignments ($\overline{numc_s}$) of pairs of neighboring elements of a screen s is calculated. Equation (5) shows the formula for $\overline{numh_s}$. The number of horizontal alignments $numh_{g,s}$ in a group g is related to the number of possible alignments (horizontal, vertical and central) in that group, denoted by $num_{g,s}$. The mean numbers of vertical and central alignments $\overline{numv_s}$ and $\overline{numc_s}$ are calculated analogously. Equation (6) shows the calculation of the misalignment metric $M_s$, which aggregates the horizontal, vertical and central alignment metrics described above.

$$\overline{numh_s} = \frac{1}{A_s}\sum_{g=1}^{G_s} A_{g,s}\,\frac{numh_{g,s}}{num_{g,s}} \quad (5)$$

$$M_s = 1 - \left(\overline{numh_s} + \overline{numv_s} + \overline{numc_s}\right) \quad (6)$$

Density: The density metric shows how cluttered or empty the user interface is. A large density value means that the UI elements in total occupy much space on the screen, which could be confusing to users. A small value for the density metric means that there is much empty space on the screen.
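The pairwise alignment test and the density ratio described above can be sketched as follows (an illustrative sketch; the pixel tolerance parameter is our assumption, not from the paper):

```python
def count_alignments(a, b, tol=1):
    """Count horizontal, vertical and central alignments between two
    neighboring bounding boxes a and b, given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    close = lambda p, q: abs(p - q) <= tol  # tol: pixel tolerance (assumed)
    horizontal = close(ay, by) or close(ay + ah, by + bh)   # top/bottom borders
    vertical = close(ax, bx) or close(ax + aw, bx + bw)     # left/right borders
    central = close(ax + aw / 2, bx + bw / 2) or close(ay + ah / 2, by + bh / 2)
    return int(horizontal) + int(vertical) + int(central)

def density(element_areas, screen_area):
    """Equation (7): total element area relative to the screen area
    (overlapping elements would be counted twice in this sketch)."""
    return sum(element_areas) / screen_area
```

For two equally sized boxes side by side, `count_alignments` yields 2 (borders plus centers aligned); for differently sized boxes that only share a center line, it yields 1, matching the example in the text.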
Equation (7) shows that the areas $A_{u,s}$ of all UI elements $U_s$ are summed up and related to the area $A_s$ of the entire screen s, resulting in the density $D_s$.

$$D_s = \frac{1}{A_s}\sum_{u=1}^{U_s} A_{u,s} \quad (7)$$

Imbalance: The imbalance metric uses the positions of UI elements to determine how they are distributed over the screen. Basically, the imbalance metric calculates the margins between UI elements and their neighboring elements or borders. For each UI element, the distances to the four horizontal and vertical neighbors are considered. A neighbor in this regard is the closest UI element or group border in the respective direction (i.e. left, right, top and bottom). A balanced user interface has a consistent distribution of objects and white space between neighboring objects, i.e. the objects are evenly spread.

For each UI element u of each group g on a screen s, the horizontal margins (i.e. left and right margin, $marl_{u,g,s}$ and $marr_{u,g,s}$, respectively) and vertical margins (i.e. top and bottom margin, $mart_{u,g,s}$ and $marb_{u,g,s}$, respectively) to the neighboring UI elements and borders are measured. Afterwards, the maximum horizontal and vertical margin among all margins within a group is determined. The mean horizontal margin per group is calculated and related to the maximum horizontal margin $marhmax_{g,s}$ in the group, resulting in $balh_{g,s}$ (Equation (8)). This is done analogously with the vertical margins, resulting in $balv_{g,s}$. As done for the previous layout metrics, the mean balance is weighted with the group size $A_{g,s}$, as shown in Equation (9) for the mean horizontal balance $\overline{balh_s}$. The computation of the mean vertical balance $\overline{balv_s}$ is done analogously.

$$balh_{g,s} = \frac{1}{U_{g,s}}\sum_{u=1}^{U_{g,s}} \frac{marl_{u,g,s} + marr_{u,g,s}}{2 \cdot marhmax_{g,s}} \quad (8)$$

$$\overline{balh_s} = \frac{1}{A_s}\sum_{g=1}^{G_s} A_{g,s} \cdot balh_{g,s} \quad (9)$$

In order to calculate the imbalance value, we use the mean horizontal and vertical balance, weight them equally, and finally subtract the result from 1 to retrieve the imbalance $B_s$ (as opposed to the balance), as shown in Equation (10).
$$B_s = 1 - \frac{\overline{balh_s} + \overline{balv_s}}{2} \quad (10)$$

For example, if the horizontal margins in a group are about the same (i.e. close to the maximum horizontal margin), the UI elements are considered horizontally balanced. On the contrary, if UI elements are placed only in the left half of the group, the large horizontal margin to the right border leads to a large imbalance score.

Color complexity

The use of color in a user interface has been shown to improve performance (Kopala, 1981), to improve visual search time (Christ, 1975; Carter, 1982; Nagy et al., 1992), to facilitate organizing information (Engel, 1980) and to aid memory (Marcus, 1986). According to Tullis (1981), color has also created positive user perceptions; it was preferred to monochromatic screens for being less monotonous and reducing eyestrain (Christ, 1975), and is considered more pleasant (Marcus, 1986). However, research has found that as the number of colors on a display increases, the response time to a single color rises and color confusion becomes more likely (Luria et al., 1986). Several studies, such as one conducted by Brooks (1965), have found that the maximum number of colors a person can handle is in the range of 4–10, while lower numbers should be encouraged.

Galitz (2007) and Sidorsky (1982) recognize the importance of color in user interfaces. Color adds dimension, or realism, to screen usability and therefore attracts a user's eye. If used appropriately, it can focus attention on the logical organization of information, facilitate the differentiation of objects, highlight differences among elements, and make displays more pleasing. If used inappropriately, color can be distracting and ultimately diminish the system's usability. Similarly, Lalomia and Happ (1987) assert that effective foreground/background color combinations are of vital importance.
In this study, the color metric measures several aspects of the user interface:

Number of dominant colors $c_1, c_2, \ldots$ on screen s, $numc_s$

Average color combination match of the dominant colors, $\overline{ccm_s}$

Color combination match $c_{h_s,h_{s+1}}$ of the current screen's histogram $h_s$ and the succeeding screen's histogram $h_{s+1}$

The number of dominant colors for screen s, $numc_s$, has to be determined first. Dominant colors are colors that occur more frequently on a pixel basis than non-dominant colors. We applied k-means clustering to detect the dominant colors. As k, i.e. the number of clusters, is not known beforehand, the algorithm starts with k=2 and utilizes the Elbow method to determine the change in variance from k to k+1 (Ketchen and Shook, 1996).

The dominant colors are compared pairwise to calculate all $\binom{numc_s}{2}$ possible color combination matches. The final color combination match metric, $\overline{ccm_s}$, is calculated as the average of these matches. First, the RGB colors are converted into the HSV (hue, saturation, value) color space (Cardani, 2001). The complexity value of the color combination match ($ccm_s$) is calculated as a weighted difference of the H, S and V values of the two colors. The more different the dominant colors on a screen s are, the bigger the value of $\overline{ccm_s}$. After a series of tests, we decided on a strong emphasis on the hue (H) difference (70%) and smaller impacts of the saturation (S) and brightness (V) differences (15% each) as weights for Equation (11). $H_{max}$, $S_{max}$ and $V_{max}$ denote the maximum H, S and V values, depending on the HSV scale used.

$$ccm_{c_1,c_2} = 0.70 \cdot \frac{\delta H_{c_1,c_2}}{H_{max}} + 0.15 \cdot \frac{S_{max} - \delta S_{c_1,c_2}}{S_{max}} + 0.15 \cdot \frac{V_{max} - \delta V_{c_1,c_2}}{V_{max}} \quad (11)$$

If a color pair consists of two grey colors, i.e. they have a saturation of zero, then only the value (brightness) difference $\delta V_{c_1,c_2}$ is considered. For each screen, the number of used colors $numc_s$ as well as the mean color combination match $\overline{ccm_s}$ is determined.
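A minimal sketch of the color combination match of Equation (11), assuming an HSV scale with H in [0, 360] and S, V in [0, 1]; the special case for grey pairs follows the text:

```python
def color_combination_match(c1, c2, h_max=360.0, s_max=1.0, v_max=1.0):
    """Equation (11): weighted HSV difference of two dominant colors
    c1, c2 given as (H, S, V) tuples."""
    dh, ds, dv = (abs(c1[i] - c2[i]) for i in range(3))
    if c1[1] == 0 and c2[1] == 0:   # two grey colors: brightness difference only
        return dv / v_max
    return (0.70 * dh / h_max
            + 0.15 * (s_max - ds) / s_max
            + 0.15 * (v_max - dv) / v_max)
```

Note that this literal reading treats the hue difference linearly; whether a circular hue distance should be used instead is not specified in the text.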
For an app-wide analysis, the color histogram $h_s$ of each screen s is computed. The comparison of the current screen's histogram with the following screen's histogram is performed using the Bhattacharyya distance. If the histograms are similar regarding the distribution of the used colors, the distance is small. This means that even if all individual screens have visually pleasant color combinations, an inconsistent use of colors throughout the app is still detected. Equation (12) shows the calculation of the color complexity metric, where $numc_{max}$ denotes the maximum number of dominant colors on any screen of the whole user interface.

$$C_s = \frac{\frac{numc_s}{numc_{max}} + (1 - \overline{ccm_s}) + c_{h_s,h_{s+1}}}{3} \quad (12)$$

Typographic complexity

Typography plays an important role in application design. In a font-size test conducted by Beymer et al. (2008), significant differences were found between the usage of smaller and larger fonts. The experiment concludes that smaller fonts lead to a higher fixation duration and thus reduce reading performance, while choosing a larger font size increases legibility. Regarding text/background color combinations, Humar et al. (2008) found that higher luminance contrasts resulted in better legibility. One finding was that the polarity (the luminance relation between the text and the background) of the contrast had a major impact on legibility.

The typographic aspects that are analyzed include:

Number of different text sizes, i.e. heights of text elements, $numt_s$

Average text size, $\overline{t_s}$, which is determined by calculating the average height of all text elements found on screen s

Average text foreground/background color combination match, $\overline{ccm_s}$ (see previous subsection)

The calculation of the text foreground/background color combination match, $ccm_s$, also utilizes the k-means clustering algorithm, but this time with k=2 known beforehand, i.e. the foreground color and the background color. The calculation of the color match is the same as described in the previous section.
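The foreground/background extraction with k=2 can be sketched as a tiny k-means over the pixels of a text region (an illustrative sketch with deterministic initialization, not the prototype's implementation):

```python
def two_means(pixels, iterations=10):
    """Estimate the text foreground and background colors by clustering the
    RGB pixels of a text region into k=2 clusters. Initialization uses the
    lexicographically darkest and brightest pixel (our simplification)."""
    centers = [min(pixels), max(pixels)]
    for _ in range(iterations):
        clusters = ([], [])
        for p in pixels:
            # assign each pixel to the nearest center (squared RGB distance)
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # recompute centers as per-channel means; keep old center if empty
        centers = [tuple(sum(ch) / len(c) for ch in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

The two resulting centers would then be fed into the color combination match of Equation (11).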
The typographic complexity metric calculation is shown in Equation (13), where $numt_{max}$ denotes the maximum number of text sizes and $t_{max}$ the maximum text size, both over all screens of the whole user interface.

$$T_s = \frac{\frac{numt_s}{numt_{max}} + \left(1 - \frac{\overline{t_s}}{t_{max}}\right) + (1 - \overline{ccm_s})}{3} \quad (13)$$

Inconsistency

Nielsen (1989) states that one of the most important characteristics of usability is consistency in user interfaces. Consistency can be applied on various levels, including the individual application, across a product family, for all products released by a vendor and for all products running on a specific device (Nielsen, 1989). According to Shneiderman and Plaisant (2004), consistency primarily refers to common action sequences, terms, units, layouts, colors, typography and so on within an application. Galitz (2007) states that a clear and clean organization, which makes use of consistency, makes it easier to recognize a screen's main objects and to ignore its secondary information when needed.

A study examining the effect of consistency of colors, fonts and locations in a website interface on user performance and satisfaction was conducted by AlTaboli and Abou-Zeid (2007). Completion times, numbers of errors and satisfaction scores were measured. Regarding completion times, the study showed that consistency helped achieve better results, but there was no statistically significant difference among the completion times. Furthermore, the study showed that the number of errors increased when fonts and locations were physically inconsistent. Regarding user satisfaction, inconsistencies in color use led to a significant decrease in satisfaction.

While the layout, color and typographic complexity metrics are strongly based on the analysis of a single screen, the inconsistency metric aims to determine if the entire user interface is visually consistent.
Therefore, the above metric values, which are calculated for each screen, are compared in order to find discrepancies such as outliers. This comparison is done by calculating the standard deviation. In order to normalize the result to a 0–1 range, we divide the standard deviation by $xmax_n$, which denotes the maximum value of a certain metric n over all screens (see Equation (14)). $x_{j,n}$ denotes the complexity value of metric n on screen j, J is the number of screens, N is the number of metrics per screen, i.e. 7 (number of UI elements, element smallness, misalignment, density, imbalance, color complexity and typographic complexity), and $\overline{x_n}$ is the mean complexity value of metric n over all screens, e.g. the average density.

$$I = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{xmax_n} \sqrt{\frac{1}{J-1}\sum_{j=1}^{J}\left(x_{j,n} - \overline{x_n}\right)^2} \quad (14)$$

3.2.4. Presentation component

The Presentation component is the web frontend that allows visually exploring the different user interface complexity metrics calculated by the Visual Analysis component. It displays a set of metrics by means of Kiviat graphs. Kiviat graphs allow fast communication of information (Morris, 1974). According to Pinzger et al. (2005), they are well suited for presenting multivariate data such as feature vectors obtained from several releases of source code and release history data. Since a single complexity score may not mean much to a developer or designer, and the evolution of user interfaces is an ongoing process, the real power of metrics lies in their comparability, which enables improvements. Therefore, it should be possible to track the evolution of complexity measures (Pinzger et al., 2005), which tells the developer or designer how the complexity metrics changed over time. Since one graph shows multiple versions at once, Kiviat graphs are an excellent data representation technique for metrics that change over time. Figure 4 shows an example of a Kiviat graph that consists of the complexity metrics used in this study.
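Returning briefly to Equation (14), the inconsistency score can be sketched as follows (variable names are ours; the sketch assumes each metric's maximum over the screens is non-zero):

```python
import math

def inconsistency(per_metric_values):
    """Equation (14): per_metric_values[n] is the list of metric n's values
    over all J screens. Each sample standard deviation is normalized by the
    metric's maximum over all screens, then averaged over the N metrics."""
    total = 0.0
    for values in per_metric_values:
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))
        total += sd / max(values)   # assumes max(values) > 0
    return total / len(per_metric_values)
```

A metric that is constant across screens contributes nothing to the inconsistency, while strong screen-to-screen variation drives the score up.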
The Kiviat graph visualizes the UI complexity of two apps. The solid line displays the metrics of the first app, NFC Tools, while the dashed line displays the metrics of the second app, Ultimate To-Do List (see Section 4.6 for more details). Substantial UI complexity differences were found in the areas of element smallness, color complexity and inconsistency. NFC Tools achieves generally better, i.e. smaller, complexity values. Additionally, Ultimate To-Do List has more UI elements in its groups and smaller UI elements, as represented by the element smallness metric. Ultimate To-Do List is also less consistent, uses more colors and achieves worse color combinations than NFC Tools, resulting in a higher overall complexity. However, NFC Tools scored a higher density complexity, meaning that its user interface is more densely populated.

Figure 4. Kiviat graph visualizing the complexity metrics of two apps, NFC Tools and Ultimate To-Do List.

Such Kiviat graphs make complexity comparisons between different apps and different versions much easier and help identify user interface problems early on.

4. EVALUATION

The goal of the user study was to evaluate whether the presented system is adequate for quantifying the user interface of mobile applications into a set of metrics. The metrics should help designers and developers improve user interfaces. The focus of the user study was to find out whether correlations exist between the calculated UI complexity and the usability as perceived by the participants.

4.1. Participants

As an initial step toward the validation of our approach, we selected 12 participants to take part in the study. Six of them were male and six female. All of them had experience in using an Android device such as the device used for the study.
All volunteers had at one point used an Android phone as their primary phone. Eight participants were currently using an Android phone as their standard phone, four were using an iPhone. The participants were between 22 and 57 years of age (M = 31.0, SD = 10.3). None of the participants had knowledge of HCI, usability testing or related subjects. Furthermore, none of the 12 participants were mobile application developers or designers.

4.2. Apparatus

All participants used the same device for the user study, an LG Nexus 4, with the following features:

Operating system and version: Android 4.4.4

Screen resolution: 1280×768

Display size: 4.7 inches

CPU: 1.5 GHz Qualcomm APQ8064 Snapdragon S4 Pro

RAM: 2 GB

Installed software for the user study: the Android background service app used for capturing the screenshots and the three analyzed apps: NFC Tools, an app for reading, writing and programming tasks on NFC tags; Expense Manager, an app for managing expenditures, checkbooks and budgets; and Ultimate To-Do List, an app for organizing and simplifying the use of basic to-dos and complex projects

The Nexus 4's capabilities are sufficient for interacting with the device while the background service records the relevant data, without the participants experiencing any lag.

4.3. Procedure

We chose the three apps because they were easy to use as well as highly recommended in the Google Play Store. In addition, they showed highly diverging user interface complexity values for our proposed metrics (see Table 2 for the results). We chose the tasks, described below in detail, to be easy for the participants to complete. Thus, we intended to reduce or even eliminate the possibility of participants being too focused on the underlying tasks or excessively navigating around the app; they should rather concentrate on the user interface, as the focus of the presented research is the visual user interface complexity, not task economy or efficiency. Table 2.
Complexity metrics for the three analyzed apps.

Complexity metric       App1  App2  App3
Number of UI elements   0.55  0.70  0.70
Misalignment            0.59  0.58  0.63
Imbalance               0.55  0.52  0.57
Density                 0.38  0.31  0.27
Element smallness       0.16  0.33  0.35
Inconsistency           0.53  0.55  0.69
Color complexity        0.16  0.46  0.56
Typographic complexity  0.37  0.38  0.40

All participants had to fulfill six tasks in total. Three apps were analyzed and each participant was asked to perform two tasks per app. The user study was conducted with one participant at a time; other participants were not present during the procedure. Only the current participant and the observer were present during each session. The observer was responsible for explaining the tasks and the subsequent questionnaires, and was not allowed to help the participant in using the apps to complete the tasks. For each session of the user study, ~1 h was scheduled to complete the entire procedure:

Introduction: After welcoming the participant, they were informed about the purpose of the study as well as the basic idea of the presented system.
Questionnaire about experience: The participant was asked to complete a short pre-study questionnaire to gather information about their mobile phone experience as well as whether they had any knowledge in the HCI domain (summarized in the previous subsection).

Training with the used device: The participant was given the mobile phone used for the user study. The participant could interact with it to familiarize themselves with the device. However, they were not allowed to open the three apps under analysis.

Tasks: For each app, two tasks, described below, had to be performed. After each task, a NASA TLX questionnaire had to be completed by the participant. At the end, the participant was asked to complete a final SUS questionnaire about the perceived usability of the respective app.

Conclusion: The participant was thanked for taking part in the user study.

For NFC Tools, the following tasks had to be completed:

Write the phone number '1234' as entry for the NFC tag. Delete the phone number entry.

Write a task that makes a phone call to the phone number '1234' when you approach the NFC tag. Delete the phone call task.

For Expense Manager, the following tasks had to be completed:

Add a new personal expense in the amount of '10' Dollars, payee is 'Mike' and the category is 'Travel' with its subcategory 'Taxi'. Delete the newly created expense entry.

Change the currency to Euro. Add a new account called 'business'. Add a recurring business activity: income is '100' Dollars monthly and as description put 'salary'.

For Ultimate To-Do List, the following tasks had to be completed:

Add a new item called 'Socks' in the folder 'Xmas' and set its due date to 'Dec. 24, 2016'. Add two more items to the 'Xmas' folder: 'Sweater' and 'Shoes'. Sort the items in the 'Xmas' folder alphabetically, A to Z.

Set the default priority of a newly created task to 'Medium'. Set the default week to start on 'Monday'. Set the default folder to 'Xmas'.

4.4.
Design

The experiment was a 3×2 within-subjects design. In order to avoid order effects, half of the participants started with task 1, the others with task 2. In total, there were 12 participants × 3 apps × 2 tasks/app = 72 trials. For each user, we calculated the set of UI complexity metrics individually (see Section 4.6 for details). The metrics per screen are weighted with the relative duration each participant stayed on the screen. A study by Cowen et al. (2002) shows that there is an interdependence between the duration spent on a screen, expressed by fixation durations measured with eye tracking devices, and usability.

Our hypotheses were that (i) there is a negative correlation between the UI complexity and the SUS scores, i.e. an increase in the UI complexity metrics' values leads to declining SUS scores; and (ii) there is a positive correlation between the UI complexity and the NASA TLX scores, i.e. an increase in the UI complexity metrics' values leads to increasing NASA TLX workload scores.

4.5. Questionnaires

In total, three different questionnaires were used in the user study: one for gathering basic information about the participants and their knowledge and experience, one for reviewing the workload of each task, and one for determining the perceived usability of each app. The first questionnaire was completed at the beginning of the study and focused on the participants' basic information and whether they had experience with HCI or user interface design and development.

4.5.1. NASA task load index

The questionnaire for reviewing the workload of each task was handed out after the participant completed a task. It is identical for all tasks and all apps. The NASA task load index (NASA TLX) is a workload assessment tool based on the subjective answers of the participants. The measured components are Mental Demands, Physical Demands, Temporal Demands, Own Performance, Effort and Frustration.
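For illustration, the overall TLX score is a weighted average of the six scale ratings (a sketch assuming the 1-20 rating scale used in this study and weights from the 15 pairwise comparisons, which sum to 15):

```python
def tlx_score(ratings, weights):
    """Weighted NASA TLX workload score. ratings: six scale ratings (1-20
    here); weights: how often each scale was chosen as the more important
    contributor in the 15 pairwise comparisons."""
    assert len(ratings) == len(weights) == 6 and sum(weights) == 15
    return sum(r * w for r, w in zip(ratings, weights)) / 15
```

Scales judged more important to the task's workload thus pull the overall score toward their ratings.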
The NASA TLX has established itself as a standard tool for determining workload in various human–machine environments. After finishing a task, the participant had to complete the first part of the NASA TLX questionnaire by rating, e.g., how mentally demanding the task was (on a scale from 1 to 20). The second step of the procedure was to compare each of the six components, or scales, with each other, indicating which of the two options represents the more important contributor to the workload of the task. The result is a weighted average of the ratings. It provides information about the impact of each scale (e.g. high mental demand, low frustration) as well as an overall score. NASA TLX has been used in a number of studies in the mobile domain, such as Lawson et al. (2013) and Lumsden and Brewster (2003).

4.5.2. System usability scale

The last questionnaire, which the participants completed after the two tasks per app, was the System Usability Scale (SUS). The SUS is an established tool for determining the usability of a system. Ten questions had to be answered by the participants on a Likert scale ranging from 1 ('Strongly disagree') to 5 ('Strongly agree'). The result is a score between 0 and 100; a score above 68 means that the usability of the app is considered above average (Brooke, 2013). The reasons for choosing the SUS were that it is free, very simple and short. More importantly, it has been found remarkably robust in various studies, such as Bangor et al. (2008), Bangor et al. (2009) and Bevan (2009).

4.6. Results and discussion

Table 2 shows the complexity values for the three apps NFC Tools (App1), Expense Manager (App2) and Ultimate To-Do List (App3). A comparison between NFC Tools and Ultimate To-Do List is shown in Figure 4. It is evident that the user interface of Ultimate To-Do List is more complex based on the calculations in Section 3.2.3.
The results of the user study make it possible to draw interesting conclusions about the proposed system's UI complexity calculation capabilities. They show that there are correlations between the user interface complexity metrics calculated by our prototype and the perceived usability stated by the participants in the SUS and NASA TLX questionnaires. The perceived usability was highest for the NFC Tools app with a SUS score of 77.4, while Expense Manager received the second highest score, 67.2, and Ultimate To-Do List achieved the lowest SUS score, 55.5. Regarding the NASA TLX results, NFC Tools received an average score of 12.1 over both tasks, while Expense Manager and Ultimate To-Do List received average scores of 14.9 and 19.0, respectively.

In order to find out if and how much the UI complexity metrics could have contributed to this result, we calculated the Spearman correlation for each metric in relation to the SUS as well as the NASA TLX scores. A large absolute value of the Spearman correlation coefficient ρ indicates a strong rank correlation between the variables. The correlation results are shown in Table 3. The correlation between the Inconsistency metric and both the SUS and NASA TLX scores is statistically significant (P ≤ 0.05) for all apps. A larger inconsistency complexity metric resulted in lower SUS scores and higher NASA TLX workload scores, which is consistent with Nielsen (1989) and Shneiderman and Plaisant (2004): consistency throughout the user interface is preferred by users and has a positive impact on usability.

Table 3. Spearman correlations between the UI complexity metrics and the SUS and NASA TLX scores.
Complexity metric       SUS—App1            SUS—App2            SUS—App3             TLX—App1            TLX—App2            TLX—App3
Number of UI elements   ρ=−0.459 (p=0.125)  ρ=−0.645* (p=0.050) ρ=−0.780* (p=0.041)  ρ=0.144 (p=0.540)   ρ=0.510* (p=0.050)  ρ=0.521 (p=0.076)
Misalignment            ρ=−0.505 (p=0.199)  ρ=−0.188 (p=0.600)  ρ=−0.420 (p=0.311)   ρ=0.376 (p=0.185)   ρ=0.609* (p=0.012)  ρ=0.547* (p=0.031)
Imbalance               ρ=−0.736* (p=0.045) ρ=−0.087 (p=0.910)  ρ=−0.310 (p=0.456)   ρ=0.183 (p=0.597)   ρ=0.522* (p=0.050)  ρ=0.595* (p=0.015)
Density                 ρ=−0.786* (p=0.021) ρ=0.000 (p=0.900)   ρ=−0.873** (p=0.005) ρ=0.501* (p=0.045)  ρ=0.730** (p=0.001) ρ=0.694** (p=0.002)
Element smallness       ρ=−0.640 (p=0.068)  ρ=−0.033 (p=0.930)  ρ=−0.811* (p=0.025)  ρ=0.406 (p=0.061)   ρ=0.594** (p=0.010) ρ=0.700** (p=0.004)
Inconsistency           ρ=−0.812* (p=0.012) ρ=−0.845* (p=0.013) ρ=−0.800* (p=0.017)  ρ=0.312* (p=0.042)  ρ=0.668** (p=0.004) ρ=0.784** (p=0.002)
Color complexity        ρ=−0.190 (p=0.555)  ρ=0.052 (p=0.540)   ρ=−0.241 (p=0.644)   ρ=0.401 (p=0.092)   ρ=0.722* (p=0.034)  ρ=0.155* (p=0.040)
Typographic complexity  ρ=−0.833* (p=0.022) ρ=−0.780* (p=0.050) ρ=−0.166 (p=0.500)   ρ=0.031 (p=0.650)   ρ=0.596* (p=0.020)  ρ=0.623* (p=0.049)

Statistically significant results (P ≤ 0.05) are indicated by *, and statistically highly significant (P ≤ 0.01) results are denoted by **.

The Density metric achieved significant results for NFC Tools and Ultimate To-Do List regarding the SUS scores. Additionally, the correlations of the density metric with the NASA TLX scores were significant for all researched apps. We can therefore draw the conclusion that user interfaces that are cluttered with content, such as buttons or text fields, are perceived as more complex by users. Screens that provide a 'cleaner' interface are therefore preferred.

The Number of UI Elements complexity showed mixed results. For Expense Manager, the results are statistically significant regarding both the SUS and the TLX score. In this case, the large number of UI elements had a negative impact on the perceived usability. A possible reason why no significant correlations were found for NFC Tools could be that the participants did not view the number of UI elements in this app as important, especially considering that the inconsistency correlation with the NASA TLX score for Ultimate To-Do List was larger than those for the other two apps.

Regarding the Misalignment complexity, statistically significant correlations with the NASA TLX score were found for Expense Manager and Ultimate To-Do List. For the SUS scores, we did not find significant correlations. The NASA TLX results show that participants reported lower mental demand for NFC Tools and Expense Manager, which also had smaller Misalignment complexity scores than Ultimate To-Do List.
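The rank correlations reported in Table 3 can be reproduced with a small helper like the following (an illustrative sketch for tie-free data; in practice `scipy.stats.spearmanr` would typically be used, which also supplies the p-values):

```python
def spearman_rho(x, y):
    """Spearman rank correlation without tie correction: rank both samples,
    then apply 1 - 6*sum(d^2) / (n*(n^2-1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value near −1 corresponds to the hypothesized behavior for SUS (higher complexity, lower usability score), a value near +1 to that for the TLX workload.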
The Imbalance metric showed a significant correlation with the SUS score of NFC Tools. Additionally, there were significant correlations between the Imbalance complexity and the NASA TLX scores for Expense Manager and Ultimate To-Do List. Similar results were obtained for the Typographic Complexity: increasing typographic complexity led to declining SUS scores and increasing NASA TLX scores for all three apps. However, the SUS correlation for Ultimate To-Do List and the NASA TLX correlation for NFC Tools were not statistically significant. The Color Complexity metric showed mixed results. The NASA TLX scores of the three apps increased with the color complexity, but the correlations with the SUS scores were not significant. Such discrepancies may occur if users perceived the typography resp. the colors as either a large or a small part of the user interface. Figure 5 shows screenshots from the three analyzed apps: the first screenshot is from NFC Tools, the second from Expense Manager, and the third and fourth from Ultimate To-Do List. The Kiviat graphs below them show the corresponding UI complexity values of the screenshots. Notably, there are significant differences not only between different apps, but also between different screens within an app. The characteristics of Kiviat graphs make it easy to quickly compare multiple sets of metrics: a larger area indicates higher UI complexity, and it is evident that the first screen in Fig. 5 has the lowest UI complexity overall. Additionally, ‘outliers’ can be spotted quickly; for example, the second screen shows a high density, as displayed by the corresponding Kiviat graph. Furthermore, the UI complexities of screens 3 and 4 differ. By analyzing each screen individually, it is possible to find screens with certain (unwanted) characteristics, e.g. a large number of UI elements compared to the rest of the screens. Figure 5.
Selected screenshots and their corresponding Kiviat graphs. Note that in comparison to Fig. 4, the consistency values have been omitted in Fig. 5, as they are calculated for the entire application based on single-screen metrics. The visual analysis procedure took ~100 ms on average per screenshot. Since the tasks the participants had to perform in the three apps were chosen to involve similar navigational behavior, the number of collected screenshots per task was approximately equal, at M=27.7, SD=2.5. The screenshot size was 84.1 KB on average (SD=23.2 KB). The question remains: what does a designer or developer do with the complexity metrics for a given mobile application? Depending on the results of the complexity calculation, the designer or developer adapts the user interface. For example, if the user interface of the app turns out to be too dense (i.e. a large Density complexity value), the next step would be to remove user interface elements, introduce more screens to distribute the elements over, etc. Similarly, if the Imbalance is too large, the designer or developer can lower this complexity by balancing the UI elements. While we can see that there are interesting correlations between the visual user interface complexity and the perceived usability, we face two problems. First, the optimal values for the proposed metrics are not known yet. In future studies, we intend to determine the optimum complexity for each metric; as Comber and Maltby (1995) propose, the optimal complexity value is not necessarily zero, but rather a mid-range value, e.g. 0.45 or 0.53. Second, we do not know how much each complexity value contributes to the overall perceived usability. Is Imbalance a bigger usability issue than the (mis)use of Color or Typography?
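The adaptation loop sketched above (reduce Density by removing elements, reduce Imbalance by re-centering) presupposes computable metric definitions. The paper’s exact formulas are not reproduced in this excerpt, so the following is a minimal sketch under two common, Tullis-style assumptions: Density as the fraction of the screen covered by element bounding boxes, and Imbalance as the normalized offset of the elements’ area-weighted center of mass from the screen center.

```python
def density(elements, screen_w, screen_h):
    """Fraction of the screen covered by element bounding boxes (x, y, w, h).
    Overlaps are counted per element for simplicity, so this is an upper bound."""
    covered = sum(w * h for (_, _, w, h) in elements)
    return min(covered / (screen_w * screen_h), 1.0)

def imbalance(elements, screen_w, screen_h):
    """Normalized distance of the elements' area-weighted center of mass
    from the screen center (0 = perfectly balanced, 1 = maximally off-center)."""
    total = sum(w * h for (_, _, w, h) in elements)
    cx = sum((x + w / 2) * w * h for (x, y, w, h) in elements) / total
    cy = sum((y + h / 2) * w * h for (x, y, w, h) in elements) / total
    return (abs(cx - screen_w / 2) / (screen_w / 2)
            + abs(cy - screen_h / 2) / (screen_h / 2)) / 2

# Hypothetical 1080x1920 screen with two buttons pushed to the top-left corner.
elems = [(0, 0, 400, 200), (0, 250, 400, 200)]
print(round(density(elems, 1080, 1920), 3))    # → 0.077
print(round(imbalance(elems, 1080, 1920), 3))  # → 0.698
```

With such functions, the suggested feedback is mechanical: if `imbalance` is large, moving the same elements toward the screen center drives the score toward 0 without changing their number or size.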
In future studies, we want to investigate whether there are ‘ranks’ for the complexity metrics, so that designers and developers can prioritize visual user interface complexity issues. As shown in Fig. 4, we can compare the mobile UIs of different apps. The presented system also serves as a diagnostic tool, indicating whether certain complexity metrics deviate much more than others in order to facilitate improvements. Figure 5 shows the comparison of individual screens and their respective visual user interface complexity. Our proposed tool can therefore also be used for A/B studies that compare multiple different screen designs. 5. CONCLUSION The final implementation of the presented prototyping system enables app developers and designers to evaluate the user interfaces of Android applications based on metrics that focus on the visual aspects of the user interface. To evaluate the presented concept and the implemented prototype, we conducted a user study in which 12 participants had to perform two tasks in each of the three analyzed apps. First of all, this study showed that the system runs stably and captures all relevant data for further analysis. The main finding of the study was that there are significant correlations between some of the calculated complexity metrics and the usability as perceived by the participants. The biggest open question that emerged from the study is how much each metric contributes to the perceived usability and what the ideal values for the metrics should be. Besides extending the presented system’s functionality, the further development of the analysis tool should primarily focus on improving the current implementation based on the knowledge gained from the user study. The presented system should additionally be extended to work on other mobile operating systems such as iOS. Another improvement of the presented system would be to assign weights to particular screens.
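Such per-screen weighting reduces to a weighted mean over the screens of an app. The sketch below is illustrative; the screen names, complexity scores and weights are hypothetical, not values from the study.

```python
def app_complexity(screen_scores, weights):
    """Weighted mean of per-screen complexity scores. Weights let critical
    screens (e.g. a checkout screen) count more than rarely seen ones."""
    total = sum(weights[s] for s in screen_scores)
    return sum(screen_scores[s] * weights[s] for s in screen_scores) / total

# Hypothetical Density scores per screen and importance weights.
scores = {"main": 0.30, "settings": 0.60, "checkout": 0.45}
weights = {"main": 3.0, "settings": 1.0, "checkout": 2.0}
print(round(app_complexity(scores, weights), 3))  # → 0.4
```

Compared with the unweighted mean (0.45 here), the heavily weighted ‘main’ screen pulls the aggregate down, reflecting that users spend most of their time on it.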
This would allow the developer or designer to emphasize certain screens that are more important than others. Weighting critical screens would yield even more meaningful results than assigning weights to specific metrics in general. Another aspect for future work would be to conduct an empirical study with app developers, the potential customers of the presented system. They could give further feedback about their desired build automation and testing pipeline, which could include the user interface complexity calculation alongside existing processes such as crash testing or unit testing. A further improvement would be an integration into a seamless user interface development environment for designers, containing tools that help the designer define tasks, choose and organize appropriate UI widgets, and analyze the whole user interface. This can be improved even further by giving recommendations after the automatic user interface analysis. For example, if a screen has a large Imbalance metric value, the automatic recommendation would be to center specific user interface elements in order to reach a lower Imbalance complexity score. Another example would be the usage of many font sizes throughout the application, which leads to an increased typographic complexity score; in that case, the recommendation would be to limit the number of different font sizes.
ACKNOWLEDGEMENTS The presented research is conducted within the Austrian project “LEEFF (Low Emission Electric Freight Fleets)”.
FUNDING The Austrian Research Promotion Agency (FFG) (contract number 853768).
REFERENCES
Alemerien, K. and Magel, K. (2014) GUIEvaluator: A Metric-tool for Evaluating the Complexity of Graphical User Interfaces. In The 26th Int. Conf. Software Engineering and Knowledge Engineering, Hyatt Regency, Vancouver, BC, Canada, July 1–3, 2014. pp. 13–18.
Alsmadi, I. and Al-Kaabi, M.
(2009) The introduction of several user interface structural metrics to make test automation more effective. Open Softw. Eng. J., 3, 72–77.
AlTaboli, A. and Abou-Zeid, M.R. (2007) Effect of Physical Consistency of Web Interface Design on Users’ Performance and Satisfaction. In Human–Computer Interaction. HCI Applications and Services, 12th Int. Conf. HCI International 2007, Beijing, China, July 22–27, 2007, Proceedings, Part IV. pp. 849–858.
Bangor, A., Kortum, B. and Miller, J. (2008) An empirical evaluation of the system usability scale. J. Hum. Comput. Interact., 24, 574–594.
Bangor, A., Kortum, B. and Miller, J. (2009) Determining what individual SUS scores mean: adding an adjective rating scale. J. Usability Stud., 4, 114–123.
Bevan, N. (2009) International standards for usability should be more widely used. J. Usability Stud., 4, 106–113.
Bevan, N. and Curson, I. (1997) Methods for Measuring Usability. In Human–Computer Interaction, INTERACT ’97, IFIP TC13 International Conference on Human–Computer Interaction, 14th–18th July 1997, Sydney, Australia. pp. 672–673.
Beymer, D., Russell, D.M. and Orton, P.Z. (2008) An Eye Tracking Study of How Font Size and Type Influence Online Reading. pp. 15–18.
Bonsiepe, G. (1968) A method of quantifying order in typographic design. J. Typogr. Res., 2, 203–220.
Brinkmann, R. (1999) The Art and Science of Digital Compositing. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Brooke, J. (2013) SUS: a retrospective. J. Usability Stud., 8, 29–40.
Brooks, R. (1965) Search time and color coding. Psychon. Sci., 2, 281–282.
Cardani, D. (2001) Adventures in HSV Space. Laboratorio de Robótica, Instituto Tecnológico Autónomo de México.
Carter, R.C. (1982) Visual search with color. J. Exp. Psychol. Hum. Percept. Perform., 8, 127–136.
Christ, R.E. (1975) Review and analysis of color coding research for visual displays. Hum. Factors, 17, 542–570.
Comber, T. and Maltby, J.R. (1995) Evaluating Usability of Screen Design With Layout Complexity.
Cowen, L., Ball, J.S.L. and Delin, J. (2005) An Eye Movement Analysis of Web Page Usability. In People and Computers XVI—Memorable Yet Invisible: Proceedings of HCI 2002. pp. 317–335.
Engel, F.L. (1980) Information Selection from Visual Displays. In Ergonomic Aspects of Visual Display Terminals.
Fu, F.-L., Chiu, S.-Y. and Su, C.H. (2007) Measuring the Screen Complexity of Web Pages. In Human Interface and the Management of Information. Interacting in Information Environments, Symposium on Human Interface 2007, Held as Part of HCI International 2007, Beijing, China, July 22–27, 2007, Proceedings, Part II. pp. 720–729.
Galitz, W.O. (2007) The Essential Guide to User Interface Design: An Introduction to GUI Design Principles and Techniques (3rd edn). John Wiley & Sons, Inc., New York, NY, USA.
Gong, J. and Tarasewich, P. (2004) Guidelines for handheld mobile device interface design. In Proceedings of DSI 2004 Annual Meeting. pp. 3751–3756.
Halstead, M.H. (1977) Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA.
Henry, S.M. and Kafura, D.G. (1981) Software structure metrics based on information flow. IEEE Trans. Softw. Eng., 7, 510–518.
Henry, S.M. and Selig, C. (1990) Predicting source-code complexity at the design stage. IEEE Softw., 7, 36–44.
Holzmann, C., Riegler, A., Steiner, D. and Grossauer, C. (2017) An Android Toolkit for Supporting Field Studies on Mobile Devices. In Proc. 16th Int. Conf. Mobile and Ubiquitous Multimedia.
Humar, I., Gradisar, M. and Turk, T. (2008) The impact of color combinations on the legibility of a web page text presented on CRT displays. Int. J. Ind. Ergon., 38, 885–899.
Kellogg, W.A. (1989) The Dimensions of Consistency. In Coordinating User Interfaces for Consistency. Academic Press Professional, Inc., San Diego, CA, USA, pp. 9–20.
Ketchen, D.J. and Shook, C.L. (1996) The application of cluster analysis in strategic management research: an analysis and critique. Strateg. Manage. J., 17, 441–458.
Kopala, C.J. (1981) The Use of Color Coded Symbols in a Highly Dense Situation Display. In Proc. Human Factors Society 23rd Annual Meeting.
Lalomia, M.J. and Happ, A.J. (1987) The Effective Use of Color for Text on the IBM 5153 Color Display. In Proc. Human Factors and Ergonomics Society Annual Meeting, 31(10). pp. 1091–1095.
Lawson, S., Jamison-Powell, S., Garbett, A., Linehan, C., Kucharczyk, E., Verbaan, S., Rowland, D. and Morgan, K. (2013) Validating a Mobile Phone Application for the Everyday, Unobtrusive, Objective Measurement of Sleep. In Proc. SIGCHI Conference on Human Factors in Computing Systems. pp. 2497–2506.
Lumsden, J. and Brewster, S. (2003) A Paradigm Shift: Alternative Interaction Techniques for Use with Mobile & Wearable Devices. In Proc. 2003 Conference of the Centre for Advanced Studies on Collaborative Research. pp. 197–210.
Luria, S.M., Neri, D.F. and Jacobsen, A.R. (1986) The effects of set size on color matching using CRT displays. Hum. Factors, 28, 49–61.
Ma, X., Yan, B., Chen, G., Zhang, C., Huang, K., Drury, J.L. and Wang, L. (2013) Design and implementation of a toolkit for usability testing of mobile apps. MONET, 18, 81–97.
Marcus, A. (1986) Ten Commandments of Color. Computer Graphics Today.
McCabe, T.J. (1976) A complexity measure. IEEE Trans. Softw. Eng., SE-2, 308–320.
Morris, M.F. (1974) Kiviat graphs: conventions and ‘figures of merit’. SIGMETRICS Perform. Eval. Rev., 3, 2–8.
Nagy, A.L. and Sanchez, R.R. (1992) Chromaticity and luminance as coding dimensions in visual search. Hum. Factors, 34, 601–614.
Nielsen, J. (ed.) (1989) Coordinating User Interfaces for Consistency. Academic Press Professional, Inc., San Diego, CA, USA.
Nielsen, J. (1992) The usability engineering life cycle. Computer, 25, 12–22.
Otsu, N. (1979) A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern., 9, 62–66.
Paternò, F., Schiavone, A.G. and Conti, A. (2017) Customizable Automatic Detection of Bad Usability Smells in Mobile Accessed Web Applications. In Proc. 19th Int. Conf. Human-Computer Interaction with Mobile Devices and Services. pp. 1–11.
Pinzger, M., Gall, H.C., Fischer, M. and Lanza, M. (2005) Visualizing multiple evolution metrics. In Proc. ACM 2005 Symposium on Software Visualization, St. Louis, Missouri, USA, May 14–15. pp. 67–75.
Riegler, A. and Holzmann, C. (2015) UI-CAT: Calculating User Interface Complexity Metrics for Mobile Applications. In Proc. 14th Int. Conf. Mobile and Ubiquitous Multimedia, Linz, Austria, November 30–December 2. pp. 390–394.
Sears, A. (2001) AIDE: A Tool to Assist in the Design and Evaluation of User Interfaces.
Shneiderman, B. and Plaisant, C. (2004) Designing the User Interface: Strategies for Effective Human-Computer Interaction (4th edn). Addison Wesley Longman, Boston, MA, USA.
Sidorsky, R.C. (1982) Color Coding in Tactical Displays: Help or Hindrance. Army Research Institute Research Report.
Taba, S.E.S., Keivanloo, I., Zou, Y., Ng, J.W. and Ng, T. (2014) An Exploratory Study on the Relation between User Interface Complexity and the Perceived Quality. In Web Engineering, 14th Int. Conf., ICWE 2014, Toulouse, France, July 1–4, 2014, Proceedings. pp. 370–379.
Tullis, T.S. (1981) An evaluation of alphanumeric, graphic, and color information displays. Hum. Factors, 23, 541–550.
Tullis, T.S. (1983) The formatting of alphanumeric displays: a review and analysis. Hum. Factors, 25, 657–682.
Tullis, T. and Albert, W. (2013) Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics (2nd edn). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Vanderdonckt, J. and Gillo, X. (1994) Visual Techniques for Traditional and Multimedia Layouts. In Proc. Workshop on Advanced Visual Interfaces, AVI 1994, Bari, Italy, June 1–4, 1994. pp. 95–104.
Vannieuwenborg, F. (2012) Business models for the mobile application market from a developer’s viewpoint. In Intelligence in Next Generation Networks, ICIN 2012. pp. 171–178.
Winckler, M. and Palanque, P. (2013) StateWebCharts: A Formal Description Technique Dedicated to Navigation Modelling of Web Applications. In Interactive Systems. Design, Specification, and Verification: 10th Int. Workshop, DSV-IS 2003, Funchal, Madeira Island, Portugal, June 11–13, 2003. pp. 61–76.
Zen, M. and Vanderdonckt, J. (2014) Towards an Evaluation of Graphical User Interfaces Aesthetics Based on Metrics. In IEEE 8th Int. Conf. Research Challenges in Information Science, RCIS 2014, Marrakech, Morocco, May 28–30, 2014. pp. 1–12.
Zhang, D. and Adipat, B. (2005) Challenges, methodologies, and issues in the usability testing of mobile applications. Int. J. Hum. Comput. Interact., 18, 293–308.
Footnotes
1 https://opencv.org/
2 https://play.google.com/store/apps/details?id=com.wakdev.wdnfc&hl=en
3 https://play.google.com/store/apps/details?id=com.expensemanager&hl=en
4 https://play.google.com/store/apps/details?id=com.customsolutions.android.utl&hl=en
Author notes Editorial Board Member: Dr Fabio Paternò
© The Author(s) 2018. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
Interacting with Computers – Oxford University Press
Published: Apr 10, 2018