SmiDCA: An Anti-Smishing Model with Machine Learning Approach

Abstract

Phishing has become a serious cyber-security issue, spreading through media such as e-mail and SMS to capture victims' critical profile information. Although many novel anti-phishing techniques have been developed to forestall phishing, it remains an unresolved issue. Smishing is an incarnation of the phishing attack that uses Short Message Service (SMS), or simple text messages on mobile phones, to lure victims into revealing their online credentials. This paper presents an anti-smishing model entitled 'SmiDCA' (SMIshing Detection based on Correlation Algorithm). For the proposed model, smishing messages were collected from various sources and 39 distinct features were extracted initially. The SmiDCA model incorporates dimensionality reduction, and machine learning experiments were conducted both before feature selection (BFSA) and after feature selection (AFSA). The model was validated with experiments on both English and non-English datasets, and the results of both experiments are encouraging in terms of accuracy: 96.40% for the English dataset and 90.33% for the non-English dataset. In addition, the model achieved an accuracy of 96.16% even after nearly half of the features were pruned.

1. INTRODUCTION

Short Message Service (SMS), or simple text messaging, is an electronic messaging service that has become a major channel for sharing short texts among mobile phone users. SMS is used for both personal and official purposes. It has become common practice for well-known organizations to use SMS to communicate with their customers, and 75% of people prefer SMS communications for deliveries, promotions and surveys [1]. The increasing trend in text messaging, from the Text Message Statistics released by the Statistic Brain Research Institute on 17 September 2017, is shown in Fig. 1 [2].
With the exponential growth of text messages, violations such as phishing messages and spam are also increasing. Various studies have shown that, apart from being annoying to users, these cause substantial monetary harm to both individuals and organizations [3–6].

Figure 1. Number of text messages (in billions).

Smishing is a variant of phishing in which smishers (attackers who use SMS for phishing) send text messages to victims' smartphones that appear similar to genuine messages. Many users fall prey to these messages and disclose sensitive credentials such as user IDs and passwords [7–9]. The Mobile Messaging Fraud Report reveals that 28% of SMS users receive an unsolicited text message every day [10]. According to the security firm Cloudmark, around 30 million smishing messages are sent to mobile users across North America and Europe.1 Most attackers favor text messages over e-mails because text messages are cheaper: with an inexpensive SMS package, they can send a large number of messages to victims [11]. Moreover, text messages have higher response rates than e-mails. According to business2community, the average open rate of a text message is approximately 99%, compared with 28% to 33% for e-mail; in addition, the click rate of an included link is approximately 36% for a text message, while for e-mail it is between 6% and 7% [12]. Furthermore, it is considerably harder to distinguish a legitimate URL from a smishing URL on the comparatively small display of a mobile phone [13].
A smishing attack broadly involves either of the following two activities [14–16]:

Phone conversation: The attackers send a smishing message to the victims concerning purchase points of interest, exchanges, discounts and the like, along with a phone number, so that the victims respond through the phone number and the attackers request their credentials.

Embedded URL: In this technique, the attackers embed a malicious URL in the text message and send it to the victims. Once a victim views the message and clicks the URL, malicious code is installed on the victim's phone. In addition, some attackers embed a phishing website URL in the text message.

The literature reveals that there are a few strategies that concentrate solely on analyzing smishing messages and providing significant techniques to detect them. Among the existing techniques, most researchers prefer whitelist [17–19] and blacklist approaches [20, 21] to combat phishing attacks. However, the major drawback of the blacklist is that it requires considerable time to update the list [22]; smishing campaigns accomplish their task very quickly and then remove their links. The whitelist is somewhat useful for real-time experiments; however, its major problem is that any number not on the whitelist is regarded as phishing SMS [23]. Hence, we propose a model based on a heuristic approach [24, 25], which is fast and has a low false-negative rate. In the heuristic-based approach, the model extracts discriminating features from both legitimate SMS and smishing SMS and builds a training dataset. When a user receives a new SMS, the model compares it with the training dataset using a machine learning algorithm and predicts whether the message is smishing or legitimate.
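The heuristic workflow above can be illustrated with a minimal sketch. The regex patterns, keyword list and two-feature decision rule below are illustrative assumptions, not the paper's trained classifier; they merely show the extract-features-then-predict shape of the approach.

```python
import re

# Hypothetical keyword list; the paper derives its own top-20 list (Table 1).
KEYWORDS = {"account", "refund", "update", "card", "bank"}

def extract_features(sms: str) -> dict:
    """Extract a few binary smishing indicators from one message."""
    tokens = sms.lower().split()
    return {
        "has_url": int(bool(re.search(r"https?://|www\.", sms, re.I))),
        "has_phone": int(bool(re.search(r"\b\d{10,11}\b", sms))),
        "has_currency": int(bool(re.search(r"[$£]\d+", sms))),
        "has_keyword": int(any(t.strip(".,!?") in KEYWORDS for t in tokens)),
    }

def toy_score(features: dict) -> str:
    # Stand-in for the machine learning classifier: flag a message
    # when two or more suspicious features co-occur.
    return "smishing" if sum(features.values()) >= 2 else "legitimate"

msg = "Your bank account is locked. Verify at http://example.com or call 07123456789"
print(toy_score(extract_features(msg)))  # → smishing
```

In the full model these binary indicators are only a subset of the 39 features, and the threshold rule is replaced by a trained classifier.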
However, the major drawback of this approach is the size of the feature space: it contains a large number of features, some of which are relevant and some irrelevant to the task at hand. Therefore, the motivation of this paper is to develop a model that detects smishing messages by extracting specific features from them, with a feature selection method applied to filter out the less relevant features from the feature corpus. The reason for stopping the phishing attack at the message level instead of at the website level is prevention. For a website-level strategy to work, the user must click the website link, whereas most links in attack messages point to a malicious file instead of an actual website; on clicking the link, the malware is installed automatically on the user's mobile. Hence, stopping phishing at the message level is a better solution than detection at the website level. The major objectives of this paper are as follows:

To detect smishing messages with the proposed anti-smishing model entitled SmiDCA.
To extract several features from smishing messages using text mining, NLP and readability algorithms.
To select relevant features from these features using the correlation algorithm.
To provide a prototype of the model, from the provider's point of view, for real-time implementation on a machine.
To search for an effective feature set among the relevant features by incorporating learning algorithms along with the search algorithm.
To verify the model through experiments on both English and non-English datasets.

The remainder of this paper is organized as follows: Section 2 presents an overview of related work. The proposed model is explained in Section 3. Section 4 presents the experimental results, and Section 5 discusses them. The paper is summarized in Section 6.

2.
RELATED WORK

Smishing has gained increased attention among researchers in recent years [26, 27]. Many studies have emphasized the importance of awareness education to combat smishing messages [13, 28], and a few have proposed models to detect them. The short URL is one of the novel phishing/smishing attack vectors: users are unable to perceive the features of the linked information, and it is exceptionally hard to verify which file or web page a short URL links to. A method was proposed [29] that embeds destination information when generating a short URL so that the user can verify whether the destination is a web page or a file. On analyzing the web page, the method measures and evaluates its risk and decides whether to block the short URL according to a threshold, thereby preventing attacks. Several researchers have provided multi-filter models that amalgamate multiple approaches to prevent phishing. One recent model, 'PhiDMA', was proposed by [25]; it incorporates five layers: an auto-upgrade whitelist layer, a URL features layer, a lexical signature layer, a string matching layer and an accessibility score comparison layer. Moreover, the authors developed a prototype of this model that assists persons with visual impairments by including an accessible interface. The experimental results show that the model achieved an accuracy of 92.72%. Another study, carried out by Baek et al. [30] on the basis of time periods during the day, found that the highest peaks of spam messages were sent at 10 am and 4 pm. They likewise studied word frequency and showed that the words candidate, congressmen, election, candidate number and information occurred with very high frequency. Finally, they analyzed the contents of the spam sent by each spammer in order to distinguish smishing.
To identify a smishing message, they searched for the keyword associated with the URL; if the keyword was found, the message was regarded as smishing. In another study on detecting smishing messages [31], the authors collected seven features, namely words, size, misspellings, part-of-speech, and the presence of a phone number and a URL. The experimental results show that the Random Forest (RF) classifier achieved an accuracy of 92%. A cloud-based virtual environment was employed to detect smishing messages and unknown malware [32]. The authors identified suspicious URLs by checking whether a URL belongs to a downloaded APK record or an application without a source. The method likewise increases the likelihood of smishing identification by utilizing filtering. A mobile-phone-specific anti-phishing solution was presented by Foozy et al. [33]. They identified the important criteria for improving techniques to combat phishing attacks on mobile devices, noting that attackers employ mobile-device phishing such as Bluetooth phishing, smishing, vishing and mobile web application phishing. In addition, Pandey et al. [34] presented a text-mining and data-mining-based technique that extracted 23 keywords from phishing and non-phishing e-mails; they then selected 12 keywords using t-statistic-based feature selection and conducted experiments with multiple machine learning classifiers. One of the most recent methods, named S-Detector (Smishing detector), was proposed to differentiate normal text messages from smishing messages [35]. This model employed a morphological analyzer to extract the nouns frequently used in smishing text messages, and a Naïve Bayes classifier was used for filtering. The experimental results show that this model provides security, availability and reliability in preventing more intelligent and more malicious security threats.
A novel approach based on probabilistic neural networks (PNNs) and K-medoids clustering was proposed to detect phishing websites [36]. Principal component analysis (PCA) was applied to reduce the dimensionality of the feature space. The experiment was carried out with 30 features, and the results show that the model achieved nearly a 40% reduction in complexity with approximately 97% accuracy. In our proposed anti-smishing model SmiDCA, 39 specific features were extracted by analyzing smishing messages. Well-known machine learning algorithms were applied to the feature set built from these features to evaluate performance. In addition, the model incorporates a feature selection algorithm to reduce the dimensionality of the features, and the performance before and after feature selection is compared. From experiments on a non-English text message dataset, it can be concluded that the model is able to detect non-English smishing messages as well.

3. METHODOLOGY

The methodology of SmiDCA is shown in Fig. 2. The model initially investigates the data and extracts the distinct features. Subsequently, it ranks all the features using the correlation algorithm and generates a subset by adding the highest-ranked features one by one and sending them to the learning algorithm. The learning algorithm evaluates the accuracy and checks whether it is better than the previous accuracy. If the accuracy has increased, the next highest-ranked feature is added to the subset; otherwise, the process terminates.

Figure 2. Methodology of SmiDCA.

3.1.
Feature analysis

In this section, the model analyzes the smishing messages, which were collected from different sources such as the publicly available smishing dataset [37]:

Words features: The word features deal with the words frequently employed by smishers to cheat victims. The model carried out the following steps to find the frequent words: tokenize on white space; convert all words to lowercase; eliminate stop words such as a, an, is; and apply the term-frequency algorithm, as shown in the following equation:

tf(t,d) = f(t,d) / |d| (1)

where f(t,d) is the frequency of term t in document d and |d| is the total number of terms in d. The model builds the word list features from the top 20 keywords, as shown in Table 1.

URL feature: One of the strongest characteristics of a smishing message is that smishers embed a URL in the message so that victims visit a phishing site that looks comparable to the genuine site, as shown in Fig. 3 [37]. In the analysis, the model found that URLs were present in 53.68% of smishing messages.

E-mail-id feature: Another essential feature for perceiving a smishing message is that smishers send the text message along with an e-mail-id so that they can request credentials through e-mail, as shown in Fig. 4 [37]. The smishing e-mail-id is crafted to look genuine so that victims are convinced. The model first searches for the special character '@' in the document, as every e-mail-id contains it, and then examines the prefix and suffix of this character. In the examination, the model found that 11.57% of smishing messages contained an e-mail-id.

Phone number feature: A telephone number is also a noteworthy feature of smishing messages. Often, smishers send the text message with a telephone number through which they can request credentials from the victims, as shown in Fig. 5 [37].
The model first looks for numeric characters in the document. If no numeric character is present, it terminates; otherwise, it searches the numeric characters for a phone number pattern.

Size feature: One of the basic features of smishing messages is their size. Figure 6 shows the histogram of sizes of smishing and legitimate messages.

Special character feature: Most legitimate organizations exclude special characters of any type from their messages. However, most smishing messages contain special characters such as '$100' and '£'. The model first checks for the presence of a special character in the document. If none is present, it terminates; otherwise, it examines whether a numeric character follows the special character. If so, this is taken as the feature value.

Readability text feature: Readability measures the ease of understanding English text. Companies and organizations employ their own writing styles and, most of the time, trained content writers verify messages before they are sent to customers. Several researchers have employed readability features to identify phishing or spam e-mails [38, 39]. In this paper, the model adopts the six readability algorithms shown below:

Automated readability index: The automated readability index calculates a readability score based on the understandability of English text [40]. The equation of the automated readability index is:

ARI = 4.71 (C/W) + 0.5 (W/S) − 21.43 (2)

where C is the number of characters and numbers, W is the number of words and S is the number of sentences.
Flesch–Kincaid readability test: Rudolf Flesch developed the Flesch–Kincaid readability test, which indicates how difficult an English text is to understand [41]. Two tests are used: the Flesch–Kincaid Grade Level and the Flesch Reading-Ease Score. The Flesch–Kincaid Grade Level is given by:

FKGL = 0.39 (W/S) + 11.8 (Sy/W) − 15.59 (3)

where W is the total number of words, S the total number of sentences and Sy the total number of syllables. The Flesch Reading-Ease Score (FRES) is given by:

FRES = 206.835 − 1.015 (W/S) − 84.6 (Sy/W) (4)

with W, S and Sy as above.

Gunning Fog Index: Robert Gunning, an American businessman, developed this readability test [42]. The following steps are used to calculate it: select a passage of around 100 words without omitting any sentences; determine the average sentence length (divide the number of words by the number of sentences); count the 'complex' words of three or more syllables; add the average sentence length and the percentage of complex words; and multiply the result by 0.4. The equation of the Gunning Fog Index is:

GFI = 0.4 [ (W/S) + 100 (CW/W) ] (5)

where W is the number of words, S the number of sentences and CW the number of complex words.

SMOG Index: McLaughlin [43] developed the SMOG index, which is widely used for checking health messages. The SMOG readability score is:

SMOG = 1.0430 √(P × 30/S) + 3.1291 (6)

where P is the number of polysyllables and S the number of sentences.

Coleman–Liau Index (CLI): Coleman and Liau [44] developed the CLI to calculate a readability score:

CLI = 0.0588 L − 0.296 S − 15.8 (7)

where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.
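Three of the readability measures above, Equations (2)–(4), can be sketched directly in Python. The tokenization and the vowel-group syllable counter below are crude assumptions for illustration; production readability tools use proper tokenizers and syllable dictionaries.

```python
import re

def counts(text: str):
    """Naive word/sentence/character/syllable counts (illustrative only)."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    chars = sum(len(w) for w in words)
    # Crude syllable estimate: runs of vowels per word (an assumption).
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return len(words), sentences, chars, syllables

def ari(text):   # Equation (2)
    W, S, C, _ = counts(text)
    return 4.71 * C / W + 0.5 * W / S - 21.43

def fkgl(text):  # Equation (3)
    W, S, _, Sy = counts(text)
    return 0.39 * W / S + 11.8 * Sy / W - 15.59

def fres(text):  # Equation (4)
    W, S, _, Sy = counts(text)
    return 206.835 - 1.015 * W / S - 84.6 * Sy / W

sample = "Please update your account details today. Click the link to verify."
print(round(ari(sample), 2), round(fkgl(sample), 2), round(fres(sample), 2))
```

In the model, each such score becomes one numeric entry (features f34–f39 of Table 2) in the message's feature vector.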
Misspelled word feature: Misspelled words are also used in our feature corpus because a majority of smishers send misspelled messages to victims. For this feature, the model first tokenizes on the white space between words and then removes special characters such as ?, # from the document. As our smishing dataset contains American English messages, the model verifies the words against an American English dictionary. The feature is computed using the following equation:

C(m,d) = f(m,d) / t(w,d) (8)

where C(m,d) computes the proportion of misspelled words m in document d, f(m,d) is the frequency of misspelled words in the document and t(w,d) is the total number of words in the document.

Part-of-speech tagging features: Part-of-speech tagging analyzes words into parts of speech such as noun and verb. The model employed the Natural Language Toolkit (NLTK), a Python package, to identify POS tags [45–48]. Assume W = {w1, w2, …, wn} is the set of n words in an original document and P = {p1, p2, …, pn}, i.e. {noun, pronoun, …}, the set of parts of speech. The model evaluates the POS tag of each word wi ∈ W, generates a new tag document T = {twp1, twp2, …, twpn} and then counts the frequency of the tags in T.

The summary of the features is shown in Table 2.

Figure 3. Smishing through URL.
Figure 4. Smishing through e-mail-id.
Figure 5. Smishing through phone number.
Figure 6. Size of smishing and legitimate messages.

Table 1. Frequency of words in smishing dataset.
Keywords     Frequency    Keywords     Frequency
Please       31.58        Sms          10.53
Account      18.95        Customer      9.47
Card         16.84        E-mail        9.47
Apple        16.84        Details       9.47
Update       16.84        Iphone        8.42
Online       14.74        Bank          8.42
Link         11.58        Message       8.42
Call         10.53        Store         8.42
Today        10.53        Nationwide    8.42
Refund       10.53        Due           8.42

Table 2. Summary of the features.
Features     Feature name          Data-type  Description
⟨f1…f20⟩     Words feature         {0,1}      Smishing messages contain frequent words like Please, Sms, Account, Customer, Card, E-mail, Apple, Details, Update, Iphone, Online, Bank, Link, Message, Call, Store, Today, Nationwide, Refund, Due
⟨f21⟩        Size of message       Numeric    The sizes of smishing and legitimate messages differ
⟨f22⟩        E-mail-id feature     {0,1}      Most smishing messages contain an e-mail-id to ask for credentials
⟨f23⟩        URL feature           {0,1}      A URL is embedded in smishing messages to direct victims to phishing sites
⟨f24⟩        Phone number feature  {0,1}      A phone number is sent along with the message so that attackers can contact victims
⟨f25⟩        Special character     {0,1}      Most smishing messages contain special characters like $ and £ to attract users
⟨f26⟩        Misspelled word       Numeric    Legitimate brands rarely send messages with misspelled words
⟨f27…f33⟩    Part of speech        Numeric    The part-of-speech profile (noun, pronoun, adjective, verb, adverb, preposition, conjunction) of most smishing messages is distinct from legitimate messages
⟨f34…f39⟩    Readability of text   Numeric    The text styles of smishing and legitimate messages differ

3.2. Machine learning classifiers

Several machine learning classification techniques are used to distinguish smishing messages from legitimate messages. The classifiers learn from a set of features, called the training dataset, and predict the output; here, the model classifies a message as smishing or legitimate by learning the features of smishing and legitimate messages. In this paper, the proposed method initially evaluated the performance of four well-known classifiers, namely Random Forest [49–52], Decision Tree [53], Support Vector Machine [54–56] and AdaBoost [57–59], on the given dataset. Subsequently, the model selects the classifier with the best performance and applies it within the feature selection algorithm.

3.2.1.
Cross-validation

Cross-validation validates classifiers by partitioning the dataset into complementary subsets, where one subset is chosen as the test set and the others as the training set. K-fold cross-validation divides the dataset into k equal subsets; each time, one of the k subsets is taken as the test set and the other k−1 subsets are assembled to form the training set. Assume nk = {c1, c2, …, ck} are the k parts. Cross-validation (CV) is given by:

CV_k = (1/K) Σ_{i=1}^{K} E_i (9)

where E_i = (1/n_k) Σ_{j∈c_i} (y_j − ȳ_j)² is the mean squared error (MSE) on the ith part.

3.3. Pearson correlation coefficient

The Pearson correlation coefficient (PCC) measures the linear correlation between two features [60, 61]. It distinguishes three cases: perfect positive linear correlation is 1, no linear correlation is 0 and perfect negative linear correlation is −1. Several researchers have used the PCC to select relevant features [62, 63]. Assume X = {x1, x2, …, xn} and Y = {y1, y2, …, yn} are two feature vectors. The PCC, denoted ρ, is defined by:

ρ(X,Y) = cov(X,Y) / (σ_X σ_Y) (10)

where cov(X,Y) is the covariance of X and Y, and σ_X and σ_Y are the standard deviations of X and Y:

cov(X,Y) = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) (11)

where x̄ and ȳ are the means of X and Y, e.g.

x̄ = (1/n) Σ_{i=1}^{n} x_i (12)

The standard deviation σ of X is defined by:

σ(X) = √( Σ_{i=1}^{n} (x_i − x̄)² / (n−1) ) (13)

Using Algorithm 1, the model computes the ranking scores of the features, as explained in Section 4.3.

3.4. SmiDCA prototype model

This section describes and evaluates the prototype of SmiDCA. The high-level architecture of the model is shown in Fig. 7, and the specifications of the machine are as follows: Operating System: Ubuntu 17.10 (64-bit); Processor: Intel Core i3 CPU 530; RAM: 4 GB; Disk: 75 GB. The model is developed in Python; hence, the interface is designed using the Tkinter package.
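The PCC ranking of Section 3.3, Equations (10)–(13), can be sketched in a few lines of Python. The two toy feature vectors and the class-label vector below are assumptions for illustration only.

```python
from math import sqrt

def pcc(x, y):
    """Pearson correlation coefficient, Equations (10)-(13)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

# Ranking features against the class label (toy data):
label   = [1, 1, 0, 0, 1, 0]
has_url = [1, 1, 0, 0, 1, 0]   # perfectly correlated with the label
size    = [9, 8, 9, 2, 7, 1]   # weakly correlated with the label
scores = {"has_url": pcc(has_url, label), "size": pcc(size, label)}
print(sorted(scores, key=scores.get, reverse=True))  # → ['has_url', 'size']
```

The scores y_i produced this way are what Algorithm 1 consumes when it adds features to the subset in descending rank order.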
Figure 7. SmiDCA prototype.

Figure 7 shows that the model has two interfaces: a client interface and a system interface. The client interface displays the sender's telephone number, as shown in Fig. 8(a). If the sender's number is already in the user's telephone directory, the user can view the message directly by tapping the Skip button; otherwise, the user can verify the message by tapping the Verify button. On tapping the Verify button, the proposed model receives the message as input and sends it to the machine learning classifier. The classifier returns feedback on the message, whether smishing or legitimate, to the system interface. Figure 8(b) shows a message recognized as legitimate in the system interface.

Figure 8. (a) Legitimate message and (b) SmiDCA identified as legitimate.

The system interface provides two buttons: a Go Back button and a Proceed button. The Go Back button deletes the message without sending it to the user's inbox, and the Proceed button allows the user to access the message. As the message was identified as legitimate by the proposed model, the user could access it. Subsequently, we verified the model with a smishing message, as shown in Fig. 9(a); the model detected the message as smishing, as shown in Fig. 9(b), and guidance on handling the buttons was provided to the user. The source code of the prototype implementation is uploaded to a GitHub repository and is available at the following URL: https://github.com/gsonowal20/SmiDCA.

Figure 9. (a) Smishing message and (b) SmiDCA detected the smishing message.
The objective of the prototype implementation is to demonstrate the SmiDCA approach; in its present state, the user interface of the prototype is not optimized. One of the important interface-related obstacles indicated in the feedback received from a pilot study is the need to perform two taps (a tap on the Verify button and a tap on either the Go Back or the Proceed button). This two-tap process can be converted into a single-tap process with a change in workflow, which has been pipelined for implementation in the next version.

4. RESULT EVALUATION

4.1. Data collection

We collected English SMS data from Almeida [64], as shown in Table 3.

Table 3. English text messages.

Total SMS  Smishing SMS  Legitimate SMS
5578       747           4831

In addition, we gathered non-English data from Yadav et al. [65], as shown in Table 4.

Table 4. Non-English text messages.

Total SMS  Smishing SMS  Legitimate SMS
1893       898           995

4.2. Feature extraction

From the smishing messages, the model extracted 39 different features, as shown in Table 2, and employed machine learning algorithms to classify smishing messages from legitimate messages. The model therefore generates a matrix in which each row corresponds to a message and each column to a feature; each cell holds the value of the corresponding feature for the corresponding message. Assume D = {d1, d2, …, dn} denotes the n documents and F = {f1, f2, …, fn} the n-dimensional feature vector space, and let cij be the value of the jth feature of the ith document.
Therefore, each document is represented by the row vector Ci = (ci1, ci2, …, cin) and each feature by the column vector Cj = (c1j, c2j, …, cnj), where i = 1, 2, …, n indexes documents and j indexes features.

4.3. Feature selection algorithm

There are R-dimensional features in the feature space. Assume that F = {fi | i = 0, 1, 2, …, n} is the original set of features, where F ⊆ Rn and F ≠ ∅. The model evaluated the PCC scores (Y) of all the features against the decision attribute (d) using the following equation:

yi ← PCC(fi, d)   (14)

where fi ∈ F, yi ∈ Y, Y = {y0, y1, y2, …, yn} and d denotes the decision (class) attribute. Based on these scores, the model employed the sequential forward selection algorithm [66–68], which is shown in Algorithm 1.

Algorithm 1. SmiDCA algorithm
Input: Y = {y0, y1, y2, …, yn}, x ← ∅, k ← 0, flag ← 0, Tacc ← 0, Tmax: threshold
Output: Xk = {xj | j = 1, 2, …, k; xj ∈ Y}
1  x+ ← max(yi)
2  xk+1 ← xk + x+
3  Cacc ← Acc(xk+1)
4  if Cacc > Tacc then
5    flag ← 0
6    Tacc ← Cacc
7    k ← k + 1
8    go to step 1
9  else
10   if flag > Tmax then
11     Xk ← xk
12     return Xk
13   else
14     flag ← flag + 1
15     go to step 1
16   end
17 end

The sequential forward algorithm begins with an empty set of
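A minimal Python sketch of Algorithm 1 follows. The accuracy function `Acc` is supplied by the caller (in the paper it is the cross-validated accuracy of the chosen classifier); the toy `acc` used in the example is invented purely for illustration:

```python
def smidca_select(scores, acc, t_max=3):
    """Sequential forward selection driven by PCC scores (Algorithm 1 sketch).

    scores: dict mapping feature name -> PCC score with the class label.
    acc:    callable returning classifier accuracy for a feature subset.
    t_max:  consecutive non-improving additions tolerated before stopping.
    """
    remaining = dict(scores)
    selected, t_acc, flag = [], 0.0, 0
    while remaining:
        # Add the unused feature with the highest correlation score.
        best = max(remaining, key=remaining.get)
        del remaining[best]
        candidate = selected + [best]
        c_acc = acc(candidate)
        if c_acc > t_acc:
            # Improvement: keep the feature and reset the flag.
            selected, t_acc, flag = candidate, c_acc, 0
        else:
            # No improvement: discard the feature; stop after t_max misses.
            flag += 1
            if flag >= t_max:
                break
    return selected

# Toy accuracy function: subsets containing f1 and f2 score best.
acc = lambda s: 0.9 + 0.01 * len({"f1", "f2"} & set(s))
print(smidca_select({"f1": 0.8, "f2": 0.6, "f3": 0.1}, acc))  # ['f1', 'f2']
```

Discarding a non-improving feature rather than keeping it mirrors the algorithm's behaviour of not incrementing k on the else branch.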
features (x ← ∅) whose subset size is zero (k ← 0), and subsequently adds features one by one. The model adds the feature with the highest linear correlation coefficient score (x+ ← max(yi)); however, the significant difficulty with this algorithm is deciding where to stop, as there is an extensive number of features in the feature corpus. Therefore, the model incorporates a machine learning algorithm and maintains one threshold (Tmax). The model initially sets the target accuracy score to zero (Tacc ← 0); after adding a new feature to the set (xk), it evaluates the accuracy (Cacc ← Acc(xk+1)). If the current accuracy score is greater than the target accuracy score, the current accuracy score is assigned to the target accuracy (Tacc ← Cacc); otherwise the flag value is increased. If the flag value rises above the threshold (Tmax), the loop terminates and the best feature set is returned.

4.4. Evaluation metrics

The proposed method employs a set of metrics to measure the performance of the four machine learning classifiers. Assume Nham is the number of legitimate messages and Nphish is the number of smishing messages. Four parameters are used to compute the metrics: Nphish→phish = TP, the number of smishing messages correctly classified as smishing; Nham→ham = TN, the number of legitimate messages correctly classified as legitimate; Nphish→ham = FN, the number of smishing messages misclassified as legitimate; and Nham→phish = FP, the number of legitimate messages misclassified as smishing.
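From these four counts, accuracy, precision, recall and F1-score can be computed directly; a small sketch with invented counts:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts, not the paper's experimental results.
acc, prec, rec, f1 = metrics(tp=70, tn=480, fp=5, fn=20)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# 0.9565 0.9333 0.7778 0.8485
```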
The four performance metrics are defined below.

Accuracy: the overall proportion correctly classified, as shown in the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (15)

Precision: the number of true positives (TP) over the number of true positives plus the number of false positives (FP):

Precision = TP / (TP + FP)   (16)

Recall: the number of true positives (TP) over the number of true positives plus the number of false negatives (FN):

Recall = TP / (TP + FN)   (17)

F1-score: the harmonic mean of precision and recall:

F1-score = 2 · Precision · Recall / (Precision + Recall)   (18)

4.5. Experimental results

Once feature extraction was completed, the model evaluated performance using two schemes. The first, Before Feature Selection Algorithm (BFSA), uses all features without distinguishing relevant from irrelevant ones and evaluates the performance. The second, After Feature Selection Algorithm (AFSA), applies the feature selection algorithm and then evaluates the performance. The model measured performance using 10-fold cross-validation (Section 3.2.1) over the four well-known machine learning classifiers described in Section 3.2. First, the model evaluated the performance of BFSA; the result of the experiment is shown in Table 5. The BFSA result shows that the Random Forest classifier achieved the highest accuracy among the classifiers. Therefore, the Random Forest classifier was chosen for the second scheme, AFSA, and a comparison was made between BFSA and AFSA in terms of feature dimension and performance with the same learning algorithm.

Table 5. Performance of the BFSA.
Classifiers | Precision | Recall | F1-score | Accuracy
Random Forest classifier | 94.71 | 77.40 | 85.10 | 96.40
Decision Tree classifier | 81.23 | 80.88 | 81.00 | 94.91
AdaBoost classifier | 86.98 | 80.75 | 83.68 | 95.79
Support vector machine classifier | 87.78 | 78.21 | 82.64 | 95.59

In AFSA, the model selected relevant features using the correlation algorithm explained in Section 3.3. Subsequently, the model employed the sequential forward algorithm (shown in Algorithm 1) to ascertain the best feature subset. In this algorithm, the model added features in order of relevance and stopped adding at the point where no further improvement occurred; beyond that point, the accuracy continuously decreased. Figure 10 shows the performance of AFSA. In this figure, the x-axis shows the features added and the y-axis shows the performance of the feature subsets.

Figure 10. AFSA performance.
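As an illustration of the BFSA-style evaluation, the 10-fold cross-validation of the four classifiers can be sketched with scikit-learn on synthetic data standing in for the paper's 39-feature matrix (the hyperparameters here are library defaults, not the paper's settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 39-feature message matrix (y: 1 = smishing).
X, y = make_classification(n_samples=300, n_features=39, random_state=0)

classifiers = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```

The reported numbers will differ from Table 5, which was produced on the real SMS feature matrix.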
The experiment shows that BFSA used all 39 features to achieve 96.40% accuracy, while AFSA selected only 20 features and achieved 96.16% accuracy (as shown in Table 6). The table demonstrates that both schemes achieved similar accuracy, with BFSA scoring slightly higher than AFSA. The rate of reduction of the feature dimension is given by the following equation:

Rate of reduction = ((Tf − Tr) / Tf) × 100   (19)

where Tf is the original feature dimension and Tr is the feature dimension retained after the feature selection algorithm. Using this equation, the rate of reduction = ((39 − 20) / 39) × 100 = 48.72%. Therefore, the feature selection algorithm reduced the feature dimension by 48.72%.

Table 6. Comparison between BFSA and AFSA.
Feature selection algorithm | Features selected for classification | Number of features used | Accuracy (%)
BFSA | {f1, f2, f3, …, f38, f39} | 39 | 96.40
AFSA | {f24, f39, f15, f34, f38, f29, f23, f1, f3, f9, f30, f17, f32, f4, f36, f27, f37, f21, f33, f14} | 20 | 96.16

The model computed the precision and the recall of both BFSA and AFSA.
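Equation (19) amounts to a one-line computation, for example:

```python
def rate_of_reduction(total_features, retained_features):
    """Percentage of feature dimensions removed by feature selection."""
    return (total_features - retained_features) / total_features * 100

# 39 original features, 20 retained after selection.
print(round(rate_of_reduction(39, 20), 2))  # 48.72
```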
F1-measure was used to evaluate efficiency, as shown in Fig. 11. Although the accuracy of BFSA is marginally ahead of AFSA, the efficiency of AFSA is better than that of BFSA.

Figure 11. Efficiency of both algorithms.

4.6. Performance on the non-English dataset

Figure 12 shows the performance of the proposed model on a standard Indian-language dataset [65]. In this experiment, we found that the model achieved 90.33% accuracy with the Random Forest algorithm, slightly lower than on the English-language dataset. We then compared the performance on the English and non-English text-message datasets, as illustrated in Fig. 13. It can be inferred from these metrics that the proposed model has the potential to handle non-English languages as well, with an acceptable accuracy level.

Figure 12. Non-English text-message performance.

Figure 13. Comparison between English and non-English performances.

5. DISCUSSION

This paper has proposed a novel anti-smishing model entitled SmiDCA to detect smishing messages. The primary objective of the proposed model is to distinguish smishing messages from legitimate messages. The proposed model examined the smishing messages and extracted 39 features. Machine learning models were constructed with and without dimensionality reduction of the features. The correlation algorithm computed the linear correlation score of every feature and ranked the features by score. As several features had high correlation scores, the model employed the sequential forward feature selection algorithm to select the best feature subset.
The sequential forward feature selection algorithm adds features one by one. However, the model faced one issue: the upper limit on the number of features that should be added to the set. To overcome this issue, the model employed a machine learning algorithm with a threshold value, as explained in Section 4.3. Our experiment started by incrementally adding features. Initially, with three features, the accuracy was 88.98%. Adding three more features raised the accuracy to 93.44% (six features), and in this manner the model gradually increased the feature-set size, obtaining 93.69% (eight features), 94.34% (12 features), 94.71% (15 features), 95.64% (18 features) and finally 96.16% with 20 features. Adding further features reduced the accuracy: 96.09% (21 features), 96.02% (22 features) and 95.82% (23 features). We therefore terminated the investigation, because the proposed model follows the policy that if it observes the same or decreasing accuracy for three consecutive iterations, the expansion process is terminated. The experimental results showed that the 20-feature subset achieved the best accuracy. The comparison between BFSA and AFSA shows that BFSA achieved slightly better accuracy than AFSA; however, comparing feature counts, BFSA used all 39 features while AFSA selected only 20. Hence, the experiments indicate that more features are not necessary for better accuracy; what matters is the most effective combination of features. In addition, the feature selection algorithm reduced the feature dimension by 48.72% with the help of the correlation algorithm and the sequential forward technique for dimensionality reduction.
Additionally, considering the efficiency of both schemes, we found that AFSA offered better performance than BFSA. Moreover, the model was also evaluated on non-English data, achieving 90.33% accuracy with the Random Forest algorithm, compared with 96.40% on the English dataset. To verify the performance of the model, a comparative analysis between the English and non-English datasets was provided. It was inferred that the model can handle both English and non-English datasets.

6. CONCLUSION

Smishing is a critical cyber attack that has been increasing in recent years. This paper proposed a novel anti-smishing model entitled SmiDCA, which analyzed smishing messages and extracted 39 features to detect them. In addition, four well-known machine learning algorithms were applied to classify smishing messages from legitimate messages. The experimental results show that the model achieved an accuracy of 96.40% with the Random Forest classifier under BFSA. Furthermore, the model was evaluated on a non-English-language dataset, achieving 90.33% accuracy with the Random Forest algorithm. From the results of the investigation, it can be concluded that the model has the capacity to deal with both English and non-English datasets. The model reduced the feature dimension of the English dataset by 48.72%, and accuracy was evaluated as 96.16% under AFSA. In addition, the efficiency comparison of the two schemes shows that AFSA provided better efficiency than BFSA. In future work, we will add more features and use novel feature selection algorithms to enhance the accuracy. Deep learning-based approaches will also be explored as a future direction, and the user interface of the prototype can be enhanced further.

REFERENCES

1 Michael (2016) eretailers: Text consumers or risk irrelevancy in 2017 (accessed 2017).
2 Statisticbrain (2017) Text message statistics (accessed 2018).
3 Silva, R.M., Alberto, T.C., Almeida, T.A. and Yamakami, A. (2017) Towards filtering undesired short text messages using an online learning approach with semantic indexing. Expert Syst. Appl., 83, 314–325.
4 Hong, T., Choi, C. and Shin, J. (2018) CNN-based malicious user detection in social networks. Concurr. Comput. Pract. Exp., 30, e4163.
5 Mun, H.-J. and Han, K.-H. (2016) Blackhole attack: user identity and password seize attack using honeypot. J. Comput. Virol. Hacking Tech., 12, 185–190.
6 Sonowal, G., Kuppusamy, K.S. and Kumar, A. (2017) Usability evaluation of active anti-phishing browser extensions for persons with visual impairments. 2017 4th Int. Conf. Advanced Computing and Communication Systems (ICACCS), January, pp. 1–6. IEEE.
7 McAfee (2012) Protect yourself from smishing (accessed 2017).
8 genisyscu (2017) Smishing - text messaging scams (accessed 2017).
9 Hiremath, R., Malle, M. and Patil, P. (2016) Cellular network fraud & security, jamming attack and defenses. Proc. Comput. Sci., 78, 233–240.
10 Clxcommunications (2016) Mobile message fraud report (accessed 2017).
11 Delany, S.J., Buckley, M. and Greene, D. (2012) SMS spam filtering: methods and data. Expert Syst. Appl., 39, 9899–9908.
12 Baglia, M. (2015) Text marketing vs. email marketing: which one packs a bigger punch (accessed 2017).
13 Canova, G., Volkamer, M., Bergmann, C., Borza, R., Reinheimer, B., Stockhardt, S. and Tenberg, R. (2015) Learn to spot phishing URLs with the Android NoPhish app. IFIP World Conf. Information Security Education, pp. 87–100. Springer.
14 Kang, A., Dong Lee, J., Kang, W.M., Barolli, L. and Park, J.H.
(2014) Security considerations for smart phone smishing attacks. In Jeong, H.Y., Obaidat, M.S., Yen, N.Y. and Park, J.J.J.H. (eds.) Advances in Computer Science and its Applications, pp. 467–473. Springer, Berlin, Heidelberg.
15 Shahriar, H., Klintic, T. and Clincy, V. (2015) Mobile phishing attacks and mitigation techniques. J. Inf. Secur., 6, 206.
16 Moon, S.-h. and Park, D.-w. (2016) Forensic analysis of MERS smishing hacking attacks and prevention. Int. J. Secur. Appl., 10, 181–192.
17 Mahmoud, T.M. and Mahfouz, A.M. (2012) SMS spam filtering technique based on artificial immune system. IJCSI Int. J. Comput. Sci. Issues, 9, 589–597.
18 Belabed, A., Aïmeur, E. and Chikh, A. (2012) A personalized whitelist approach for phishing webpage detection. 2012 Seventh Int. Conf. Availability, Reliability and Security, August, pp. 249–254. IEEE.
19 Kang, J. and Lee, D. (2007) Advanced white list approach for preventing access to phishing sites. 2007 Int. Conf. Convergence Information Technology (ICCIT 2007), November, pp. 491–496. IEEE.
20 Sharifi, M. and Siadati, S.H. (2008) A phishing sites blacklist generator. 2008 IEEE/ACS Int. Conf. Computer Systems and Applications, March, pp. 840–843. IEEE.
21 Prakash, P., Kumar, M., Kompella, R.R. and Gupta, M. (2010) PhishNet: predictive blacklisting to detect phishing attacks. 2010 Proc. IEEE INFOCOM, March, pp. 1–5. IEEE.
22 Gastellier-Prevost, S., Granadillo, G.G. and Laurent, M. (2011) Decisive heuristics to differentiate legitimate from phishing sites. 2011 Conf. Network and Information Systems Security, May, pp. 1–9. IEEE.
23 Cao, Y., Han, W. and Le, Y. (2008) Anti-phishing based on automated individual white-list. Proc. 4th ACM Workshop on Digital Identity Management, New York, NY, USA, DIM '08, pp. 51–60. ACM.
24 Mohammad, R.M., Thabtah, F. and McCluskey, L. (2014) Intelligent rule-based phishing websites classification. IET Inf. Secur., 8, 153–160.
25 Sonowal, G. and Kuppusamy, K. (2017) PhiDMA – a phishing detection model with multi-filter approach. J. King Saud Univ. Comput. Inf. Sci., 29, 1–15.
26 Banu, M.N. and Banu, S.M. (2013) A comprehensive study of phishing attacks. Int. J. Comput. Sci. Inf. Technol., 4, 783–786.
27 Kang, J.W., Lee, A.R. and Kim, B. (2016) Improving security awareness about smishing through experiment on the optimistic bias on risk perception. J. Korea Inst. Inf. Secur. Cryptology, 26, 475–487.
28 Baslyman, M. and Chiasson, S. (2016) 'Smells phishy?': an educational game about online phishing scams. 2016 APWG Symp. Electronic Crime Research (eCrime), June, pp. 1–11. IEEE.
29 Mun, H.-J. and Li, Y. (2017) Secure short URL generation method that recognizes risk of target URL. Wireless Pers. Commun., 93, 269–283.
30 Baek, M., Lee, Y. and Won, Y. (2017) Property analysis of SMS spam using text mining. In Park, J.J.J.H., Chen, S.-C. and Raymond Choo, K.-K. (eds.) Advanced Multimedia and Ubiquitous Engineering, pp. 67–73. Springer, Singapore.
31 Nair, A.E.S. (2013) Distributed System for Smishing Detection. Southern Methodist University, Dallas, TX.
32 Lee, A., Kim, K., Lee, H. and Jun, M. (2016) A study on realtime detecting smishing on cloud computing environments. In Park, J.J.J.H., Chao, H.-C., Arabnia, H. and Yen, N.Y. (eds.) Advanced Multimedia and Ubiquitous Engineering, pp. 495–501. Springer, Berlin, Heidelberg.
33 Foozy, C.F.M., Ahmad, R. and Abdollah, M.F. (2013) Phishing detection taxonomy for mobile device. Int. J. Comput. Sci. Issues (IJCSI), 10, 338–344.
34 Pandey, M. and Ravi, V. (2012) Detecting phishing e-mails using text and data mining. 2012 IEEE Int. Conf.
Computational Intelligence and Computing Research, December, pp. 1–6. IEEE.
35 Joo, J.W., Moon, S.Y., Singh, S. and Park, J.H. (2017) S-Detector: an enhanced security model for detecting smishing attack for mobile computing. Telecommun. Syst., 66, 1–10.
36 El-Alfy, E.-S.M. (2017) Detection of phishing websites based on probabilistic neural networks and k-medoids clustering. Comput. J., 60, 1–15.
37 Pinterest (2017) Smishing dataset (accessed 2017).
38 Shams, R. and Mercer, R.E. (2013) Classifying spam emails using text and readability features. 2013 IEEE 13th Int. Conf. Data Mining (ICDM), pp. 657–666. IEEE.
39 Han, Y. and Shen, Y. (2016) Accurate spear phishing campaign attribution and early detection. Proc. 31st Annu. ACM Symp. Applied Computing, pp. 2079–2086. ACM.
40 Smith, E.A. and Senter, R. (1967) Automated readability index. Technical report.
41 Flesch, R. (1948) A new readability yardstick. J. Appl. Psychol., 32, 221.
42 Gunning, R. (1952) The Technique of Clear Writing. McGraw-Hill, New York.
43 Mc Laughlin, G.H. (1969) SMOG grading – a new readability formula. J. Read., 12, 639–646.
44 Coleman, M. and Liau, T.L. (1975) A computer readability formula designed for machine scoring. J. Appl. Psychol., 60, 283.
45 Bird, S., Klein, E. and Loper, E. (2009) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media.
46 Bird, S. and Loper, E. (2004) NLTK: the Natural Language Toolkit. Proc. ACL 2004 on Interactive Poster and Demonstration Sessions, Stroudsburg, PA, USA, ACLdemo '04. Association for Computational Linguistics.
47 Duman, S., Kalkan-Cakmakci, K., Egele, M., Robertson, W. and Kirda, E. (2016) EmailProfiler: spearphishing filtering with header and stylometric features of emails.
Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, pp. 408–416. IEEE.
48 Keretna, S., Hossny, A. and Creighton, D. (2013) Recognising user identity in Twitter social networks via text mining. 2013 IEEE Int. Conf. Systems, Man, and Cybernetics, October, pp. 3079–3082. IEEE.
49 Akinyelu, A.A. and Adewumi, A.O. (2014) Classification of phishing email using random forest machine learning technique. J. Appl. Math., 2014, 6.
50 Abu-Nimeh, S., Nappa, D., Wang, X. and Nair, S. (2007) A comparison of machine learning techniques for phishing detection. Proc. Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, New York, NY, USA, eCrime '07, pp. 60–69. ACM.
51 Ho, T.K. (1995) Random decision forests. Proc. Third Int. Conf. Document Analysis and Recognition, 1995, pp. 278–282. IEEE.
52 Basnet, R.B. and Sung, A.H. (2010) Classifying phishing emails using confidence-weighted linear classifiers. Int. Conf. Information Security and Artificial Intelligence (ISAI), pp. 108–112. IEEE.
53 Toolan, F. and Carthy, J. (2009) Phishing detection using classifier ensembles. 2009 eCrime Researchers Summit, September, pp. 1–9. IEEE.
54 Fette, I., Sadeh, N. and Tomasic, A. (2007) Learning to detect phishing emails. Proc. 16th Int. Conf. World Wide Web, New York, NY, USA, WWW '07, pp. 649–656. ACM.
55 Huang, H., Qian, L. and Wang, Y. (2012) An SVM-based technique to detect phishing URLs. Inf. Technol. J., 11, 921–925.
56 Yearwood, J., Mammadov, M. and Banerjee, A. (2010) Profiling phishing emails based on hyperlink information. 2010 Int. Conf. Advances in Social Networks Analysis and Mining, August, pp. 120–127. IEEE.
57 Islam, R. and Abawajy, J. (2013) A multi-tier phishing detection and filtering approach. J. Netw. Comput. Appl., 36, 324–335.
58 Ramanathan, V. and Wechsler, H.
(2012) phishGILLNET – phishing detection methodology using probabilistic latent semantic analysis, AdaBoost, and co-training. EURASIP J. Inf. Secur., 2012, 1.
59 Ramanathan, V. and Wechsler, H. (2013) Phishing detection and impersonated entity discovery using conditional random field and latent Dirichlet allocation. Comput. Secur., 34, 123–139.
60 Benesty, J., Chen, J., Huang, Y. and Cohen, I. (2009) Pearson correlation coefficient. Noise Reduction in Speech Processing, pp. 1–4. Springer, Berlin, Heidelberg.
61 Inomata, A., Rahman, M., Okamoto, T. and Okamoto, E. (2005) A novel mail filtering method against phishing. PACRIM 2005 IEEE Pacific Rim Conf. Communications, Computers and Signal Processing, August, pp. 221–224. IEEE.
62 Guyon, I. and Elisseeff, A. (2003) An introduction to variable and feature selection. J. Mach. Learn. Res., 3, 1157–1182.
63 Guyon, I., Nikravesh, M., Gunn, S. and Zadeh, L.A. (2006) An introduction to feature extraction. Feature Extraction: Foundations and Applications. Springer, Berlin, Heidelberg.
64 Almeida, T.A. (2017) Ham and spam dataset (accessed 2017).
65 Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A. and Naik, V. (2011) SMSAssassin: crowdsourcing driven mobile-based system for SMS spam filtering. Proc. 12th Workshop on Mobile Computing Systems and Applications, New York, NY, USA, HotMobile '11, pp. 1–6. ACM.
66 Mohd, F., Bakar, Z.A., Noor, N.M.M., Rajion, Z.A. and Saddki, N. (2015) A hybrid selection method based on HCELFS and SVM for the diagnosis of oral cancer staging. In Sulaiman, H.A., Othman, M.A., Othman, M.F.I., Rahim, Y.A. and Pee, N.C. (eds.) Advanced Computer and Communication Engineering Technology, pp. 821–831. Springer International Publishing, Cham.
67 Hassan, D. (2015) On determining the most effective subset of features for detecting phishing websites. Int. J. Comput. Appl.
, 122, 1–7.
68 Wang, W.-L., Liu, P.-Y. and Liu, K.-F. (2006) Feature selection algorithm in email classification. Jisuanji Gongcheng yu Yingyong (Comput. Eng. Appl.), 42, 122–124.

Footnotes
1 https://www.cnet.com/news/protect-yourself-from-smishing-video/

Author notes
Handling editor: Steven Furnell
© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
The Computer Journal, Oxford University Press

ISSN: 0010-4620
eISSN: 1460-2067
DOI: 10.1093/comjnl/bxy039

With the exponential growth of text messages, violations such as phishing messages and spam are also increasing.
It has been proven by various studies that, apart from being annoying to users, these lead to a vast amount of monetary harm for both people and organizations [3–6].

Figure 1. Number of text messages (in billions).

Smishing is a variant of phishing in which smishers (attackers who use SMS for phishing) send text messages to the victim's smartphone that appear similar to genuine messages. Many users fall prey to these messages and disclose sensitive credentials such as user IDs and passwords [7–9]. The Mobile Messaging Fraud Report reveals that 28% of SMS users receive an unsolicited text message every day [10]. According to security firm Cloudmark, around 30 million smishing messages are sent to mobile users across North America and Europe.1 Most attackers favor text messages over e-mails because text messages are cheaper, and with a tiny SMS package they can send a large number of messages to victims [11]. Moreover, text messages have higher response rates than e-mails: according to business2community, the average open rate of a text message is approximately 99%, compared with 28% to 33% for e-mail, and the click rate of an included link is approximately 36% for text messages, while for e-mail it is between 6% and 7% [12]. In addition, it is considerably harder to distinguish whether a URL is legitimate or smishing on the comparatively small display of a mobile phone [13]. A smishing attack broadly involves either of the following two activities [14–16]:

Phone conversation: the attackers send a smishing message to the victims concerning purchase points of interest, exchanges, discounts and the like, along with a phone number; when the victims respond via the phone number, the attackers request their credentials.
Embedded URL: in this technique, the attackers insert a URL carrying malicious code into the text messages and send them to the victims. Once a victim views the message and clicks the URL, the malicious code is installed on the victim's phone. In addition, some attackers embed a phishing website URL in the text messages. The literature reveals that only a few strategies concentrate solely on analyzing smishing messages and providing effective techniques to detect them. Among the existing techniques, most researchers prefer whitelist [17–19] and blacklist approaches [20, 21] to combat phishing attacks. However, the major drawback of the blacklist is that it takes too long to update the list [22]: phishing SMS campaigns accomplish their task very quickly and then remove their links. The whitelist is somewhat useful for real-time operation; its major problem, however, is that any number not on the whitelist is regarded as sending phishing SMS [23]. Hence, we propose a model based on a heuristics approach [24, 25], which is fast and has a low false-negative rate. In the heuristic-based approach, the model extracts discriminating features from both legitimate SMS and smishing SMS and builds a training dataset. When a user receives a new SMS, the model compares the SMS with the training dataset using a machine learning algorithm and predicts whether the message is smishing or legitimate. However, the major drawback of this approach is the size of the feature space: there is a large number of features, some relevant and some irrelevant to the particular task. Therefore, the motivation of this paper is to develop a model that detects smishing messages by extracting specific features from them; a feature selection method is then applied to filter out the less relevant features from the feature corpus.
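The heuristic workflow just described (extract features from labelled SMS, build a training set, classify an incoming message) can be outlined as below; the two features, the tiny training set and the nearest-neighbour rule are simplified placeholders for the paper's 39 features and machine learning classifiers:

```python
# Illustrative heuristic pipeline: featurize labelled SMS, then classify a
# new message by 1-nearest-neighbour over the binary feature vectors.
def features(msg):
    msg = msg.lower()
    has_url = int("http" in msg)
    has_lure_word = int(any(w in msg for w in ("win", "prize", "free")))
    return (has_url, has_lure_word)

# Toy labelled training data (invented examples).
training = [
    ("Free prize! Claim at http://scam.example", "smishing"),
    ("Win cash now http://bad.example", "smishing"),
    ("Meeting moved to 3pm", "legitimate"),
    ("Can you pick up milk?", "legitimate"),
]

def predict(msg):
    fv = features(msg)
    hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
    return min(training, key=lambda ex: hamming(features(ex[0]), fv))[1]

print(predict("You have won a free gift, visit http://phish.example"))  # smishing
```

In the actual model the nearest-neighbour rule is replaced by the trained classifiers of Section 3.2, and the feature extractor by the full 39-feature set.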
Stopping the phishing attack at the message level instead of at the website level is a prevention strategy. For a website-level strategy to work, the user must first click the link, whereas many links in attack messages deliver a malicious file instead of the actual website; on clicking such a link, the malware is installed automatically on the user's mobile. Hence, stopping phishing at the message level is a better solution than detection at the website level. The major objectives of this paper are as follows: To detect smishing messages with the proposed anti-smishing model entitled SmiDCA. To extract several features from the smishing messages using text mining, NLP and readability algorithms. To select relevant features from these features using the correlation algorithm. To provide a prototype of the model, implemented as a real-time app on the machine, from the provider's point of view. To search for the effective feature set from the relevant features by incorporating learning algorithms along with the search algorithm. To validate the model with experiments on both English and non-English datasets. The remainder of this paper is organized as follows: Section 2 presents an overview of related work. The proposed model is explained in Section 3. Section 4 shows the results of the experiments. Section 5 discusses the outcome of the experimental results. Section 6 summarizes the paper. 2. RELATED WORK Smishing has gained increased attention among researchers in recent years [26, 27]. Many studies have emphasized the importance of awareness education to combat smishing messages [13, 28]. A few studies have proposed models to detect smishing messages.
The short URL is a novel phishing or smishing vector: users are unable to perceive the features of the linked information, and it is exceptionally hard to verify which file or web page a short URL leads to. A method was proposed [29] that composes the destination information when generating a short URL so that the user can verify whether the destination is a web page or a file. On analyzing the web page, the method measures and evaluates the risk of the page and decides whether to block the short URL according to a threshold, thereby preventing attacks. Several researchers have provided multi-filter models by amalgamating multiple approaches to prevent phishing. One recent model, 'PhiDMA', was proposed by [25]; it incorporates five layers: an auto-upgrade whitelist layer, a URL features layer, a lexical signature layer, a string matching layer and an accessibility score comparison layer. Moreover, the authors developed a prototype of this model with an accessible interface that assists persons with visual impairments. The experimental results show that the model achieved an accuracy of 92.72%. Another study was carried out by Baek et al. [30] on the basis of time periods during the day; it found that the peaks of spam messages were sent at 10 am and 4 pm. They likewise studied word frequency and showed that the words candidate, congressmen, election, candidate number and information occurred with very high frequency. Finally, they analyzed the contents of the spam sent by each spammer in order to distinguish smishing: they searched for the keyword associated with the URL and, if the keyword was found, the message was regarded as smishing. In another study to detect smishing messages [31], the authors collected seven features such as words, size, misspelled words, part-of-speech, and the presence of a phone number and a URL.
The experimental results show that the Random Forest (RF) classifier secured an accuracy of 92%. A cloud-based virtual environment was employed to detect smishing messages or unknown malware [32]. The authors identified suspicious URLs by checking whether they pointed to a downloadable APK file or an application without a source. The method likewise increases the likelihood of smishing identification by utilizing filtering. A mobile-phone-specific anti-phishing solution was presented by Foozy et al. [33]. They identified the important criteria for improving techniques to combat phishing attacks on mobile devices, noting that attackers employ mobile-device phishing channels such as Bluetooth phishing, smishing, vishing and mobile web application phishing. In addition, Pandey et al. [34] presented a text-mining and data-mining-based technique which extracted 23 keywords from phishing and non-phishing e-mails. Subsequently, they selected 12 keywords using t-statistic-based feature selection and conducted experiments with multiple machine learning classifiers. One of the most recent methods, named S-Detector (Smishing Detector), was proposed to differentiate normal text messages from smishing messages [35]. This model employed a morphological analyzer to extract the nouns frequently used in smishing text messages, and a Naïve Bayes classifier was used for filtering. The experimental results show that this model provides security, availability and reliability in preventing more intelligent and more malicious security threats. A novel approach based on probabilistic neural networks (PNNs) and K-medoids clustering [36] was proposed to detect phishing websites. Principal component analysis (PCA) was applied to reduce the dimensionality of the feature space.
Finally, the model carried out the experiment with 30 features, and the results show that it achieved nearly a 40% reduction in complexity with approximately 97% accuracy. In our proposed anti-smishing model SmiDCA, 39 specific features were extracted by analyzing smishing messages. Well-known machine learning algorithms were applied to the feature set built from these features to evaluate the performance. In addition, the model incorporates a feature selection algorithm to reduce the dimension of the features, and the performance before and after feature selection is compared. From experiments on a non-English text message dataset, it can be concluded that the model can detect non-English smishing messages as well. 3. METHODOLOGY The methodology of SmiDCA is shown in Fig. 2. The model initially investigates the data and extracts the distinct features. Subsequently, the model ranks all the features using a correlation algorithm and generates a subset by adding the highest-ranked features one by one and sending them to the learning algorithm. The learning algorithm evaluates the accuracy and verifies whether it is better than the previous accuracy. If the accuracy has increased, the next highest-ranked feature is added to the subset; otherwise, the process terminates. Figure 2. Methodology of SmiDCA. 3.1. Feature analysis In this section, the model analyzes the smishing messages, which were collected from different sources such as the publicly available smishing dataset [37]: Words features: The word features deal with the words which are frequently employed by smishers to cheat the victims. The model carried out the following steps to find the frequent words: The model tokenized the words using white space. Converted all the words into lower case.
Eliminated the stop words such as a, an, is, etc. Employed the term-frequency algorithm to find the frequent words, as shown in the following equation: tf(t,d) = f(t,d) / |d| (1) where f(t,d) is the frequency of term t in document d and |d| is the total number of terms in d. The model evaluates the word-list features by taking the top 20 keywords, as shown in Table 1. URL feature: One of the stronger characteristics of a smishing message is that smishers embed a URL in the message so that the victims visit a phishing site that is comparable with the genuine site, as shown in Fig. 3 [37]. In the analysis, the model found that URLs were present in 53.68% of smishing messages. E-mail-id feature: Another essential feature for perceiving a smishing message is that smishers send the text message along with an e-mail-id so that they can ask for credentials through e-mail, as shown in Fig. 4 [37]. The smishing e-mail-id is designed to look like a genuine e-mail-id so that victims are convinced. The model initially searches for the special character ('@') in the document, as every e-mail-id contains this character, and then examines the prefix and suffix of the character. In the examination, the model found that 11.57% of smishing messages contained an e-mail-id. Phone number feature: A telephone number is also a noteworthy feature of a smishing message. Often, smishers send the text message with a telephone number through which they can ask for credentials from the victims, as shown in Fig. 5 [37]. The model initially looks for numeric characters in the document. If no numeric character is available, it terminates; otherwise, it looks for a phone-number pattern in the numeric characters. Size feature: One of the basic features of smishing messages is their size. Figure 6 shows the histogram of the sizes of smishing and legitimate messages.
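The word-frequency steps above — tokenize on white space, lower-case, drop stop words and apply the term frequency of Equation (1) — can be sketched as follows (the stop-word list here is a tiny illustrative stand-in, not the one used in the paper):

```python
from collections import Counter

STOP_WORDS = {'a', 'an', 'is', 'the', 'to', 'and'}  # illustrative subset only

def term_frequencies(document: str) -> dict:
    """Equation (1): tf(t, d) = f(t, d) / |d| after tokenizing and cleaning."""
    tokens = [w.lower() for w in document.split()]       # tokenize + lower-case
    tokens = [w for w in tokens if w not in STOP_WORDS]  # remove stop words
    total = len(tokens)
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

def top_keywords(documents, k=20):
    """Rank terms across documents by summed term frequency (cf. Table 1)."""
    agg = Counter()
    for d in documents:
        agg.update(term_frequencies(d))
    return [term for term, _ in agg.most_common(k)]
```

With k=20 this yields a keyword list of the kind shown in Table 1.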
Special character feature: Most legitimate organizations exclude special characters from their messages. However, most smishing messages contain a special character such as '$' (e.g. '$100') or '£'. The model initially searches for a special character in the document. If none is available, it terminates; otherwise, the model examines whether any numeric character succeeds the special character. If so, this is regarded as a feature value. Readability text feature: Readability measures the ease of understanding English text. Companies and organizations employ their own writing style and, most of the time, trained content writers verify the messages before they are sent to customers. Several researchers have employed readability as a feature to identify phishing or spam e-mails [38, 39]. In this paper, the model adopts the following six important readability algorithms: Automated readability index: The automated readability index is used to calculate a readability score based on the understandability of English text [40]. Its equation is: ARI = 4.71(C/W) + 0.5(W/S) − 21.43 (2) where C is the number of characters and digits, W is the number of words (i.e. the number of space-delimited tokens) and S is the number of sentences. Flesch–Kincaid readability test: Rudolf Flesch developed the Flesch–Kincaid readability test, which indicates how difficult an English text is to understand [41]. Two tests are conducted: the Flesch–Kincaid Grade Level and the Flesch Reading-Ease Score. The equation of the Flesch–Kincaid Grade Level is: FKGL = 0.39(W/S) + 11.8(Sy/W) − 15.59 (3) where W is the total number of words, S the total number of sentences and Sy the total number of syllables.
The Flesch Reading-Ease Score (FRES) test is given by: FRES = 206.835 − 1.015(W/S) − 84.6(Sy/W) (4) where W is the total number of words, S the total number of sentences and Sy the total number of syllables. Gunning Fog Index: Robert Gunning, an American businessman, developed this readability test [42]. The following steps are used to calculate the Gunning fog index: Select a passage of around 100 words without omitting any sentences. Determine the average sentence length (divide the number of words by the number of sentences). Count the 'complex' words of three or more syllables. Add the average sentence length and the percentage of complex words. Multiply the result by 0.4. The equation of the Gunning Fog Index is: GFI = 0.4[(W/S) + 100(CW/W)] (5) where W is the number of words, S the number of sentences and CW the number of complex words. SMOG Index: McLaughlin [43] developed the SMOG index, which is widely used for checking health messages. The SMOG readability score is: SMOG = 1.0430 √(P × 30/S) + 3.1291 (6) where P is the number of polysyllables and S the number of sentences. Coleman–Liau Index (CLI): Coleman and Liau [44] developed the CLI to calculate a readability score: CLI = 0.0588L − 0.296S − 15.8 (7) where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. Misspelled word feature: The misspelled-word feature is also included in our feature corpus because a majority of smishers send misspelled messages to the victims. For this feature, the model first tokenizes on the white space between words and then removes special characters such as {?, #, …} from the document. As our smishing dataset contains American English messages, the model verifies the words against an American English dictionary.
This feature is computed using the following equation: C(m,d) = f(m,d) / t(w,d) (8) where C(m,d) computes the proportion of misspelled words (m) in the document (d), f(m,d) is the frequency of misspelled words in the document and t(w,d) is the total number of words in the document. Part-of-speech tagging features: Part-of-speech tagging is used to categorize words into parts of speech such as noun and verb. The model employed the Natural Language Toolkit (NLTK), a Python package, to identify POS tags [45–48]. Assume W={w1,w2,…wn} are the n words in an original document and P={p1,p2,…pn}, that is {noun, pronoun, …}, are the parts of speech. The model evaluates the POS tag of each word wi, where wi∈W, generates a new tag document T={twp1,twp2,…twpn} and then counts the frequency of the tags in T. The summary of the features is shown in Table 2. Figure 3. Smishing through URL. Figure 4. Smishing through e-mail-id. Figure 5. Smishing through phone number. Figure 6. Size of smishing and legitimate messages. Table 1. Frequency of words in smishing dataset.
Keywords  Frequency  Keywords  Frequency
Please  31.58  Sms  10.53
Account  18.95  Customer  9.47
Card  16.84  E-mail  9.47
Apple  16.84  Details  9.47
Update  16.84  Iphone  8.42
Online  14.74  Bank  8.42
Link  11.58  Message  8.42
Call  10.53  Store  8.42
Today  10.53  Nationwide  8.42
Refund  10.53  Due  8.42
Table 2. Summary of the features.
Features  Feature name  Data-type  Description
⟨f1…f20⟩  Words feature  {0,1}  Smishing messages contain frequent words such as Please, Sms, Account, Customer, Card, E-mail, Apple, Details, Update, Iphone, Online, Bank, Link, Message, Call, Store, Today, Nationwide, Refund and Due
⟨f21⟩  Size of message  Numeric  The sizes of smishing and legitimate messages differ
⟨f22⟩  E-mail-id feature  {0,1}  Most smishing messages contain an e-mail-id to ask for credentials
⟨f23⟩  URL feature  {0,1}  A URL is embedded in smishing messages to lead victims to phishing sites
⟨f24⟩  Phone number feature  {0,1}  A phone number is sent along with the message so that the attackers can contact the victims
⟨f25⟩  Special character  {0,1}  Most smishing messages contain special characters such as $ and £ to attract users
⟨f26⟩  Misspelled word  Numeric  Legitimate brands rarely send messages with misspelled words
⟨f27…f33⟩  Part of speech  Numeric  The part-of-speech distribution (noun, pronoun, adjective, verb, adverb, preposition, conjunction) of smishing messages differs from that of legitimate messages
⟨f34…f39⟩  Readability of text  Numeric  The text styles of smishing and legitimate messages differ
3.2. Machine learning classifiers Several machine learning classification techniques are used to distinguish smishing messages from legitimate messages. The classifiers learn from a set of features, called the training dataset, and predict the output. In this scenario, the model classifies a message as smishing or legitimate by learning the features of smishing and legitimate messages. In this paper, the proposed method initially evaluates the performance of four well-known classifiers, Random Forest [49–52], Decision Tree [53], Support Vector Machine [54–56] and AdaBoost [57–59], on the given dataset. Subsequently, the model selects the classifier with the best performance and applies it in the feature selection algorithm. 3.2.1.
Cross-validation Cross-validation validates the classifiers by partitioning the dataset into complementary subsets, where one subset is chosen as the test dataset and the others as the training dataset. The k-fold cross-validation divides the dataset into k equal subsets; each time, one subset is taken as the test set and the other k−1 subsets are assembled to form the training set. Assume n_k = {c1, c2, … ck} are the K parts. The cross-validation (CV) estimate is: CV_K = (1/K) ∑_{i=1}^{K} E_i (9) where E_i = (1/n_k) ∑_{j∈c_i} (y_j − ŷ_j)² is the mean squared error (MSE) of the ith fold. 3.3. Pearson correlation coefficient The Pearson correlation coefficient (PCC) measures the linear correlation between two features [60, 61]. It yields three categories of correlation: perfect positive linear correlation is 1, no linear correlation is 0 and perfect negative linear correlation is −1. Several researchers have used the PCC to select relevant features [62, 63]. Assume X = {x1, x2, … xn} and Y = {y1, y2, … yn} are two feature vectors. The PCC, denoted ρ, is defined as: ρ(X,Y) = cov(X,Y) / (σX σY) (10) where cov(X,Y) is the covariance of X and Y, and σX and σY are the standard deviations of X and Y: cov(X,Y) = (1/(n−1)) ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) (11) where x̄ and ȳ are the means of X and Y, given by: x̄ = (1/n) ∑_{i=1}^{n} xi (12) The standard deviation σ is defined by: σ(X) = √( ∑_{i=1}^{n} (xi − x̄)² / (n−1) ) (13) Using Algorithm 1, the model computes the ranking scores of the features as explained in Section 4.3. 3.4. SmiDCA prototype model This section provides a description and evaluation of the prototype of SmiDCA. The high-level architecture of the model is shown in Fig. 7, and the specifications of the machine are as follows: Operating system: Ubuntu 17.10 (64-bit); Processor: Intel Core i3 CPU 530; RAM: 4 GB; Disk: 75 GB. The model is developed in Python; hence, the interface is designed using the Tkinter package.
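A minimal sketch of such a Tkinter verification interface follows. The widget layout and the classifier stub are assumptions for illustration, not the prototype's actual code (which is available in the GitHub repository):

```python
import tkinter as tk

def classify(message: str) -> str:
    """Stub standing in for the machine learning classifier (assumption)."""
    return 'smishing' if 'http' in message.lower() else 'legitimate'

def build_client_interface(message: str) -> tk.Tk:
    """Client window with Skip/Verify buttons, in the spirit of Fig. 8(a)."""
    root = tk.Tk()
    root.title('SmiDCA client')
    tk.Label(root, text=message, wraplength=240).pack(padx=10, pady=10)
    verdict = tk.Label(root, text='')
    verdict.pack()
    # Verify sends the message to the classifier and shows the verdict;
    # Skip closes the window and displays the message directly.
    tk.Button(root, text='Verify',
              command=lambda: verdict.config(text=classify(message))).pack(side=tk.LEFT)
    tk.Button(root, text='Skip', command=root.destroy).pack(side=tk.RIGHT)
    return root

if __name__ == '__main__':
    build_client_interface('Your refund is ready: http://refund.example').mainloop()
```

In the real prototype the verdict would come from the trained Random Forest classifier rather than a keyword stub.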
Figure 7. SmiDCA prototype. Figure 7 shows that the model has two interfaces: a client interface and a system interface. The client interface displays the telephone number of the sender, as shown in Fig. 8(a). If the sender's number is already in the user's telephone diary, the user can directly view the message by tapping the Skip button; otherwise, the user can verify the message by tapping the Verify button. On tapping the Verify button, the proposed model receives the message as input and sends it to the machine learning classifier. The classifier returns its verdict on the message, smishing or legitimate, to the system interface. Figure 8(b) shows a message recognized as legitimate in the system interface. Figure 8. (a) Legitimate message and (b) SmiDCA identified it as legitimate. The system interface provides two buttons: a Go Back button and a Proceed button. The Go Back button deletes the message without sending it to the user's inbox, and the Proceed button allows the user to access the message. As the message was identified as legitimate by the proposed model, the user could access it. Subsequently, we tested the model with a smishing message, as shown in Fig. 9(a); the model detected the message as smishing, as shown in Fig. 9(b), and guidance was provided to the user on handling the buttons. The source code of the prototype implementation is uploaded to a GitHub repository and is available at the following URL: https://github.com/gsonowal20/SmiDCA. Figure 9. (a) Smishing message and (b) SmiDCA detected the smishing message.
The objective of the prototype implementation is to demonstrate the SmiDCA approach. In its present state, the user interface of the prototype is not optimized. One important interface-related obstacle indicated in the feedback received from a pilot study is the need to perform two taps (a tap on the Verify button and a tap on either the Go Back or the Proceed button). This two-tap process can be converted into a single-tap process with a change in workflow, which has been pipelined for implementation in the next version. 4. RESULT EVALUATION 4.1. Data collection We collected English SMS data from Almeida [64], as shown in Table 3. Table 3. English text messages. Total SMS: 5578; Smishing SMS: 747; Legitimate SMS: 4831. In addition, we gathered non-English data from Yadav et al. [65], as shown in Table 4. Table 4. Non-English text messages. Total SMS: 1893; Smishing SMS: 898; Legitimate SMS: 995. 4.2. Feature extraction From the smishing messages, the model extracted 39 different features, as shown in Table 2, and employed machine learning algorithms to classify smishing messages from legitimate messages. Therefore, the model generated a matrix where each row corresponds to a message and each column corresponds to a feature. Each cell in the matrix holds the value of the corresponding feature in the corresponding message. Assume that D = {d1, d2, …, dn} denotes the n documents and F = {f1, f2, …, fn} the feature vector space. Let cij be the value of the jth feature of the ith document.
Therefore, each document is represented as Ci = (ci1, ci2, …, cin) and each feature as Cj = (c1j, c2j, …, cnj), where i ranges over the documents in D and j over the features in F. 4.3. Feature selection algorithm There are R-dimensional features in the feature space. Assume that F = {fi | i = 0, 1, 2, … n} is the original set of features, where F ⊆ Rⁿ and F ≠ ∅. The model evaluates the PCC score yi of each feature against the decision attribute d using the following equation: yi ← PCC(fi, d) (14) where fi ∈ F, yi ∈ Y and Y = {y0, y1, y2, … yn}. On the basis of these scores, the model employs the sequential forward algorithm [66–68], shown in Algorithm 1.
Algorithm 1 SmiDCA algorithm
Input: Y = {y0, y1, y2, … yn}, x ← ∅, k ← 0, flag ← 0, Tacc ← 0, Tmax: threshold
Output: Xk = {xj | j = 1, 2, … k; xj ∈ Y}
1  x+ ← max(yi)
2  xk+1 ← xk + x+
3  Cacc ← Acc(xk+1)
4  if Cacc > Tacc then
5    flag ← 0
6    Tacc ← Cacc
7    k ← k + 1
8    Go to step 1
9  else
10   if flag > Tmax then
11     Xk ← Cacc
12     return Xk
13   else
14     flag ← flag + 1
15     Go to step 1
16   end
17 end
The sequential forward algorithm begins with an empty set of
features (x ← ∅) whose subset size is zero (k ← 0) and subsequently adds features one by one. The algorithm adds the feature with the highest linear correlation coefficient score (x+ ← max(yi)); however, its significant downside is deciding where to stop, as there is an extensive number of features in the feature corpus. Therefore, the model incorporates a machine learning algorithm and maintains a threshold (Tmax). The model initially sets the target accuracy score to zero (Tacc ← 0); after adding a new feature to the set (xk), it evaluates the accuracy (Cacc ← Acc(xk+1)). If the current accuracy score exceeds the target accuracy score, the current score is assigned to the target accuracy (Tacc ← Cacc); otherwise, the flag value is increased. If the flag value exceeds the threshold (Tmax), the loop terminates and the best feature set is returned. 4.4. Evaluation metrics The proposed method employs a set of metrics to measure the performance of the four machine learning classifiers. Assume Nham is the number of legitimate messages and Nphish is the number of smishing messages. Four parameters are used to compute the metrics: Nphish→phish = TP, the number of smishing messages correctly classified as smishing; Nham→ham = TN, the number of legitimate messages correctly classified as legitimate; Nham→phish = FP, the number of legitimate messages misclassified as smishing; and Nphish→ham = FN, the number of smishing messages misclassified as legitimate.
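The four counts above can be tallied from a list of true labels and predictions with a short sketch (label names are illustrative), treating smishing as the positive class:

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, TN, FP, FN with 'smish' as the positive class."""
    tp = sum(t == 'smish' and p == 'smish' for t, p in zip(y_true, y_pred))
    tn = sum(t == 'ham' and p == 'ham' for t, p in zip(y_true, y_pred))
    fp = sum(t == 'ham' and p == 'smish' for t, p in zip(y_true, y_pred))
    fn = sum(t == 'smish' and p == 'ham' for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn
```

These counts feed directly into the accuracy, precision, recall and F1-score formulas that follow.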
The four performance metrics are shown below: Accuracy: the overall proportion of correctly classified messages: Accuracy = (TP + TN) / (TP + TN + FP + FN) (15) Precision: the number of true positives (TP) over the number of true positives plus the number of false positives (FP): Precision = TP / (TP + FP) (16) Recall: the number of true positives (TP) over the number of true positives plus the number of false negatives (FN): Recall = TP / (TP + FN) (17) F1-score: the harmonic mean of precision and recall: F1-score = 2 · (Precision · Recall) / (Precision + Recall) (18) 4.5. Experimental results Once feature extraction was completed, the model evaluated the performance under two settings. The first, Before Feature Selection Algorithm (BFSA), uses all the features without separating relevant from irrelevant ones. The second, After Feature Selection Algorithm (AFSA), employs the feature selection algorithm. The model evaluated the performance using 10-fold cross-validation (Section 3.2.1) of the four well-known machine learning classifiers explained in Section 3.2. First, the model evaluated the performance of BFSA; the results are shown in Table 5. The BFSA results show that the Random Forest classifier achieved the best accuracy among the classifiers. Therefore, the Random Forest classifier was chosen for the second setting, AFSA, and a comparison was made between BFSA and AFSA based on the feature dimension and performance with the same learning algorithm. Table 5. Performance of the BFSA.
Classifiers                         Precision  Recall  F1-score  Accuracy
Random Forest classifier                94.71   77.40     85.10     96.40
Decision Tree classifier                81.23   80.88     81.00     94.91
AdaBoost classifier                     86.98   80.75     83.68     95.79
Support vector machine classifier       87.78   78.21     82.64     95.59

In AFSA, the model selected relevant features using the correlation algorithm explained in Section 3.3. Subsequently, the model employed the sequential forward algorithm (Algorithm 1) to ascertain the best feature subset. Features were added in order of relevance, and the process stopped at the point where no further improvement occurred; beyond that point, accuracy decreased continuously. Figure 10 shows the performance of AFSA: the x-axis shows the feature additions, and the y-axis shows the performance of the corresponding feature subsets.

Figure 10. AFSA performance.
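The 10-fold cross-validation used to score each classifier in both BFSA and AFSA can be sketched as a small harness. Here `fit` and `predict` are hypothetical callables standing in for any of the four classifiers, so the sketch stays self-contained; it is not the authors' implementation.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_val_accuracy(fit, predict, X, y, k=10):
    """Average accuracy of one classifier over k folds (Section 3.2.1)."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

In a BFSA-style run, each of the four classifiers would be passed through `cross_val_accuracy` on the full 39-feature representation and compared on accuracy, as in Table 5.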
The experiment shows that BFSA used all 39 features to reach 96.40% accuracy, whereas AFSA selected only 20 features and achieved 96.16% (Table 6). Both algorithms thus achieved similar accuracy, with BFSA slightly ahead. The rate of reduction in feature dimensionality is given by the following equation:

  Rate of reduction = ((Tf − Tr) / Tf) × 100   (19)

where Tf is the original number of features and Tr is the number of features retained by the feature selection algorithm. Using this equation, the rate of reduction = ((39 − 20) / 39) × 100 = 48.72%. Therefore, the feature selection algorithm reduced the feature dimensionality by 48.72%.

Table 6. Comparison between BFSA and AFSA.

Feature selection algorithm  Features selected for classification              Number of features  Accuracy (%)
BFSA                         {f1, f2, f3, …, f38, f39}                                        39         96.40
AFSA                         {f24, f39, f15, f34, f38, f29, f23, f1, f3, f9,                  20         96.16
                             f30, f17, f32, f4, f36, f27, f37, f21, f33, f14}

The model computed the precision and the recall of both BFSA and AFSA.
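Equation (19) is a one-line computation; a minimal sketch with the paper's feature counts:

```python
def reduction_rate(t_f, t_r):
    """Eq. (19): percentage reduction from t_f original features to t_r retained features."""
    return (t_f - t_r) / t_f * 100

# 39 features before selection, 20 retained after selection
print(round(reduction_rate(39, 20), 2))  # 48.72
```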
The F1-measure was then used to evaluate efficiency, as shown in Fig. 11. Although the accuracy of BFSA is marginally ahead of AFSA, the efficiency of AFSA is better than that of BFSA.

Figure 11. Efficiency of both algorithms.

4.6. Performance on the non-English dataset

Figure 12 shows the performance of the proposed model on a standard Indian-language dataset [65]. In this experiment, the model achieved an accuracy of 90.33% with the Random Forest algorithm, slightly lower than on the English-language dataset. A comparison of the performance on the English and non-English datasets is illustrated in Fig. 13. It can be inferred from these metrics that the proposed model has the potential to handle non-English languages as well, at an acceptable accuracy level.

Figure 12. Non-English text-message performance.

Figure 13. Comparison between English and non-English performance.

5. DISCUSSION

This paper has proposed a novel anti-smishing model, SmiDCA, to detect smishing messages. The primary objective of the proposed model is to separate smishing messages from legitimate messages. The model examined the smishing messages and extracted 39 features, and machine learning models were constructed both with and without dimensionality reduction. The correlation algorithm computed the linear correlation score of every feature and ranked the features from highest to lowest score. Because several features shared high correlation scores, the model employed the sequential forward feature selection algorithm to select the best feature subset.
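The correlation scoring used for this ranking can be sketched in a few lines. `pearson` follows the standard Pearson coefficient [60]; `rank_features` is a hypothetical helper name for the ranking step, not the authors' implementation:

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between a feature column and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(columns, labels):
    """Indices of feature columns ordered by |r| with the class label, highest first."""
    return sorted(range(len(columns)),
                  key=lambda i: abs(pearson(columns[i], labels)),
                  reverse=True)
```

A feature that tracks the smishing/ham label receives a high absolute score and is placed early in the ranking, which is exactly the order in which the sequential forward step consumes features.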
The sequential forward feature selection adds features one by one. However, this raises the question of an upper limit on the number of features to add to the set. To overcome this, the model employed a machine learning algorithm with a threshold value, as explained in Section 4.3. Our experiment started by incrementally adding features. With three features the accuracy was 88.98%; adding three more gave 93.44% (six features). Continuing in this manner, accuracy rose to 93.69% (8 features), 94.34% (12 features), 94.71% (15 features), 95.64% (18 features) and finally 96.16% with 20 features. Adding further features reduced accuracy: 96.09% (21 features), 96.02% (22 features) and 95.82% (23 features). At that point we terminated the search, following the model's policy that the expansion process stops once the accuracy stays the same or decreases for three consecutive iterations. The experiment showed that the 20-feature subset yielded the best accuracy for AFSA. The comparison between BFSA and AFSA nevertheless shows that BFSA's accuracy was slightly better, while it used 39 features against AFSA's 20. Hence, the experiments indicate that more features are not necessary for better accuracy; what matters is the most effective combination of features. In addition, the feature selection algorithm reduced the feature dimensionality by 48.72% with the aid of the correlation algorithm and the sequential forward technique for dimensionality reduction.
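The incremental expansion and three-strike stopping policy described above can be sketched as follows. `rank` and `accuracy_of` are hypothetical stand-ins for the correlation-ordered feature list and the 10-fold cross-validation score; this is an illustrative reading of Algorithm 1, not the authors' code:

```python
def forward_select(rank, accuracy_of, t_max=3):
    """Add features in rank order; stop after t_max consecutive non-improving additions."""
    selected = []                    # x <- empty set, k <- 0
    best, t_acc, flag = [], 0.0, 0   # T_acc <- 0
    for feature in rank:             # x+ <- next-highest correlation score
        selected = selected + [feature]
        c_acc = accuracy_of(selected)        # C_acc <- Acc(x_{k+1})
        if c_acc > t_acc:
            t_acc, best, flag = c_acc, list(selected), 0  # T_acc <- C_acc
        else:
            flag += 1                        # no improvement this iteration
            if flag >= t_max:                # flag exceeded threshold T_max
                break
    return best, t_acc
```

Run against a score curve that peaks and then declines, the loop keeps the subset at the peak and returns it once three additions in a row fail to improve, mirroring the 20-feature stopping point reported above.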
Additionally, in terms of efficiency, AFSA offered better performance than BFSA. Moreover, the model was also evaluated on non-English data, achieving an accuracy of 90.33% with the Random Forest algorithm, compared with 96.40% on the English dataset. To verify the performance of the model, a comparative analysis between the English and non-English datasets was provided, from which it was inferred that the model can handle both.

6. CONCLUSION

Smishing is a critical cyber attack that has been increasing in recent years. This paper therefore proposed a novel anti-smishing model, SmiDCA, which analyzed smishing messages and extracted 39 features to detect them. Four well-known machine learning algorithms were applied to classify smishing messages against legitimate messages. The experimental results show that the model achieved an accuracy of 96.40% with the Random Forest classifier in BFSA. On the non-English dataset, the model achieved an accuracy of 90.33%, again with the Random Forest algorithm. From these results, it can be concluded that the model has the capacity to deal with both English and non-English datasets. The model reduced the English-dataset feature dimensionality by 48.72%, with an AFSA accuracy of 96.16%. The efficiency comparison also shows that AFSA provided better efficiency than BFSA. In future work, we will add more features and use novel feature selection algorithms to enhance accuracy. Deep-learning-based approaches will also be explored, and the user interface of the prototype can be enhanced further.

REFERENCES

1 Michael (2016) eretailers: Text consumers or risk irrelevancy in 2017 (accessed 2017).
2 Statisticbrain (2017) Text message statistics (accessed 2018).
3 Silva, R.M., Alberto, T.C., Almeida, T.A. and Yamakami, A. (2017) Towards filtering undesired short text messages using an online learning approach with semantic indexing. Expert Syst. Appl., 83, 314–325.
4 Hong, T., Choi, C. and Shin, J. (2018) CNN-based malicious user detection in social networks. Concurr. Comput. Pract. Exp., 30, e4163.
5 Mun, H.-J. and Han, K.-H. (2016) Blackhole attack: user identity and password seize attack using honeypot. J. Comput. Virol. Hacking Tech., 12, 185–190.
6 Sonowal, G., Kuppusamy, K.S. and Kumar, A. (2017) Usability evaluation of active anti-phishing browser extensions for persons with visual impairments. 2017 4th Int. Conf. Advanced Computing and Communication Systems (ICACCS), January, pp. 1–6. IEEE.
7 McAfee (2012) Protect yourself from smishing (accessed 2017).
8 genisyscu (2017) Smishing – text messaging scams (accessed 2017).
9 Hiremath, R., Malle, M. and Patil, P. (2016) Cellular network fraud & security, jamming attack and defenses. Proc. Comput. Sci., 78, 233–240.
10 Clxcommunications (2016) Mobile message fraud report (accessed 2017).
11 Delany, S.J., Buckley, M. and Greene, D. (2012) SMS spam filtering: methods and data. Expert Syst. Appl., 39, 9899–9908.
12 Baglia, M. (2015) Text marketing vs. email marketing: which one packs a bigger punch (accessed 2017).
13 Canova, G., Volkamer, M., Bergmann, C., Borza, R., Reinheimer, B., Stockhardt, S. and Tenberg, R. (2015) Learn to spot phishing URLs with the Android NoPhish app. IFIP World Conf. Information Security Education, pp. 87–100. Springer.
14 Kang, A., Dong Lee, J., Kang, W.M., Barolli, L. and Park, J.H.
(2014) Security considerations for smart phone smishing attacks. In Jeong, H.Y., Obaidat, M.S., Yen, N.Y. and Park, J.J.J.H. (eds) Advances in Computer Science and its Applications, pp. 467–473. Springer, Berlin, Heidelberg.
15 Shahriar, H., Klintic, T. and Clincy, V. (2015) Mobile phishing attacks and mitigation techniques. J. Inf. Secur., 6, 206.
16 Moon, S.-h. and Park, D.-w. (2016) Forensic analysis of MERS smishing hacking attacks and prevention. Int. J. Secur. Appl., 10, 181–192.
17 Mahmoud, T.M. and Mahfouz, A.M. (2012) SMS spam filtering technique based on artificial immune system. IJCSI Int. J. Comput. Sci. Issues, 9, 589–597.
18 Belabed, A., Aïmeur, E. and Chikh, A. (2012) A personalized whitelist approach for phishing webpage detection. 2012 Seventh Int. Conf. Availability, Reliability and Security, August, pp. 249–254. IEEE.
19 Kang, J. and Lee, D. (2007) Advanced white list approach for preventing access to phishing sites. 2007 Int. Conf. Convergence Information Technology (ICCIT 2007), November, pp. 491–496. IEEE.
20 Sharifi, M. and Siadati, S.H. (2008) A phishing sites blacklist generator. 2008 IEEE/ACS Int. Conf. Computer Systems and Applications, March, pp. 840–843. IEEE.
21 Prakash, P., Kumar, M., Kompella, R.R. and Gupta, M. (2010) PhishNet: predictive blacklisting to detect phishing attacks. 2010 Proc. IEEE INFOCOM, March, pp. 1–5. IEEE.
22 Gastellier-Prevost, S., Granadillo, G.G. and Laurent, M. (2011) Decisive heuristics to differentiate legitimate from phishing sites. 2011 Conf. Network and Information Systems Security, May, pp. 1–9. IEEE.
23 Cao, Y., Han, W. and Le, Y. (2008) Anti-phishing based on automated individual white-list. Proc. 4th ACM Workshop on Digital Identity Management (DIM '08), New York, NY, USA, pp. 51–60. ACM.
24 Mohammad, R.M., Thabtah, F. and McCluskey, L. (2014) Intelligent rule-based phishing websites classification. IET Inf. Secur., 8, 153–160.
25 Sonowal, G. and Kuppusamy, K. (2017) PhiDMA: a phishing detection model with multi-filter approach. J. King Saud Univ. Comput. Inf. Sci., 29, 1–15.
26 Banu, M.N. and Banu, S.M. (2013) A comprehensive study of phishing attacks. Int. J. Comput. Sci. Inf. Technol., 4, 783–786.
27 Kang, J.W., Lee, A.R. and Kim, B. (2016) Improving security awareness about smishing through experiment on the optimistic bias on risk perception. J. Korea Inst. Inf. Secur. Cryptology, 26, 475–487.
28 Baslyman, M. and Chiasson, S. (2016) 'Smells phishy?': an educational game about online phishing scams. 2016 APWG Symp. Electronic Crime Research (eCrime), June, pp. 1–11. IEEE.
29 Mun, H.-J. and Li, Y. (2017) Secure short URL generation method that recognizes risk of target URL. Wireless Pers. Commun., 93, 269–283.
30 Baek, M., Lee, Y. and Won, Y. (2017) Property analysis of SMS spam using text mining. In Park, J.J.J.H., Chen, S.-C. and Raymond Choo, K.-K. (eds) Advanced Multimedia and Ubiquitous Engineering, pp. 67–73. Springer, Singapore.
31 Nair, A.E.S. (2013) Distributed System for Smishing Detection. Southern Methodist University, Dallas, TX.
32 Lee, A., Kim, K., Lee, H. and Jun, M. (2016) A study on realtime detecting smishing on cloud computing environments. In Park, J.J.J.H., Chao, H.-C., Arabnia, H. and Yen, N.Y. (eds) Advanced Multimedia and Ubiquitous Engineering, pp. 495–501. Springer, Berlin, Heidelberg.
33 Foozy, C.F.M., Ahmad, R. and Abdollah, M.F. (2013) Phishing detection taxonomy for mobile device. Int. J. Comput. Sci. Issues (IJCSI), 10, 338–344.
34 Pandey, M. and Ravi, V. (2012) Detecting phishing e-mails using text and data mining. 2012 IEEE Int. Conf.
Computational Intelligence and Computing Research, December, pp. 1–6. IEEE.
35 Joo, J.W., Moon, S.Y., Singh, S. and Park, J.H. (2017) S-Detector: an enhanced security model for detecting smishing attack for mobile computing. Telecommun. Syst., 66, 1–10.
36 El-Alfy, E.-S.M. (2017) Detection of phishing websites based on probabilistic neural networks and k-medoids clustering. Comput. J., 60, 1–15.
37 Pinterest (2017) Smishing dataset (accessed 2017).
38 Shams, R. and Mercer, R.E. (2013) Classifying spam emails using text and readability features. 2013 IEEE 13th Int. Conf. Data Mining (ICDM), pp. 657–666. IEEE.
39 Han, Y. and Shen, Y. (2016) Accurate spear phishing campaign attribution and early detection. Proc. 31st Annu. ACM Symp. Applied Computing, pp. 2079–2086. ACM.
40 Smith, E.A. and Senter, R. (1967) Automated readability index. Technical report.
41 Flesch, R. (1948) A new readability yardstick. J. Appl. Psychol., 32, 221.
42 Gunning, R. (1952) The Technique of Clear Writing. McGraw-Hill, New York.
43 Mc Laughlin, G.H. (1969) SMOG grading – a new readability formula. J. Read., 12, 639–646.
44 Coleman, M. and Liau, T.L. (1975) A computer readability formula designed for machine scoring. J. Appl. Psychol., 60, 283.
45 Bird, S., Klein, E. and Loper, E. (2009) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media.
46 Bird, S. and Loper, E. (2004) NLTK: the Natural Language Toolkit. Proc. ACL 2004 Interactive Poster and Demonstration Sessions (ACLdemo '04), Stroudsburg, PA, USA. Association for Computational Linguistics.
47 Duman, S., Kalkan-Cakmakci, K., Egele, M., Robertson, W. and Kirda, E. (2016) EmailProfiler: spearphishing filtering with header and stylometric features of emails.
Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, pp. 408–416. IEEE.
48 Keretna, S., Hossny, A. and Creighton, D. (2013) Recognising user identity in Twitter social networks via text mining. 2013 IEEE Int. Conf. Systems, Man, and Cybernetics, October, pp. 3079–3082. IEEE.
49 Akinyelu, A.A. and Adewumi, A.O. (2014) Classification of phishing email using random forest machine learning technique. J. Appl. Math., 2014, 6.
50 Abu-Nimeh, S., Nappa, D., Wang, X. and Nair, S. (2007) A comparison of machine learning techniques for phishing detection. Proc. Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit (eCrime '07), New York, NY, USA, pp. 60–69. ACM.
51 Ho, T.K. (1995) Random decision forests. Proc. Third Int. Conf. Document Analysis and Recognition, pp. 278–282. IEEE.
52 Basnet, R.B. and Sung, A.H. (2010) Classifying phishing emails using confidence-weighted linear classifiers. Int. Conf. Information Security and Artificial Intelligence (ISAI), pp. 108–112. IEEE.
53 Toolan, F. and Carthy, J. (2009) Phishing detection using classifier ensembles. 2009 eCrime Researchers Summit, September, pp. 1–9. IEEE.
54 Fette, I., Sadeh, N. and Tomasic, A. (2007) Learning to detect phishing emails. Proc. 16th Int. Conf. World Wide Web (WWW '07), New York, NY, USA, pp. 649–656. ACM.
55 Huang, H., Qian, L. and Wang, Y. (2012) An SVM-based technique to detect phishing URLs. Inf. Technol. J., 11, 921–925.
56 Yearwood, J., Mammadov, M. and Banerjee, A. (2010) Profiling phishing emails based on hyperlink information. 2010 Int. Conf. Advances in Social Networks Analysis and Mining, August, pp. 120–127. IEEE.
57 Islam, R. and Abawajy, J. (2013) A multi-tier phishing detection and filtering approach. J. Netw. Comput. Appl., 36, 324–335.
58 Ramanathan, V. and Wechsler, H.
(2012) PhishGILLNET: phishing detection methodology using probabilistic latent semantic analysis, AdaBoost, and co-training. EURASIP J. Inf. Secur., 2012, 1.
59 Ramanathan, V. and Wechsler, H. (2013) Phishing detection and impersonated entity discovery using conditional random field and latent Dirichlet allocation. Comput. Secur., 34, 123–139.
60 Benesty, J., Chen, J., Huang, Y. and Cohen, I. (2009) Pearson correlation coefficient. Noise Reduction in Speech Processing, pp. 1–4. Springer, Berlin, Heidelberg.
61 Inomata, A., Rahman, M., Okamoto, T. and Okamoto, E. (2005) A novel mail filtering method against phishing. PACRIM 2005 IEEE Pacific Rim Conf. Communications, Computers and Signal Processing, August, pp. 221–224. IEEE.
62 Guyon, I. and Elisseeff, A. (2003) An introduction to variable and feature selection. J. Mach. Learn. Res., 3, 1157–1182.
63 Guyon, I., Nikravesh, M., Gunn, S. and Zadeh, L.A. (2006) An introduction to feature extraction. Feature Extraction: Foundations and Applications. Springer, Berlin, Heidelberg.
64 Almeida, T.A. (2017) Ham and spam dataset (accessed 2017).
65 Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A. and Naik, V. (2011) SMSAssassin: crowdsourcing driven mobile-based system for SMS spam filtering. Proc. 12th Workshop on Mobile Computing Systems and Applications (HotMobile '11), New York, NY, USA, pp. 1–6. ACM.
66 Mohd, F., Bakar, Z.A., Noor, N.M.M., Rajion, Z.A. and Saddki, N. (2015) A hybrid selection method based on HCELFS and SVM for the diagnosis of oral cancer staging. In Sulaiman, H.A., Othman, M.A., Othman, M.F.I., Rahim, Y.A. and Pee, N.C. (eds) Advanced Computer and Communication Engineering Technology, pp. 821–831. Springer International Publishing, Cham.
67 Hassan, D. (2015) On determining the most effective subset of features for detecting phishing websites. Int. J. Comput. Appl.
, 122, 1–7.
68 Wang, W.-L., Liu, P.-Y. and Liu, K.-F. (2006) Feature selection algorithm in email classification. Jisuanji Gongcheng yu Yingyong (Comput. Eng. Appl.), 42, 122–124.

Footnotes

1 https://www.cnet.com/news/protect-yourself-from-smishing-video/

Handling editor: Steven Furnell

© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

The Computer Journal, Oxford University Press. Published: 25 April 2018.
