The unintentional activities of system users can jeopardize the confidentiality, integrity, and availability of data on information systems. These activities, known as unintentional insider threat activities, account for a significant percentage of data breaches. A method to mitigate or prevent this threat is using smart systems or artificial intelligence (AI). The construction of an AI requires the development of a taxonomy of activities. The literature review focused on data breach threats, mitigation tools, taxonomy usage in cybersecurity, and taxonomy development using Endnote and Google Scholar. This study aims to develop a taxonomy of unintentional insider threat activities based on narrative descriptions of the breach events in public data breach databases. The public databases were from the California Department of Justice, US Health and Human Services, and Verizon, resulting in 1850 examples of human errors. A taxonomy was constructed to specify the dimensions and characteristics of objects. Text mining and hierarchical cluster analysis were used to create the taxonomy, indicating a quantitative approach. Ward’s agglomeration coefficient was used to ensure the cluster was valid. The resulting top-level taxonomy categories are application errors, communication errors, inappropriate data permissions, lost media, and misconfigurations.




According to Morgan [1], the expected 2023 annual global cost of cybercrime is approximately $11 trillion. A particular concern for cybersecurity is the “insider threat,” defined by Tsiostas et al. [2] as an individual who has authorized access to organizational assets and information and who acts either maliciously or accidentally in a manner that negatively affects the organization. According to the Ponemon Institute [3], the average cost to resolve all insider threat activities was $15,380,000 per company. The cost per incident from malicious insider threat incidents was $648,062, and the cost per incident for correcting negligent activities was $484,931. The average time to contain an incident was 85 days. Insider threats have been divided into two categories: malicious and unintentional. In its foundational study, the US Computer Emergency Response Team [4] defines the unintentional insider threat as an entity that causes harm or increases the risk of future damage to the confidentiality, integrity, or availability of an information technology enterprise without malicious intent. For 2022, approximately 20% of all data breaches were caused by internal threats and 80% by external threats, with 18% of all breaches caused by unintentional insider actions [5]. Schoenherr and Thomson [6] continue the theme and state that insufficient research has been focused on the unintentional insider threat, with malicious and unintentional threat definitions and mitigations commingled. An approach for dealing with the increasing cyber threat is the use of artificial intelligence (AI), with the current trend to apply artificial intelligence to various cybersecurity problems [7]–[12]. Historically, Chandrasekaran [13] was the first to foresee a need for a primitive taxonomy of terms and ontology of actions for the creation of future expert systems. Therefore, one must have a functional taxonomy and ontology to build an expert system.

A taxonomy, or classification scheme with labels, shows relationships between objects, with the relationship often displayed as a structure [14]. In computer science, an ontology was first described by Gruber [15] as a formal specification that defines the concepts and relationships that exist for an agent or a community of agents. Guarino et al. [16] provided a more rigorous definition using modeling language, often referenced in the literature. Olivares-Alarcos et al. [17] show how ontologies support artificial intelligence in their discussion of comparing different ontology-based methods for robot autonomy. Various taxonomies and ontologies for cybersecurity have been proposed or are used in expert systems [18]–[23]. Canham et al. [24] focused on the causes of unintentional and intentional data breaches, providing an overview of human error research and comparing studies for root causes. They determined it is challenging to decide on sources or devise mitigation strategies without a standardized taxonomy. Meanwhile, because of the lack of a taxonomy, databases and reports do not differentiate the different types of non-malicious activities [5]. Database entries are labelled human error but do not use any further delineations. Text mining can be used to create the taxonomy tree, indicating a quantitative approach. Thus, a hierarchical clustering analysis approach using quantitative methods to create and evaluate the artifact was used in this study.

Related Works

Taxonomies for Insider Threats

There have been attempts at creating taxonomies for insider threats, primarily focused on malicious threats. Chaipa et al. [21] developed a combined taxonomy of insider threats, focusing on malicious threats. They combined previous malicious insider threat taxonomies and created trees based on masqueraders, traitors, explorers, pure insiders, and logically present ones. After surveying the literature on malicious insider threats, Al-Mhiqani et al. [25] considered insider threat taxonomy development a problem for future consideration. Canham et al. [24] similarly concluded that cybersecurity professionals require a taxonomy of employees’ unintentional errors to understand root causes and mitigate risks. Homoliak et al. [26] provided initial work on a taxonomy for unintentional insider threats, creating a structure that consisted of slips and mistakes. There was no further decomposition of slips or mistakes. Yeo and Banfield [27] described their findings from evaluating accidental data breaches contained in the Department of Health and Human Services database of data breaches. The problems include email, misplaced hard drives or documents, and accidental uploads to public databases. Unfortunately, they did not take the next step and turn their findings into a taxonomy. However, their observations were included in the developed taxonomy for this research. According to the CEOs of Mandiant and CybSafe, the largest insider threat risk is unintentional or accidental acts, not malicious ones. Despite this, unintentional threats are underrepresented in the literature and often excluded from the insider threat definition [28].

Ontologies for Insider Threats

Under a contract with IARPA, Greitzer et al. [29] updated and expanded the ontology for insider threats. This work was based on an existing taxonomy of malicious threat activities. However, the effort did not focus on the unintentional insider threat. Unintentional insider threat activities are labeled human error, with no further delineation. By labeling all activities as human error, no mitigations can be defined. Further, although the existing Sociotechnical and Organizational Factors for Insider Threat (SOFIT) ontology class structure provided an initial baseline for a comprehensive description of activities, Greitzer et al. [30] continue to improve it. They included several subject matter experts’ risk assessments to validate the threat assessment model. This approach was necessary since there is little real-world data for malicious insider threat activities, effectively using human observations as a substitute. Canito et al. [31] developed an ontology to improve interoperability between various cybersecurity systems, focusing on critical infrastructure and cyber-physical systems, particularly those systems found at airports. Their work included evaluating the SOFIT model for inclusion in their overall ontology.

Taxonomy Validation

Ralph [32] provided guidance for validating taxonomies. He stated that a taxonomy’s class structure should match the observed data, that a researcher should be able to determine conclusions based on entity class membership, and that a taxonomy should meet its goal. As taxonomies are typically developed using qualitative metrics [33], evaluation is also generally qualitative. The most common approach to verify a taxonomy has been to determine if the taxonomy is germane with new data. Humbatova et al. [34] developed a taxonomy of deep learning systems development problems based on GitHub analysis, Stack Overflow discussions, and interviews with 20 software developers. Their verification of the taxonomy was to interview 20 different software developers to validate the taxonomy’s design and completeness. Lebeuf et al. [35] created a taxonomy for software bots. They validated the taxonomy by comparing it to other classifications of software bots, demonstrating its utility by classifying public bots, and using experts to determine if the taxonomy was complete and correct. Mountrouidou et al. [36] developed a general taxonomy of Internet of Things (IoT) devices, including devices spanning consumer to industrial and healthcare use. Their validation used set theory to establish completeness, timelessness, and precision. Completeness was defined as ensuring that a new device can be placed into at least one leaf of each major branch of the taxonomy tree and that non-IoT devices cannot. Precision was defined as every device belonging to one and only one leaf, and timelessness meant that categories were generalized sufficiently such that all new types of IoT devices could be included.
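The completeness and precision criteria described above can be illustrated with a small set-based check. This is only a simplified sketch, not the validation procedure used in [36]: it treats the taxonomy as a flat set of leaves, ignores the "each major branch" and non-member exclusion conditions, and the leaf names, items, and the function `check_placement` are hypothetical.

```python
def check_placement(assignments, leaves):
    """Report whether every item sits in at least one leaf (completeness)
    and in exactly one leaf (precision)."""
    unplaced = sorted(item for item, placed in assignments.items()
                      if not placed & leaves)
    ambiguous = sorted(item for item, placed in assignments.items()
                       if len(placed & leaves) > 1)
    return {
        "complete": not unplaced,
        "precise": not unplaced and not ambiguous,
        "unplaced": unplaced,
        "ambiguous": ambiguous,
    }


# Hypothetical leaves and items, invented for illustration only.
leaves = {"wearable", "home appliance", "industrial sensor"}
assignments = {
    "fitness band": {"wearable"},
    "smart fridge": {"home appliance"},
    "smart watch": {"wearable", "home appliance"},  # placed in two leaves
}
report = check_placement(assignments, leaves)
```

In this toy input every item has a leaf, so the check reports completeness, but the item placed in two leaves causes the precision check to fail.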

Problem Statement, Hypothesis Statements, and Research Questions

Problem Statement

The problem is that cybersecurity professionals cannot cost-effectively build and maintain comprehensive expert systems to mitigate the threat of accidental data breaches since they do not have a standard taxonomy for unintentional insider threat activities [24].

Hypothesis Statement

It is possible to use text mining and hierarchical clustering analysis to create and maintain a standard taxonomy of unintentional threat activities to allow cybersecurity professionals to build and maintain cost-effective and comprehensive expert systems that mitigate the threat of data breaches.

Research Question

How can hierarchical clustering analysis be applied to ensure both the creation and maintenance of a standard taxonomy for unintentional insider threat activities?


Artifact Creation

Text mining was used to examine the relationships and hierarchies between the descriptions of human errors that cause data breaches. Common text mining tools include word clouds, word frequency counts, and cluster dendrograms [37]. Word clouds are a pictorial representation of the frequency of a word, while word frequency counts are the number of times a word occurs. Cluster dendrograms use hierarchical clustering to identify groups in the dataset. Unlike the more traditional k-means clustering approach, hierarchical clustering does not require a pre-determined number of clusters. Clustering uses the concept of distance to define how similar two elements are to each other. The classic method of distance measure is Euclidean [38]. Hierarchical clustering can be used to create trees known as dendrograms [39]. Thus, Euclidean distances will be used to calculate the hierarchical clusters that will create the dendrograms. The dendrograms can help shape a taxonomy tree.
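As a minimal illustration of the distance concept, the sketch below turns a few breach descriptions into word-count vectors and computes their pairwise Euclidean distances. The descriptions are invented for illustration and the helper names are hypothetical; the study itself used R scripts, so this Python sketch is not the authors' code.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical breach descriptions, invented for illustration only.
docs = {
    "d1": "laptop lost in taxi",
    "d2": "laptop lost at airport",
    "d3": "email sent to wrong recipient",
}

# Shared vocabulary so that every description maps to a vector of equal length.
vocabulary = sorted({word for text in docs.values() for word in text.split()})


def count_vector(text):
    """Word counts of one description, ordered by the shared vocabulary."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = {name: count_vector(text) for name, text in docs.items()}
distances = {
    (p, q): euclidean(vectors[p], vectors[q])
    for p, q in combinations(sorted(vectors), 2)
}

# The closest pair is the first merge an agglomerative method would make.
closest_pair = min(distances, key=distances.get)
```

The two lost-laptop descriptions share two words and end up closer to each other than either is to the email description, which is the kind of grouping a dendrogram makes visible.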

According to Nickerson et al. [33], the taxonomy creation steps are iterative. They further suggest ascertaining whether the approach is empirical-to-conceptual or conceptual-to-empirical. Since the data are presented as text descriptions, the approach is empirical-to-conceptual. Fig. 1 is a graphical depiction of the empirical-to-conceptual process for taxonomy creation.

Fig. 1. Empirical-to-conceptual taxonomy creation.

The first step is preparatory and establishes the meta-characteristics of the taxonomy. In other words, determine what is included, what the taxonomy does not contain, and other meta-characteristics. This taxonomy does not include phishing or social engineering attacks since they originate externally, and the human is the victim. Verizon [5] separates human error activities from activities in response to an external threat; in other words, phishing and ransomware attacks are considered separate from pure human mistakes. According to Schlackl et al. [40], human factors can be part of the reason for a data breach and are comprised of social engineering attacks and human error. Social engineering attacks are generated by an external entity, while human error is internal and accidental.

The Verizon report [5] and its database are consistent with Schlackl et al.’s [40] analysis, which separates social engineering attacks from human error. Thus, this taxonomy only includes accidental human errors leading to breaches. The second step is also preparatory and determines the ending conditions. The ending criteria are listed in Table I, summarizing the work of Nickerson et al. [33]. Nickerson et al. [33] also state that the taxonomy should be concise, with the number of dimensions allowing the taxonomy to be meaningful without being unwieldy or overwhelming. A heuristic for this condition is ensuring that the number of dimensions falls in the range of seven plus or minus two [41].

| Ending criteria | Met? |
| --- | --- |
| All objects examined | Yes |
| No objects split or merged | Yes |
| One object per characteristic | Yes |
| New data do not add dimensions | Yes |
| New data do not split dimensions | Yes |
| Dimensions are unique | Yes |
| Characteristics are unique | Yes |
| Cells are unique | Yes |
| Is concise, robust, comprehensive, extendable, explanatory | Yes |
Table I. Ending Criteria for Taxonomy Creation

The next steps are the mechanics of building a taxonomy tree. Fig. 2 shows the entire process, including creation, validation, and maintenance. Each step will add more data, 100 cases at a time. The first two groups are taxonomy creation and validation, with the last grouping for maintenance. Validation ensures that no changes are needed from the dataset and that the taxonomy describes all the data. The difference between the maintenance and validation phases is that minor lower-level changes are expected. In other words, new stems and leaves can be added, but the top two levels of the taxonomy should stay the same, or a redesign is needed. This approach is similar to maintenance for the biological taxonomy by Linnaeus [42], with new species added but the fundamentals of the taxonomy persisting. A redesign means a taxonomy should be created from the beginning with new data, creating new branches or a new tree.
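The maintenance rule described above (lower-level additions are acceptable, while a change to the top two levels calls for a redesign) can be sketched as a comparison of two versions of a taxonomy. The nested-dictionary layout, the function names, and the third-level leaf labels are assumptions made for illustration; the category and subcategory names are taken from this paper.

```python
import copy

# Levels: top-level category -> subcategory -> list of lower-level items.
taxonomy = {
    "Lost media": {
        "Lost CPUs": [],
        "Lost paper": [],
        "Lost storage devices": [],
        "Lost smart devices": [],
    },
    "Misconfiguration": {
        "Unsecured cloud database": [],
        "Internal data made public": ["Public repository"],
    },
}


def top_two_levels(tax):
    """Set of categories and (category, subcategory) pairs."""
    categories = {(cat,) for cat in tax}
    pairs = {(cat, sub) for cat, subs in tax.items() for sub in subs}
    return categories | pairs


def needs_redesign(before, after):
    """True when the top two levels differ between two taxonomy versions."""
    return top_two_levels(before) != top_two_levels(after)


# Maintenance: a new leaf is added below an existing subcategory.
updated = copy.deepcopy(taxonomy)
updated["Misconfiguration"]["Internal data made public"].append("AI tool")

# Hypothetical structural change: a new top-level category appears.
restructured = copy.deepcopy(taxonomy)
restructured["New top-level category"] = {"New subcategory": []}
```

Here `needs_redesign(taxonomy, updated)` is false because only a leaf changed, while `needs_redesign(taxonomy, restructured)` is true.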

Fig. 2. Taxonomy creation, validation, and maintenance.

Word clouds, word frequency counts, and a cluster dendrogram will create the initial tree, with later iterations adding subgroups by identifying subsets of objects (i.e., forming the taxonomy tree), delineating common characteristics and grouping objects (i.e., creating the tree branches), and grouping characteristics into dimensions (i.e., developing the tree stems, with individual activities as leaves). As required, an agglomerative coefficient will be measured to determine the strength of the clustering in the dendrogram, with a measure close to one implying a strong relationship. Using this coefficient will help delineate the more subtle linkages. The last step is to decide whether the ending conditions are met and iterate until they are achieved. The ending conditions are summarized in Table I.

Population and Sample

The target population for this study is the publicly available data breach databases from Verizon [43], the State of California Department of Justice [44], and the US Department of Health and Human Services (HHS) [45]. Each database is available for research, is continually updated, and contains all the reported breaches for events affecting more than 500 users. The Verizon database contains over 10,000 entries, the HHS database approximately 5000, and the California database around 3500. The HHS database focuses on healthcare breaches, while the Verizon database considers all breaches. The California database is only concerned with California residents. A subset of the databases concerns human error. The Verizon report [5] shows almost 20% of breaches involve human error. Thus, the estimated population is approximately 3700 entries.

However, breach descriptions may not be sufficiently detailed to determine the root cause of the breach event, and the databases may overlap. The goal was to use 1000 entries to derive the initial taxonomy tree and then validate the taxonomy with 500 new events. Therefore, the sample size is 1500 database entries, with 1000 entries used for artifact creation and 500 entries used for validation. Another 350 entries were used to demonstrate artifact maintenance, resulting in a need for 1850 entries. These data were sufficient to create, validate, and maintain the taxonomy.
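The population estimate and sample split above follow from simple arithmetic, reproduced here as a check. The database sizes are the approximate figures quoted in the text, and the 20% human-error share is the rounded value from the Verizon report [5]; the variable names are illustrative.

```python
# Approximate database sizes quoted in the text.
database_sizes = {"Verizon": 10_000, "HHS": 5_000, "California": 3_500}
human_error_share = 0.20  # "almost 20%" of breaches involve human error

# Estimated number of human-error entries across the three databases.
estimated_population = round(sum(database_sizes.values()) * human_error_share)

# Entries needed for each phase of the study.
sample = {"creation": 1000, "validation": 500, "maintenance": 350}
entries_needed = sum(sample.values())
```

The estimate comes to 3700 entries, of which 1850 are needed, before allowing for overlap between databases or descriptions too thin to classify.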


Dendrogram: A dendrogram is a graphical representation of a hierarchical clustering technique. It is typically plotted as a tree [46].

Hierarchical Clustering: Forina et al. [46] state that hierarchical clustering is used for unsupervised pattern recognition and consists of visual techniques, hierarchical methods of agglomerative and divisive techniques, and non-hierarchical methods. Hierarchical clustering often uses Euclidean distances to determine the clusters and is used as a machine-learning model.

Insider Threat: Tsiostas et al. [2] define an insider threat as someone who has an organization’s credentials and permissions and acts in a way that harms the organization. The insider threat may be malicious or unintentional.

Malicious Insider Threat: A malicious insider threat is an entity that intentionally causes damage, including sabotage, intellectual property theft or disclosure, and release of proprietary or personal information [47]. They are motivated to do harm to an organization.

Ontology: In Computer Science, an ontology is a formal specification that defines the concepts and relationships that exist for an agent or a community of agents [15]. Guarino et al. [16] provided a more rigorous definition using modeling language. An example of how ontologies support artificial intelligence is described by Olivares-Alarcos et al. [17] in their discussion of comparing different ontology-based methods for robot autonomy. In Computer Science, an ontology describes the artificial world of behaviors, user stories, threads, schema, objects, interactions, and hierarchies needed to create an application. Ontologies are part of the Semantic Model for Artificial Intelligence [48].

Semantic Model: In Artificial Intelligence, a semantic model is one where the knowledge representation is a language with specific syntax and semantics. Semantic models include signature language-based, event embedding, and ontology learning [48].

Taxonomy: A taxonomy is a classification scheme with labels. It shows relationships between objects, with the relationship often displayed as a structure [14]. An example is the biological taxonomy developed by Carl Linnaeus [42], consisting of kingdom, phylum, class, order, family, genus, and species. A species can only exist in one hierarchical description, a key criterion for a taxonomy [33].

Unintentional Insider Threat: US Computer Emergency Response Team [4] defines an unintentional insider threat as an entity that causes harm or increases the risk of future damage to an information technology enterprise’s confidentiality, integrity, or availability without malicious intent. This term is often referred to as an Accidental Insider Threat or Negligent Insider Threat. There is no intent to harm the organization. Under the right circumstances, all individuals can become unintentional insider threats. US federal government reports often abbreviate this term as UIT.

Experiment and Results


The data were scraped from the State of California Department of Justice [44] and the US Department of Health and Human Services (HHS) [45] and copied into an Excel spreadsheet using examples of human error. The HHS and California databases were also used for validation since there were over 1800 entries in both databases. The total number of events captured from the HHS, California, and Verizon databases was 1850. The Verizon database and the remainder of the HHS and California databases were used to demonstrate a maintenance capability, using 350 events. The distribution of breaches caused by human errors was Gaussian, with their presence uniformly distributed. Human error events were approximately 20% of the scraped data.

For the creation of the taxonomy, 1000 entries were used from the California and HHS databases. R scripts created word frequency counts, word clouds, and cluster dendrogram plots using references from Paradis [49], the University of Cincinnati [39], and Silge and Robinson [37]. Cluster dendrograms were created using Euclidean distances. The classic method of distance measure is Euclidean [38]. The initial sparsity value is 0.85, which highlights the dominant connections.
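The sketch below shows word frequency counts and a sparsity filter on a toy corpus. It assumes that the sparsity of a term is the fraction of documents in which the term does not occur and that only terms with sparsity below the threshold are kept, as is common in text-mining toolkits; the source does not state the exact definition used. The documents and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical breach descriptions, invented for illustration only.
docs = [
    "laptop lost at airport",
    "laptop lost on train",
    "laptop stolen from car",
    "email sent to wrong recipient",
    "email sent with wrong attachment",
    "paper records lost during move",
    "usb drive lost",
    "email sent unencrypted",
]
tokenized = [doc.split() for doc in docs]

# Word frequency counts over the whole corpus.
frequencies = Counter(word for tokens in tokenized for word in tokens)


def term_sparsity(term):
    """Fraction of documents that do not contain the term (assumed definition)."""
    absent = sum(1 for tokens in tokenized if term not in tokens)
    return absent / len(tokenized)


SPARSITY_THRESHOLD = 0.85  # value reported in the text

# Keep only terms that are present often enough to fall below the threshold.
kept_terms = sorted(t for t in frequencies if term_sparsity(t) < SPARSITY_THRESHOLD)
```

With eight documents, a term found in a single document has sparsity 0.875 and is dropped, while a term found in two or more documents is kept, leaving only the dominant terms.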

Five top-level categories of human error were chosen based on initial analysis: application problems, inappropriate data permissions, misconfigurations, lost media, and communications problems. The key metric is the agglomerative coefficient, which is a measure of clustering [50]. For each of these categories, an agglomerative coefficient was calculated under each of the common hierarchical clustering methods: average, single, complete, and Ward’s method [51]. The method that had a coefficient closest to 1.0 was chosen for the dendrogram. In each case, Ward’s approach had the highest coefficient. Therefore, Ward’s method was used to create the dendrogram. Noise words that did not contribute to the taxonomy process were removed by the R code. Noise word examples are months, medicine, and state names.
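To make the coefficient concrete, the following sketch computes an agglomerative coefficient for a handful of points under several linkage methods and selects the method closest to one. It assumes the common definition in which, for each observation, the height of its first merge is divided by the height of the final merge, and the coefficient is the mean of one minus that ratio. Ward's linkage is implemented here with the Lance-Williams update on squared distances; other implementations may scale heights differently. The sample points and function names are hypothetical, and this is not the R code used in the study.

```python
import math


def agglomerative_coefficient(points, method="ward"):
    """Mean of 1 - (first-merge height / final-merge height) over all points.

    Assumes at least two distinct points, so the final height is non-zero.
    """
    if method not in {"single", "complete", "average", "ward"}:
        raise ValueError(f"unknown method: {method}")
    n = len(points)
    sizes = {i: 1 for i in range(n)}  # active cluster id -> size
    dist = {}                         # (smaller id, larger id) -> dissimilarity
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[(i, j)] = d * d if method == "ward" else d

    first_merge = {}  # original point -> height of its first merge
    next_id, height = n, 0.0
    while len(sizes) > 1:
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        height = math.sqrt(d_ab) if method == "ward" else d_ab
        for c in (a, b):
            if c < n:  # ids below n are original points
                first_merge[c] = height
        na, nb = sizes.pop(a), sizes.pop(b)
        del dist[(a, b)]
        for k, nk in sizes.items():
            dka = dist.pop((min(k, a), max(k, a)))
            dkb = dist.pop((min(k, b), max(k, b)))
            if method == "single":
                new = min(dka, dkb)
            elif method == "complete":
                new = max(dka, dkb)
            elif method == "average":
                new = (na * dka + nb * dkb) / (na + nb)
            else:  # Ward, Lance-Williams update on squared distances
                new = ((nk + na) * dka + (nk + nb) * dkb - nk * d_ab) / (nk + na + nb)
            dist[(k, next_id)] = new
        sizes[next_id] = na + nb
        next_id += 1
    return sum(1 - h / height for h in first_merge.values()) / n


# Hypothetical points forming two tight, well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
methods = ["average", "single", "complete", "ward"]
coefficients = {m: agglomerative_coefficient(points, m) for m in methods}
best_method = max(coefficients, key=coefficients.get)
```

For these well-separated pairs every method gives a coefficient near 0.9 or above, and Ward's linkage gives the largest value on this toy input.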

This process was iterative, using 100 cases on each iteration, creating ten passes through the original 1000 cases. Each pass created or modified the taxonomy design, using the stopping conditions defined in Table I [33] and recording the agglomerative coefficients. Word clouds, word frequency counts, and cluster dendrograms for the first 1000 events are presented for Lost Media in Figs. 3a–3c.

Fig. 3. Lost media: a) Word cloud for lost media, b) Word frequency counts for lost media, c) Hierarchical cluster dendrogram for lost media.

Agglomerative coefficients were measured to determine the strength of the clustering in the dendrogram and within the category, with a measure close to one implying a strong relationship. The final agglomerative coefficient for each category is presented in Table II, showing that Ward’s method is the preferred method of clustering. All coefficients obtained with Ward’s method are greater than 0.5, with the lowest value of 0.576 for the category Misconfigurations and the highest value of 0.787 for Application Problems.

| Category | Average | Single | Complete | Ward |
| --- | --- | --- | --- | --- |
| Application problem | 0.719 | 0.731 | 0.738 | 0.787 |
| Communications errors | 0.588 | 0.565 | 0.620 | 0.674 |
| Inappropriate data permissions | 0.535 | 0.448 | 0.593 | 0.708 |
| Lost media | 0.407 | 0.380 | 0.481 | 0.584 |
| Misconfiguration | 0.459 | 0.437 | 0.512 | 0.576 |
Table II. Agglomerative Coefficients for Each Category by Type
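The selection rule (choose the method whose coefficient is closest to 1.0) can be checked directly against the values reported in Table II. The dictionary layout and variable names below are illustrative; the numbers are copied from the table.

```python
# Agglomerative coefficients from Table II, by category and linkage method.
table_ii = {
    "Application problem": {"Average": 0.719, "Single": 0.731, "Complete": 0.738, "Ward": 0.787},
    "Communications errors": {"Average": 0.588, "Single": 0.565, "Complete": 0.620, "Ward": 0.674},
    "Inappropriate data permissions": {"Average": 0.535, "Single": 0.448, "Complete": 0.593, "Ward": 0.708},
    "Lost media": {"Average": 0.407, "Single": 0.380, "Complete": 0.481, "Ward": 0.584},
    "Misconfiguration": {"Average": 0.459, "Single": 0.437, "Complete": 0.512, "Ward": 0.576},
}

# Coefficients never exceed 1, so "closest to 1.0" means the largest value.
best_method = {
    category: max(scores, key=scores.get) for category, scores in table_ii.items()
}

# Range of the Ward coefficients across categories.
ward = {category: scores["Ward"] for category, scores in table_ii.items()}
lowest_ward = min(ward, key=ward.get)
highest_ward = max(ward, key=ward.get)
```

For every category the largest coefficient is in the Ward column, and the Ward values range from 0.576 (Misconfiguration) to 0.787 (Application problem).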

The most common approach to verify a taxonomy is to determine if the taxonomy is germane with new data [32]. The final taxonomy was evaluated with 500 more entries from CA and the HHS databases, resulting in a total of 1500 entries. The new data did not change the structure of the taxonomy, and all stop conditions established in Table I were met. There were no significant differences in the dendrograms. Another method to demonstrate taxonomy completeness is to generate an ontology. An ontology thread for email communications errors is shown in Fig. 4.

Fig. 4. Ontology thread for email, using taxonomy definitions.

Data were added from the Verizon database to the remaining entries from the CA and the HHS databases to prove the concept of taxonomy maintenance. The Verizon database contained entries from Canada, the United Kingdom, France, India, South Korea, and Japan, as well as the United States. There were no changes to the major portion of the taxonomy other than to add a new type of accidental misconfiguration. The agglomerative coefficients for creation, validation, and maintenance are also presented in Table III.

| Category | Creation | Validation | Maintenance |
| --- | --- | --- | --- |
| Application problem | 0.787 | 0.694 | 0.635 |
| Communications errors | 0.674 | 0.680 | 0.682 |
| Inappropriate data permissions | 0.708 | 0.629 | 0.636 |
| Lost media | 0.584 | 0.559 | 0.503 |
| Misconfiguration | 0.576 | 0.557 | 0.573 |
Table III. Agglomerative Coefficients for Each Category by Process

The category for Application Problem changed the most with each phase. This is somewhat expected as this category is the most broadly defined with little commonality among problem areas. As this subject area is more researched, it is likely that the category will evolve and possibly break into two or more categories.
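The observation about Application Problem can be reproduced from the Table III values by measuring how far each category's coefficient moves across the three phases. Using the spread (maximum minus minimum) as the measure of change is an assumption made for this sketch, as are the variable names; the numbers are copied from the table.

```python
# Agglomerative coefficients from Table III: (creation, validation, maintenance).
table_iii = {
    "Application problem": (0.787, 0.694, 0.635),
    "Communications errors": (0.674, 0.680, 0.682),
    "Inappropriate data permissions": (0.708, 0.629, 0.636),
    "Lost media": (0.584, 0.559, 0.503),
    "Misconfiguration": (0.576, 0.557, 0.573),
}

# Spread of each category's coefficient across the three phases.
spread = {
    category: round(max(values) - min(values), 3)
    for category, values in table_iii.items()
}
most_changed = max(spread, key=spread.get)
least_changed = min(spread, key=spread.get)

# Lowest coefficient observed in any phase, for comparison with the 0.5 level.
lowest_value = min(min(values) for values in table_iii.values())
```

Application problem shows the largest spread (0.152) and Communications errors the smallest (0.008); the lowest coefficient in any phase is 0.503, for Lost media during maintenance.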

The category of Lost Media consists of lost CPUs, lost paper, lost storage devices, and lost smart devices. To demonstrate the clustering in the subcategories, a word cloud, word frequency counts, and a hierarchical dendrogram for lost storage devices are presented in Figs. 5a–5c. These data were derived from the maintenance database for Lost Media, highlighting the entries pertaining to lost storage devices. The agglomeration coefficient for this subcategory was calculated to be 0.709 using Ward’s method. The Lost Storage coefficient is higher than the coefficient for all Lost Media, implying this subcategory is better clustered.

Fig. 5. Lost Storage Devices: a) Word cloud for lost storage devices, b) Word frequency counts for lost storage devices, c) Hierarchical cluster dendrogram for lost storage devices.

Taxonomy Description

Taxonomies are often displayed as trees or hierarchical tables [14]. A hierarchical table of the taxonomy after creation, validation, and maintenance is shown in Fig. 6. Three additional items were added in the validation phase (web links and problems with mailing storage devices), while two items were added in the maintenance phase (Zoom and ChatGPT). These are denoted in yellow and blue highlights for validation and maintenance respectively. All the items added were at the sub-sub-category level, meaning the design of the taxonomy is appropriate. The taxonomy consists of five top-level categories: application problems, inappropriate data permissions, lost media, misconfigurations, and communications problems.

Fig. 6. Taxonomy after creation, validation, and maintenance.

Defining three classes of actors (user, maintainer, developer) is helpful in looking forward to an eventual ontology. A user is one with limited privileges. A maintainer ensures configuration settings are correct while a developer writes code. The difference between misconfiguration and an application problem is complexity; application problems include coding errors or privileged user activities. Table IV provides a mapping between the taxonomy categories and actor classes.

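| Category | Subcategory | User | Maintainer | Developer |
| --- | --- | --- | --- | --- |
| Application problem | All subcategories | N | N | Y |
| Communications errors | Physical mail | N | Y | Y |
| Communications errors | Email | Y | Y | N |
| Communications errors | Unapproved method | Y | Y | Y |
| Communications errors | Dropbox | Y | Y | N |
| Communications errors | Printers/Faxes | Y | Y | N |
| Inappropriate data permissions | Inappropriate employee permissions | N | Y | Y |
| Inappropriate data permissions | Training apps with real data | N | Y | Y |
| Inappropriate data permissions | Password sharing | Y | Y | Y |
| Inappropriate data permissions | Incorrectly set collaboration tools | N | Y | Y |
| Inappropriate data permissions | Family member involvement | Y | Y | Y |
| Inappropriate data permissions | Incorrect physical access | N | Y | Y |
| Inappropriate data permissions | Storing data to unapproved device | Y | Y | Y |
| Lost media | All subcategories | Y | Y | Y |
| Misconfiguration | All subcategories | N | Y | Y |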
Table IV. Actor Class Mapped to Subcategory

Application problems consist of software errors on websites and mobile applications that leak data. The cause is developer error. A major area of concern is business analytics and digital tracking, which are often embedded on a website, as these functions can cause data leaks. The integration of mobile applications and web applications was also observed to lead to software problems, causing data breaches. JavaScript Object Notation (JSON) is widely used and allows easy, text-readable data transport. The text-readable feature can also create accidental data leakage. The application coder may accidentally leave admin privileges open, leaking data. The application coder may use third-party software that leaks data. Lastly, as the types of software errors are effectively unbounded, there is a miscellaneous category for other types of application errors.

Communication errors include problems with physical mail, email, unapproved communication methods, and printer/fax errors. The significant problems with physical mail are inappropriate recipients, labels containing sensitive information such as PII, and using postcards where the content is sensitive. Examples are mass mailing to the wrong individuals, mailing labels that contain social security numbers, and postcards that discuss medical procedures. Physical mail can also contain storage media that are sent to the wrong address or contain sensitive hidden data that the sender did not notice. Email can have wrong addresses, wrong attachments, incorrect use of the blind copy function, unencrypted content, or can be accidentally sent to a home address. For example, it is common for organizations to use a standard corporate email address (firstname.lastname@company.com). It is also common for users to use a similar standard for their home address (firstname.lastname@gmail.com). Confusing these two addresses is easy and can cause data breaches. Enterprises also control forms of communication for their sensitive data and disallow the use of other applications such as WhatsApp and cell phone text messages. Print and fax jobs can also go to the wrong address, and printer addressing for remote work is problematic.

Inappropriate data permissions occur when an entity is granted privileges that it should not have. This category is for problems within an enclave or enterprise. An example would be an employee who transfers to another position but retains the data permissions from their previous role. This category also includes physical access to a secure location. Another common problem is training scenarios where live data are used, but the trainees do not have the appropriate role to see the live data. Several problems with collaboration tools were noted. Collaboration tools include SharePoint, Zoom, shared drives, or other custom applications. Other problems occur when a user mistakes their own circle of trust for the enterprise’s circle of trust. In other words, the enterprise does not recognize family or personal belongings as part of the circle, but the user mistakenly does. This blurring between personal and professional circles results in password sharing, family member involvement, or storing data on personal devices. An example would be a person requiring help with spreadsheet formatting. The person asks his spouse for help and provides her with the file. He is unaware the file has sensitive data that his spouse should not see. Another example would be storing data on a personal device, not realizing the data sensitivity. Both cases reflect confusing a personal circle of trust with that of the enterprise.

Lost media are central processing units (CPUs), smart devices, memory storage devices, or paper that contain sensitive data and are lost. CPUs consist of laptops, servers, desktop computers, tablets, and smartphones. Smart devices are those devices that are not categorized as CPUs but can contain sensitive data. Examples of smart devices include medical scanners and iris scanners. Memory storage devices include flash cards, external hard drives, Universal Serial Bus (USB) flash drives, compact discs, digital video disks (DVDs), tapes, and floppy disks. The paper category includes all paper-based media (e.g., folders, letters, reports, day-planners). Organizational moves to a new physical place often cause misplaced or lost media, with improper disposal of paper media a common mechanism for losing paper products.

Misconfigurations occur when data are accidentally leaked to the public as a result of relatively simple configuration problems. There are two large sub-categories: unsecured public databases within the cloud and an organization accidentally making its internal databases public. There are numerous examples of unsecured databases and misconfigured Amazon S3 buckets in the cloud. Organizations can accidentally make their data public by posting data on a public repository such as GitHub or Docker, providing data to an AI tool such as ChatGPT or Google Bard, placing files on a publicly accessible server, creating web links that allow public access to servers, or misconfiguring FTP sites. Other sub-categories include incorrect or missing firewall and router settings, making a test or training platform that holds actual data publicly accessible, and unsecured remote employee connections.


The concept of using text mining tools for taxonomy development is practical and repeatable. Word clouds and frequency counts can establish high-level taxonomy categories. An agglomerative coefficient can quantitatively measure how strongly a subject area is clustered and can indicate when a cluster may need revision. As this is a relatively new area of research, the scientific discussion of an acceptable clustering value for text mining is somewhat vague, other than that a value close to one indicates strong clustering. For this research, an arbitrary lower bound of 0.5 was considered acceptable. Much lower values, such as 0.1, would indicate that a cluster is too broad and should be split. Agglomerative coefficients can also be used to evaluate the addition of new data, whether for taxonomy validation or maintenance, to determine whether the new data significantly change the clustering. The top-level categories of Application Problems, Inappropriate Data Permissions, Lost Media, Misconfiguration, and Communications Errors maintained their position as top-level categories, each with a clustering coefficient greater than 0.5. Table V displays the total incidents by top-level category for each database.
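The agglomerative coefficient discussed above can be illustrated with a short sketch. The Python below is not the code used in this research; it is a minimal, standard-library-only illustration that assumes Euclidean distance on small numeric vectors, Ward linkage computed with the Lance-Williams update, and the usual definition of the coefficient as the mean, over all observations, of one minus the height at which the observation first merges divided by the height of the final merge. The function name `agglomerative_coefficient`, the toy data, and the constant `ACCEPTABLE` (set to 0.5, the lower bound adopted in this research) are illustrative assumptions. In practice, the input rows would be term-frequency vectors derived from the breach narratives rather than toy coordinates, and values reported by different software packages may differ slightly.

```python
import math
from itertools import combinations


def _pair(a, b):
    """Return two cluster ids as an ordered dictionary key."""
    return (a, b) if a < b else (b, a)


def agglomerative_coefficient(points):
    """Agglomerative coefficient of a Ward hierarchical clustering.

    points: list of equal-length numeric tuples, one per observation.
    Returns a value between 0 and 1; values near 1 suggest strong clustering.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two observations")

    # Active clusters and their sizes; ids 0..n-1 are the original observations.
    size = {i: 1 for i in range(n)}
    # Squared Euclidean distance between every pair of active clusters.
    dist = {
        (i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(range(n), 2)
    }

    first_merge = {}    # height at which each observation first joins a cluster
    final_height = 0.0  # height of the latest merge (the last one when done)
    next_id = n

    while len(size) > 1:
        # Merge the closest pair of active clusters.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        final_height = math.sqrt(max(d_ij, 0.0))
        for c in (i, j):
            if c < n:  # an original observation merging for the first time
                first_merge[c] = final_height

        n_i, n_j = size.pop(i), size.pop(j)
        del dist[(i, j)]

        # Lance-Williams update for Ward linkage on squared distances.
        for k, n_k in size.items():
            d_ki = dist.pop(_pair(k, i))
            d_kj = dist.pop(_pair(k, j))
            dist[_pair(k, next_id)] = (
                (n_i + n_k) * d_ki + (n_j + n_k) * d_kj - n_k * d_ij
            ) / (n_i + n_j + n_k)

        size[next_id] = n_i + n_j
        next_id += 1

    if final_height == 0.0:
        return 0.0  # all observations identical; no structure to measure
    return sum(1.0 - first_merge[i] / final_height for i in range(n)) / n


ACCEPTABLE = 0.5  # lower bound adopted in this research

# Toy data (hypothetical): two tight groups versus evenly spaced points.
tight = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
even = [(0.0,), (1.0,), (2.0,), (3.0,)]

for name, data in (("tight", tight), ("even", even)):
    ac = agglomerative_coefficient(data)
    print(f"{name}: AC = {ac:.3f}, acceptable = {ac > ACCEPTABLE}")
```

With this toy data, both sets clear the 0.5 bound, the tightly grouped set by a wider margin. The coefficient tends to grow with the number of observations, so comparisons are most meaningful between data sets of similar size.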

| Category | HHS | CA | Verizon | Total |
|---|---|---|---|---|
| Application problem | 53 | 70 | 5 | 128 |
| Communications errors | 511 | 148 | 7 | 666 |
| Inappropriate data permissions | 187 | 29 | 0 | 216 |
| Lost media | 291 | 212 | 4 | 507 |
| Misconfiguration | 155 | 155 | 24 | 334 |
| Total | 1197 | 614 | 40 | 1851 |

Table V. Data Breach Incidents for Each Category by Database

The HHS database had missing descriptions for an additional 330 entries. Those entries are not reflected in the above table. The dominant problems are lost media and communication errors. The major concerns in the lost media area are lost storage devices and lost portable computers (e.g., smartphones and laptops). Media are often misplaced during office moves or office remodeling. Damage to a storage facility may also expose data. The major concerns for communication errors were misaddressed email and mislabeled physical mail.
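The dominance of communication errors and lost media can be checked directly from Table V. The short Python sketch below recomputes the row totals and each category's share of all incidents. The dictionary layout and variable names are illustrative assumptions; the counts are copied from the table.

```python
# Incident counts from Table V, ordered as (HHS, CA, Verizon).
counts = {
    "Application problem": (53, 70, 5),
    "Communications errors": (511, 148, 7),
    "Inappropriate data permissions": (187, 29, 0),
    "Lost media": (291, 212, 4),
    "Misconfiguration": (155, 155, 24),
}

# Total incidents per category and across all categories.
row_totals = {category: sum(values) for category, values in counts.items()}
grand_total = sum(row_totals.values())

# Rank categories by their share of all recorded incidents.
ranked = sorted(row_totals.items(), key=lambda item: item[1], reverse=True)
for category, total in ranked:
    print(f"{category}: {total} incidents ({100 * total / grand_total:.1f}%)")
```

Communication errors and lost media together account for 1,173 of the 1,851 incidents, or roughly 63%.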

Emerging problems in the application area are business data analytics and the incorporation of unapproved software tools. There were several instances of websites accidentally capturing sensitive data, particularly on medical websites. Software tools, such as cloud-based transcription services, are potentially problematic and can create data spills. An emerging problem for inappropriate permissions was training or testing with real data rather than anonymized data. Unfortunately, when the test or training platform is inadequately protected, those data can be spilled. The latest problem area for lost media is smart devices. These devices may capture significant amounts of data, with potentially damaging consequences if lost. They are also typically unencrypted. As devices become more intelligent, this will become a growing area of concern. The emerging area of concern for misconfigurations is the use of public repositories or artificial intelligence tools (e.g., GitHub and ChatGPT) for development or problem-solving, which can lead to live data being accidentally spilled. In the communications problem area, modern text messaging systems such as WhatsApp and Signal are an emerging concern because these systems are often unmanaged. These newer systems allow large file transfers over systems that are not necessarily secure. Data spills are already occurring. Other forms of communication (e.g., Slack, Teams, Dropbox, and other corporate forums) are also potential areas of concern.


An advantage of categorizing unintentional insider threat activities is the possibility of creating tailored mitigation strategies. This paper demonstrated that a taxonomy of unintentional insider threat activities can be created, validated, and maintained using text mining and hierarchical cluster analysis techniques. Because the method used in this paper is general, the approach could be expanded to other fields that require taxonomies. A standard taxonomy is necessary to create an ontology, enabling the development of artificial intelligence and machine learning tools. Ward's agglomeration coefficient, with values greater than 0.5, can be used to check that the clustering is valid. This approach removes some of the subjectivity of taxonomy creation and allows taxonomies to be maintained.


  1. Morgan S. Cybercrime to cost the world $8 trillion annually in 2023. Cybercrime Magazine; 2022. [October 17; cited 2023 November 20]. Available from: https://cybersecurityventures.com/cybercrime-to-cost-the-world-8-trillion-annually-in-2023/.
  2. Tsiostas D, Kittes G, Chouliaras N, Kantzavelou I, Maglaras L, Douligeris C, et al. The insider threat: Reasons, effects and mitigation techniques. 24th Pan-Hellenic Conference on Informatics, pp. 340–5, Athens, Greece: Association for Computing Machinery; November 20–22 2020. doi: 10.1145/3437120.3437336.
  3. Ponemon Institute. 2022 cost of insider threats global report. 2022. [cited 2023 November 20]. Available from: https://www.proofpoint.com/sites/default/files/threat-reports/pfpt-us-tr-the-cost-of-insiderthreats-ponemon-report.pdf.
  4. CERT Insider Threat Team. Unintentional Insider Threats: A Foundational Study. Software Engineering Institute; 2013. doi: 10.1184/R1/6585575.v1.
  5. Verizon. 2008 to 2022 data breach investigations report. Available from: https://www.verizon.com/business/resources/T705/reports/dbir/2022-data-breach-investigations-report-dbir.pdf (accessed 2023).
  6. Schoenherr JR, Thomson R. The cybersecurity (CSEC) questionnaire: individual differences in unintentional insider threat behaviours. Proceedings of the 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment, pp. 1–8, Dublin, Ireland, June 14–18 2021. doi: 10.1109/CyberSA52016.2021.9478213.
  7. Apruzzese G, Laskov P, Montes de Oca E, Mallouli W, Rapa LB, Grammatopoulos AV, et al. The role of machine learning in cybersecurity. Digit Threats. 2022;4(1):1–38. doi: 10.1145/3545574.
  8. Ali A, Septyanto AW, Chaudhary I, Hamadi HA, Alzoubi HM, Khan ZF. Applied artificial intelligence as event horizon of cyber security. 2022 International Conference on Business Analytics for Technology and Security (ICBATS), pp. 1–7, Dubai, United Arab Emirates, February 16–17 2022. doi: 10.1109/ICBATS54253.2022.9759076.
  9. Capuano N, Fenza G, Loia V, Stanzione C. Explainable artificial intelligence in cybersecurity: a survey. IEEE Access. 2022;10:93575–600. doi: 10.1109/ACCESS.2022.3204171.
  10. Chan L, Morgan I, Simon H, Alshabanat F, Ober D, Gentry J, et al. Survey of AI in cybersecurity for information technology management. Proceedings of 2019 IEEE Technology & Engineering Management Conference (TEMSCON), pp. 1–8, 2019. doi: 10.1109/TEMSCON.2019.8813605.
  11. Rani V, Kumar M, Mittal A, Kumar K. Artificial intelligence for cybersecurity: recent advancements, challenges and opportunities. In Robotics and AI for Cybersecurity and Critical Infrastructure in Smart Cities. Nedjah N, Abd El-Latif AA, Gupta BB, Mourelle LM, Eds. Cham: Springer International Publishing, 2022, pp. 73–88.
  12. Zhao L, Zhu D, Shafik W, Matinkhah SM, Ahmad Z, Sharif L, et al. Artificial intelligence analysis in cyber domain: a review. Int J Distrib Sens Netw. 2022;18(4):15501329221084882. doi: 10.1177/15501329221084882.
  13. Chandrasekaran B. Towards a taxonomy of problem solving types. AI Mag. 1983;4(1):9. doi: 10.1609/aimag.v4i1.383.
  14. Vegas S, Juristo N, Basili VR. Maturing software engineering knowledge through classifications: a case study on unit testing techniques. IEEE Trans Softw Eng. 2009;35(4):551–65. doi: 10.1109/TSE.2009.13.
  15. Gruber TR. A translation approach to portable ontology specifications. Knowl Acquisit. 1993;5(2):199–220. doi: 10.1006/knac.1993.1008.
  16. Guarino N, Oberle D, Staab S. What is an ontology? In Handbook on Ontologies. Staab S, Studer R, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 1–17.
  17. Olivares-Alarcos A, Beßler D, Khamis A, Goncalves P, Habib MK, Bermejo-Alonso J, et al. A review and comparison of ontology-based approaches to robot autonomy. Knowl Eng Rev. 2019;34(e29):e29. doi: 10.1017/S0269888919000237.
  18. Akbar KA, Halim SM, Singhal A, Abdeen B, Khan L, Thuraisingham B. The design of an ontology for ATT&CK and its application to cybersecurity. Proceedings of the Thirteenth ACM Conference on Data and Application Security and Privacy, pp. 295–7, 2023. doi: 10.1145/3577923.3585051.
  19. Mohsen F, Zwart C, Karastoyanova D, Gaydadjiev G. A taxonomy for large-scale cyber security attacks. EAI Endorsed Trans Cloud Syst. 2022;7(21):e5. doi: 10.4108/eai.2-3-2022.173548.
  20. Bahsi H, Dola HO, Khalil SM, Korõtko T. A cyber attack taxonomy for microgrid systems. 2022 17th Annual System of Systems Engineering Conference (SOSE), pp. 324–31, Rochester, NY, June 7–11 2022. doi: 10.1109/SOSE55472.2022.9812642.
  21. Chaipa S, Ngassam EK, Shawren S. Towards a new taxonomy of insider threats. 2022 IST-Africa Conference (IST-Africa), pp. 1–10, 2022. doi: 10.23919/IST-Africa56635.2022.9845581.
  22. Gupta SB, Mohanty JR, Kumar PP. Taxonomy of cyber security metrics to measure strength of cyber security. Mater Today: Proc. 2023;80(3):2274–9. doi: 10.1016/j.matpr.2021.06.228.
  23. Villalón-Huerta A, Ripoll-Ripoll I, Marco-Gisbert H. A taxonomy for threat actors’ delivery techniques. Appl Sci. 2022;12(8):3929. doi: 10.3390/app12083929.
  24. Canham M, Posey C, Bockelman PS. Confronting information security’s elephant, the unintentional insider threat. International Conference on Human-Computer Interaction, pp. 316–34, Cham: Springer; 2020.
  25. Al-Mhiqani MN, Ahmad R, Zainal Abidin Z, Yassin W, Hassan A, Abdulkareem KH, et al. A review of insider threat detection: Classification, machine learning techniques, datasets, open challenges, and recommendations. Appl Sci. 2020;10(15):5208. doi: 10.3390/app10155208.
  26. Homoliak I, Toffalini F, Guarnizo J, Elovici Y, Ochoa M. Insight into insiders and IT: a survey of insider threat taxonomies, analysis, modeling, and countermeasures. ACM Comput Surv. 2019;52(2):30. doi: 10.1145/3303771.
  27. Yeo LH, Banfield J. Human factors in electronic health records cybersecurity breach: an exploratory analysis. Perspect Health Inf Manag. 2022;19(Spring):1i. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9123525/.
  28. Brumfield C. Recent cases highlight need for insider threat awareness and action. CSO Online. 2022, September 29. Available from: https://www.csoonline.com/article/3675348/recent-cases-highlight-need-for-insider-threat-awareness-and-action.html.
  29. Greitzer FL, Lee JD, Purl J, Zaidi AK. Design and implementation of a comprehensive insider threat ontology. Procedia Comput Sci. 2019;153:361–9. doi: 10.1016/j.procs.2019.05.090.
  30. Greitzer FL, Purl J, Sticha PJ, Yu MC, Lee J. Use of expert judgments to inform Bayesian models of insider threat risk. J Wirel Mobile Netw, Ubiquitous Comput Dependable Appl. 2021;12(2):3–47. doi: 10.22667/JOWUA.2021.06.30.003.
  31. Canito A, Aleid K, Praça I, Corchado J, Marreiros G. An ontology to promote interoperability between cyber-physical security systems in critical infrastructures. 2020 IEEE 6th International Conference on Computer and Communications (ICCC), pp. 553–60, December 11–14 2020. doi: 10.1109/ICCC51575.2020.9345163.
  32. Ralph P. Toward methodological guidelines for process theories and taxonomies in software engineering. IEEE Trans Softw Eng. 2019;45(7):712–35. doi: 10.1109/TSE.2018.2796554.
  33. Nickerson RC, Varshney U, Muntermann J. A method for taxonomy development and its application in information systems. Eur J Inform Syst. 2013;22(3):336–59. doi: 10.1057/ejis.2012.26.
  34. Humbatova N, Jahangirova G, Bavota G, Riccio V, Stocco A, Tonella P. Taxonomy of real faults in deep learning systems. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, Seoul, South Korea, October 5–11, 2020. doi: 10.1145/3377811.3380395.
  35. Lebeuf C, Zagalsky A, Foucault M, Storey MA. Defining and classifying software bots: a faceted taxonomy. 2019 IEEE/ACM 1st International Workshop on Bots in Software Engineering (BotSE), pp. 1–6, Montréal, Canada, May 28–29 2019. doi: 10.1109/BotSE.2019.00008.
  36. Mountrouidou X, Billings B, Mejia-Ricart L. Not just another internet of things taxonomy: a method for validation of taxonomies. Internet of Things. 2019;6:100049. doi: 10.1016/j.iot.2019.03.003.
  37. Silge J, Robinson D. Text Mining with R: A Tidy Approach. O’Reilly Media; 2022.
  38. Chipman H, Tibshirani R. Hybrid hierarchical clustering with applications to microarray data. Biostatistics. 2006;7(2):286–301. doi: 10.1093/biostatistics/kxj007.
  39. uc-r.github.io. UC Business Analytics R Programming Guide: Hierarchical Cluster Analysis. University of Cincinnati; 2018. [cited 2023 November 20]. Available from: https://uc-r.github.io/hc_clustering.
  40. Schlackl F, Link N, Hoehle H. Antecedents and consequences of data breaches: a systematic review. Inf Manage. 2022;59(4):103638. doi: 10.1016/j.im.2022.103638.
  41. Miller GA. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol Rev. 1956;63(2):81–97. doi: 10.1037/h0043158.
  42. Calisher CH. Taxonomy: What’s in a name? Doesn’t a rose by any other name smell as sweet? Croat Med J. 2007;48(2):268–70. Available from: https://europepmc.org/article/pmc/pmc2080517.
  43. Verizon. VERIS: the vocabulary for event recording and incident sharing. 2023. [cited 2023 November 20]. Available from: http://veriscommunity.net/index.html.
  44. State of California Department of Justice. Search data security breaches. 2023. [cited 2023 November 20]. Available from: https://oag.ca.gov/privacy/databreach/list.
  45. US Department of Health and Human Services. Breach portal: notice to the Secretary of HHS breach of unsecured protected health information. 2023. [cited 2023 November 20]. Available from: https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf.
  46. Forina M, Armanino C, Raggio V. Clustering with dendrograms on interpretation variables. Anal Chim Acta. 2002;454(1):13–9. doi: 10.1016/S0003-2670(01)01517-3.
  47. Le DC, Zincir-Heywood N, Heywood MI. Analyzing data granularity levels for insider threat detection using machine learning. IEEE T Netw Serv Man. 2020;17(1):30–44. doi: 10.1109/TNSM.2020.2967721.
  48. Levshun D, Kotenko I. A survey on artificial intelligence techniques for security event correlation: models, challenges, and opportunities. Artif Intell Rev. 2023. doi: 10.1007/s10462-022-10381-4.
  49. Paradis E. R for Beginners. 2005. [cited 2023 November 20]. Available from: https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf.
  50. Blashfield RK. Mixture model tests of cluster analysis: accuracy of four agglomerative hierarchical methods. Psychol Bull. 1976;83(3):377. doi: 10.1037/0033-2909.83.3.377.
  51. Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58(301):236–44. doi: 10.1080/01621459.1963.10500845.
