Decision Tree Models for Predicting the Effect of Electronic Waste on Human Health

Informal processing of electronic waste (e-waste) has become one of the most common sources of employment in developing countries, and it has had a great impact on human health owing to the improper disposal of the heavy metals found in these waste materials. Several research works have been conducted to predict e-waste generation and management. Unfortunately, no study has predicted the diseases associated with informal e-waste activities and the disposal of e-waste products. This study predicts the category of disease of a person working and/or living at an electronic waste dump site, based on their activities and lifestyle, using decision tree algorithms. The categorized diseases are skin, respiratory and reproductive diseases. The work compared the performance of the C4.5 algorithm, which used the chi-squared test for tree pruning to handle overfitting, with the Classification and Regression Tree (CART) algorithm, which used tree depth control to handle overfitting. The C4.5 algorithm proved more effective than the CART algorithm. The study recommends that whenever two or more algorithms can in principle be used to handle the same problem, they should all be used and their results compared.

These items of lighting equipment include fluorescent tubes, halogen bulbs, light emitting diode (LED) bulbs, sodium bulbs and high-intensity discharge lamps. Another source of e-waste is consumer equipment such as television sets, radios, musical equipment, audio players, cameras and hi-fi equipment.
The list does not end there: hospitals and clinics have medical devices which also contribute to e-waste generation. These devices include dialysis machines, medical fridges and freezers, ultrasonic machines, cardiology equipment, analysis equipment, respiratory ventilators and thermometers. Monitoring and control instruments and dispensers, such as microcontrollers, sensors, valves, heating regulators, smoke detectors, thermostats, drinking dispensers, and spray and money dispensers, also contribute to e-waste generation. Furthermore, Information Technology (IT) and telecommunication devices such as laptop computers, desktop computers, mobile phones, calculators, printers, scanners, photocopying machines, network cables and Wi-Fi modems also produce e-waste.
E-waste has become a topical issue due to the hazards it poses to the environment and human health. The rising quantities of e-waste in Ghana are driven by the demand for second-hand electronic products and secondary resources by refurbishment and dismantling outlets, which serve as an income-generating opportunity for people. The uncontrolled collection of end-of-life electronic equipment has compounded the e-waste problem, leading to major harm to the environment and human health [3].
The issue of proper e-waste management is critical to the protection of livelihoods, health, and the environment. It poses a serious challenge to modern societies and requires significant effort to address if sustainable development is to be achieved. One emerging livelihood pattern in Ghana is e-waste recycling, which was virtually unknown in the Ghanaian urban occupational vocabulary until a few years ago. Most of the time, e-waste is dumped, sold, or burnt to retrieve metals and other components. The complications commonly associated with e-waste are categorized as respiratory diseases, reproductive diseases, and skin diseases [4]. Some research works have been conducted to determine ways to reduce the effect of e-waste on the environment and human health [5], to estimate e-waste generation for management plans [6], [7], to predict the e-waste recovery scale [8], to predict the area of concentration of electronic devices that generate e-waste [9], and to produce a predictive model to screen informal e-waste recycling sites [10].
No research work has developed a model to predict the diseases associated with informal e-waste activities. This paper develops a model to predict the kind of disease that can affect people living at and/or working in e-waste sites. It uses the e-waste products in the human environment, how they are disposed of, and the activities of the people living and/or working in e-waste sites to predict their health hazards using a machine learning mechanism based on decision tree algorithms. A decision tree is a nonparametric supervised learning technique which can be used for classification and regression problems. It provides a tree-like structure for modelling problems. A decision tree can handle numerical and categorical data as well as multi-output problems. Prediction with a decision tree is fast: classifying one sample follows a single path from the root (level zero) to a leaf, so for a reasonably balanced tree the time cost is logarithmic in the number of nodes, since the branches not taken are skipped at every level. A decision tree can be regression-based or classification-based. With a regression-based decision tree, the target of the model is continuous; when the target is categorical, the decision tree is classification-based. For a better application of the decision tree, the features of the dataset should preferably be categorical or discrete [11].
Decision trees come with various algorithms. The commonly known algorithms are Iterative Dichotomiser 3 (ID3), C4.5, Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID) and Multivariate Adaptive Regression Splines (MARS) [11]. ID3 divides the data iteratively using information gain, and sometimes bagging ensemble attribute selection methods, to generate its model tree [12]. The J48 classifier is an implementation of C4.5, the successor of ID3. ID3 is susceptible to overfitting when the data is continuous, and it is usually supported with other methods [12]-[14].
C4.5 is considered better than ID3 as it can handle both discrete and continuous data. It uses information gain and bagging ensemble attribute selection methods to generate its model tree [15]. It addresses overfitting by pruning redundant branches and subtrees [11]. C4.5 has been used in many applications, such as classifying stunting in children [16] and assisting students' project supervision [17]. C4.5 has also been combined with other machine learning algorithms to develop efficient models. It was used with CART for an intrusion detection mechanism [18], with a neural network for diagnosing diseases associated with appendectomy patients to manage insurance payments [19], and with a rule-based mechanism for storing items in a warehouse [20].
CART is used for both classification and regression problems. It uses the Gini attribute selection method to generate its decision tree [11]. Like the other types, CART has been used for many applications, alone and with other decision tree algorithms [18], [21]. CHAID is an old method which is hardly used today. It uses the chi-square test and the F-test as the attribute selection methods for classification-based and regression-based problems respectively. Another method for generating decision trees is MARS, which is specifically used for regression-based problems with nonlinear data [11].
Decision trees are very useful when a problem must be understood clearly and all the possible causes leading to a decision must be observed. This study classifies the health issues of people living at and/or working in e-waste sites. It is a classification-based problem with categorical data; thus MARS is not considered. The suitable algorithms for this work are C4.5 and CART. ID3 was not used because it is prone to overfitting, and CHAID was not used because it is obsolete. The attribute selection measures for C4.5 and CART are information gain and the Gini index respectively. Unlike CART, C4.5 supports its decision-making with rules but cannot handle missing data. Both algorithms were used in this study. The results were compared and the better model was chosen for deployment.

A. Data Collection
The study focused on predicting the effects of e-waste on human health; it therefore concentrated on people living close to the e-waste site as well as e-waste workers. The research was carried out through a questionnaire. Data were collected from 187 respondents working and/or living at the e-waste dump site. Respondents below 18 years of age were excluded. Table I shows the features and the associated values expected from respondents after categorizing certain features.
TABLE I (rows recovered)
Alcohol consumption: None (0), Low (1), Moderate (2), High
Years in e-waste work: None (0), Low (1), Moderate (2), High

In Table I, the values in parentheses are the encoded values used for machine learning modelling, since machine learning modelling usually deals with numbers [11]. All the values have been categorized and they are measured on the nominal scale. Various categorizations of age groups have been proposed by researchers and organizations [22]-[25]; this study adopts the risk stratification age grouping [25] because the study is related to health issues. Thus, the age categorization has three values: youth, middle-aged and elderly. The pediatric group, from 0 to 14 years, was ignored. The early ages of youth, from 15 to 17 years, were also ignored in this study, so the youth value covers ages from 18 to 47 years. The middle-aged group is from 48 to 63 years old, whilst the elderly group consists of people at least 64 years old.
Regarding alcohol consumption, the categorization was based on the American Dietary Guidelines for daily consumption. The low category means less than two bottles of alcohol a day for males and less than one bottle for females. The moderate category means two bottles a day for males and one bottle for females. Any daily consumption above the moderate category is considered high [26].
The categorization of the number of years working in e-waste sites, living at e-waste dump sites, and burning e-waste products was based on the principles of lung healing after quitting smoking [27]. The category is low when the number of years is less than or equal to 5, moderate when it is more than 5 but less than or equal to 10, and high when it is more than 10.
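The categorization rules described so far can be sketched as small Python functions. This is only an illustration: the function names and the string labels are assumptions, and the study's actual preprocessing code is not given in the text. Labels are returned instead of numeric codes because the full encoding in Table I is not available here.

```python
def age_group(age):
    """Map an age in years to the study's age group (under-18s were excluded)."""
    if age < 18:
        return None  # pediatric and 15-17 groups were not studied
    if age <= 47:
        return "youth"
    if age <= 63:
        return "middle-aged"
    return "elderly"


def exposure_level(years):
    """Categorize years of working, living or burning at an e-waste site."""
    if years <= 0:
        return "none"
    if years <= 5:
        return "low"
    if years <= 10:
        return "moderate"
    return "high"


def alcohol_level(bottles_per_day, sex):
    """Categorize daily alcohol intake; the moderate amount depends on sex."""
    moderate = 2 if sex == "male" else 1  # bottles per day
    if bottles_per_day <= 0:
        return "none"
    if bottles_per_day < moderate:
        return "low"
    if bottles_per_day == moderate:
        return "moderate"
    return "high"


print(age_group(50), exposure_level(7), alcohol_level(1, "female"))
```

Keeping each rule in its own function makes the thresholds easy to audit against the cited guidelines.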
The health issue was broadly categorized as none, respiratory diseases, skin diseases, and reproductive diseases. Reproductive diseases involve diseases, complications or disorders such as infertility, menstrual disorder, spontaneous abortion, endometriosis, endometrial cancer, breast cancer, poor sperm motility, and low quality of semen, which are usually caused by lead and cadmium [28], [29]. Skin diseases, on the other hand, include burns, rashes, scars, lacerations, and itches [30]. Respiratory diseases, as defined by this study, consist of diseases and disorders such as lung function impairment, chronic obstructive pulmonary disease (COPD), asthma, breathlessness, and chest tightness, usually caused by substances such as mercury, arsenic, chromium, lead, polybrominated flame retardants, and barium [31], [32].

B. Algorithms Design and Implementation
Two decision tree algorithms, C4.5 and CART, were used as both algorithms qualified. The algorithm with the higher accuracy rate was then chosen. C4.5 in this study used the information gain attribute selection method, which depends on the entropy. The entropy, $E(S)$, measures the uncertainty or impurity of the data. Given $N$ as the number of classes in the data set with a total sample $S$, the probability of choosing a class $c_i$ is given as:

$$p_i = \frac{|c_i|}{|S|}$$

where $|c_i|$ is the total number of elements labelled as $c_i$. Then:

$$E(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

The entropy of each feature after splitting is computed. A small value of the entropy is always desired. The change in entropy is called the information gain (IG); thus, the larger the information gain, the better the split. IG is therefore computed as:

$$IG = \text{Entropy before split} - \text{Entropy after split}$$

Sometimes IG is biased, as it favours attributes with a high number of values. This problem is handled by calculating and using the Gain Ratio (GR) as:

$$GR = \frac{IG}{\text{Split Information}}$$

The split information refers to the entropy of the sub-dataset proportions.
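The entropy, information gain and gain ratio described above can be sketched in a few lines of Python. The function names and the toy attribute values are illustrative assumptions, not the study's custom C4.5 program.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy, in bits, of a list of categorical items."""
    total = len(items)
    # p * log2(1/p) summed over classes is the same as -sum(p * log2(p)).
    return sum((n / total) * log2(total / n) for n in Counter(items).values())


def partition(values, labels):
    """Group the class labels by the value of the splitting attribute."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(values, labels):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(labels)
    after = sum(len(g) / total * entropy(g) for g in partition(values, labels))
    return entropy(labels) - after


def gain_ratio(values, labels):
    """Information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the sub-dataset proportions
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Toy example (hypothetical data): the attribute separates the classes perfectly.
years = ["low", "low", "high", "high"]
health = ["none", "none", "skin", "skin"]
print(information_gain(years, health), gain_ratio(years, health))
```

An attribute with a single value produces a split information of zero, so the sketch returns a gain ratio of zero for it rather than dividing by zero.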
Overfitting in C4.5 is handled by pruning. This study used a chi-square test to prune unwanted parts of the decision tree. The chi-square test was employed since the dataset was categorical. The threshold, t, was obtained from the statistical table for a particular confidence level. This study set the confidence level to 95%. A custom program was developed for C4.5 in this study using Python.
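A chi-square pruning check of this kind might look like the sketch below. It assumes a split is pruned when its Pearson chi-square statistic does not exceed the tabulated threshold t. The exact rule used in the study's custom program is not given, so the function names and the decision rule are assumptions. The critical values are the standard 5% upper-tail values from a chi-square table.

```python
# Upper 5% critical values of the chi-square distribution, indexed by degrees
# of freedom (standard statistical table, i.e. a 95% confidence level).
CHI2_CRITICAL_95 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
    6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919,
}


def chi_square_statistic(table):
    """Pearson chi-square statistic of a contingency table.

    table[i][j] is the number of samples in child branch i with class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def should_prune(table):
    """Assumed rule: prune when the statistic does not exceed the threshold t."""
    rows = sum(1 for row in table if sum(row) > 0)
    cols = sum(1 for col in zip(*table) if sum(col) > 0)
    dof = (rows - 1) * (cols - 1)
    if dof == 0:
        return True  # the split separates nothing
    # Only degrees of freedom 1 to 9 are tabulated in this sketch.
    return chi_square_statistic(table) <= CHI2_CRITICAL_95[dof]


print(should_prune([[10, 0], [0, 10]]))  # informative split
print(should_prune([[5, 5], [5, 5]]))    # uninformative split
```

A table lookup is used instead of a statistics library to mirror the text, which takes t from a statistical table.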
The pseudocode for C4.5 ends with the pruning loop:

    ...
    Repeat until no more split can be removed
    Output the final and pruned decision tree
    Stop

CART, on the other hand, uses the Gini index (GI) as its attribute selection method, which measures the impurity of the data as:

$$GI = 1 - \sum_{i=1}^{N} p_i^2$$

where $p_i$ is the probability of choosing a class.
The attribute with the lowest GI value is used as the best attribute or feature to split on. In the implementation of the CART algorithm, the minimum Gini index is initially set to 1, so that it is replaced during the first iteration. Setting it to zero would be detrimental, since the least Gini index found may not be zero and the initial value would then never be replaced. The decision tree implementation in scikit-learn is based on the CART algorithm. It is used by importing DecisionTreeClassifier from the sklearn.tree module. Overfitting in CART is controlled through the depth of the tree, which is set by passing max_depth as a parameter to DecisionTreeClassifier. The deeper the tree, the more complex the model becomes and the more prone it is to overfitting. The depth of the tree is controlled in this study.
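The Gini-based attribute selection, including the initial minimum of 1, can be sketched as follows. This is a simplified illustration, not the study's code. It scores a multiway split on each categorical attribute, whereas CART itself builds binary splits, and the attribute names and records are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(values, labels):
    """Size-weighted Gini index of the subsets produced by splitting on an attribute."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(g) / total * gini_index(g) for g in groups.values())


def best_attribute(rows, labels):
    """Return (name, gini) of the attribute with the lowest weighted Gini index.

    rows is a list of dicts mapping attribute names to categorical values.
    """
    min_gini = 1.0  # the Gini index is always below 1, so this is replaced at once
    best = None
    for name in rows[0]:
        g = weighted_gini([row[name] for row in rows], labels)
        if g < min_gini:
            min_gini, best = g, name
    return best, min_gini


# Hypothetical records: burning e-waste separates the classes, sex does not.
rows = [
    {"burns_ewaste": "yes", "sex": "male"},
    {"burns_ewaste": "yes", "sex": "female"},
    {"burns_ewaste": "no", "sex": "male"},
    {"burns_ewaste": "no", "sex": "female"},
]
labels = ["respiratory", "respiratory", "none", "none"]
print(best_attribute(rows, labels))
```

Had min_gini started at 0, the comparison `g < min_gini` could never succeed and no attribute would be selected.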

III. RESULTS
The results of both the C4.5 and CART algorithms are presented and discussed in this section. The C4.5 model starts with the years spent working in an e-waste site. Fig. 1 shows a two-level structure of the decision tree. The rectangular shapes represent the dataset features or attributes, whereas the oval shapes represent the prediction of the health issues.
In Fig. 1, the model combined some of the values of the various attributes. For instance, the values of the root node (years in the e-waste site) were grouped into two: none/low and moderate/high. According to the model, people living in an e-waste dump site can be free from any of the categorized diseases when they do not eat food from the e-waste dump site and do not burn any e-waste product for metals. Any other activity is predicted to result in one or more of the categorized diseases, namely skin diseases, respiratory diseases and reproductive diseases. Table II shows the classification report obtained when the final, pruned model was tested with the test data. In Table II, the precision, examined for each class individually, is 1.00 (100%) for none, reproductive diseases, and respiratory diseases, meaning there were no false positives for these classes. Skin diseases have a precision of 0.88 (88%). The recall examines the completeness of the predictions for each class. Except for skin diseases, every class has a recall below 1.00 (100%) because of false negatives. The F1-score summarizes the quality of the predictions for each class by taking into account both precision and recall; better models have F1-scores approaching 1 (100%). The "None" class has the highest F1-score, 97%, whereas reproductive diseases have the lowest, 87%. The overall accuracy of the model was 91%.
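The per-class precision, recall and F1-score discussed above follow directly from the counts of true positives, false positives and false negatives. The sketch below shows the arithmetic on a small made-up set of labels, not the study's test data. The function name mirrors the usual report layout and is an assumption.

```python
def classification_report(y_true, y_pred):
    """Per-class precision, recall and F1-score, plus the overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
        fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
        fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return report, accuracy


# Made-up labels: one "none" sample is wrongly predicted as "skin".
y_true = ["none", "none", "skin", "skin", "respiratory"]
y_pred = ["none", "skin", "skin", "skin", "respiratory"]
report, accuracy = classification_report(y_true, y_pred)
for cls, scores in report.items():
    print(cls, scores)
print("accuracy", accuracy)
```

In this toy case the single misclassification gives "skin" a false positive (precision below 1, recall of 1) and "none" a false negative (precision of 1, recall below 1), the same pattern described for Table II.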
The CART algorithm was trained and tested by specifying the depth of the decision tree, starting from 1. The random_state of the train_test_split function was set to 1 to ensure that the dataset is always split into the same training and testing sets.
The training and testing sets were split in a 7:3 ratio of the total dataset. Table III shows the depth of the decision tree and the associated accuracy level. In Table III, there was no change in the accuracy level (82%) when the depth was set to 6 or more. Increasing the depth from 7 to the full tree depth unnecessarily increased the complexity of the model and its tendency to overfit. The optimal model for the CART algorithm was therefore obtained at depth 6. Fig. 2 shows the two-level structure of the decision tree generated by the CART algorithm.
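A depth sweep of this kind can be sketched with scikit-learn as shown below. The data here is synthetic and merely stands in for the encoded questionnaire responses. The number of features, the labelling rule and the range of depths are assumptions made for illustration, so the printed accuracies say nothing about the study's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded questionnaire data (NOT the study's data):
# 187 rows, 6 categorical features coded 0-3, and a 4-class label.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(187, 6))
y = (X[:, 0] + X[:, 1]) % 4  # hypothetical rule so the tree has something to learn

# 7:3 split with a fixed random_state so the partition is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Train one CART model per depth and record its accuracy on the test set.
accuracy_by_depth = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    accuracy_by_depth[depth] = model.score(X_test, y_test)

for depth, acc in accuracy_by_depth.items():
    print(f"max_depth={depth}: test accuracy={acc:.2f}")
```

The smallest depth after which the test accuracy stops improving is the natural choice, which is how depth 6 was selected in the text.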
In Fig. 2, the model combined some of the values of the various attributes, as done by the C4.5 algorithm. For instance, the values of the root node (years in the e-waste site) were grouped into two: none and low/moderate/high. According to the model, people can be free from any of the categorized diseases when they do not drink water from an e-waste dump site and do not live at an e-waste dump. However, if they choose to live at an e-waste dump site, they should not burn e-waste products for more than 5 years. Any other activity is predicted to result in one or more of the categorized diseases, namely skin diseases, respiratory diseases, and reproductive diseases. Table IV shows the classification report obtained when the final model was tested with the test data. In Table IV, only respiratory diseases had a precision of 1.00 (100%), indicating correct prediction without any false positive. All the classes recorded false negatives, since no class has a recall of 1.00 (100%). Only the F1-score of respiratory diseases, which takes into account both precision and recall, exceeded the overall accuracy of the model; all the other classes scored lower. The overall accuracy level of the model was 82%.

Comparing the accuracy levels of the C4.5 and CART algorithms, it can be deduced that the C4.5 algorithm produced a better model than the CART algorithm. Both decision trees started with the same root node (years spent in e-waste work) but generated different child nodes and different splitting criteria. The C4.5 model was deployed whereas the CART model was set aside.

IV. CONCLUSION
The study developed a model to predict the effect of e-waste on human health based on the activities of the people working and/or living at e-waste sites. Two decision tree algorithms, namely C4.5 and CART, were employed. The C4.5 model predicted with an accuracy level of 91%, which is attributed to the chi-squared test used to prune the unneeded parts of the decision tree. CART, on the other hand, predicted with an accuracy level of 82%. Though both algorithms started generating the decision tree with the same dataset attribute, the subsequent levels of the trees used different attributes. The C4.5 model, with the higher accuracy rate, was chosen for deployment. It is therefore recommended that whenever two or more algorithms can in principle be used to handle the same problem, they should all be used and their results compared. Future work will focus on predicting the effect of e-waste on the environment and on children.

TABLE II: CLASSIFICATION REPORT OF TEST DATA FOR C4.5 ALGORITHM

TABLE III: DECISION TREE DEPTH MANAGEMENT FOR CART ALGORITHM

TABLE IV: CLASSIFICATION REPORT OF TEST DATA FOR CART