Abstract
Introduction
The frequent occurrence of stroke has inspired researchers to investigate potential risk factors. Understanding the association between stroke and risk factors is important in understanding, preventing, and controlling the occurrence of stroke. An understanding of the early risk factors that lead to stroke, such as known hypertension, diabetes mellitus, and other common diseases, can lead to more accurate predictions of incidents of stroke. At present, multiple factors are used to diagnose strokes. However, if we are able to correlate risk factors, we can simplify the process of disease diagnosis and reduce medical costs.
A number of studies have been conducted on the risk factors of stroke. Zhang et al. found that coronary heart disease, hypertension, smoking, and obesity were all related to stroke occurrence [1]. Eman et al. investigated stroke risk factors in Egypt; their results showed that hypertension and diabetes mellitus are the chief risk factors of stroke [2]. Research has also indicated that atrial fibrillation increases the risk of stroke, and the occurrence rate of stroke is about 5% among atrial fibrillation patients [3, 4]. Knottnerus et al. found that family history of stroke is an independent risk factor for lacunar stroke [5].
Research on the risk factors of stroke has gradually deepened. Researchers have found that stroke risk factors include hypertension, diabetes mellitus, heart disease, dyslipidemia, smoking, excessive drinking, aging, and genetic factors [6, 7]. These risk factors can be divided into those that are fixed and those that are modifiable. Sex and age are fixed. Modifiable factors include biological factors (heart disease, hypertension, hyperlipidemia, diabetes mellitus, etc.) and behavioral factors (smoking, drinking, weight, depression, etc.).
This study analyzes the association between stroke and risk factors, to determine the combination of high risk factors that leads to strokes. Understanding how risk factors work in combination will improve prevention, early diagnosis, and early treatment.
This paper is organized as follows: in the next section, the research dataset and research methods are described in detail. In Section 3, the association rules are described, and an assessment of these rules is displayed. A discussion and conclusions are presented in Section 4.
Method
Participants
The data used in this study were obtained from a stroke screening and prevention investigation that took place in 2012 and were provided by the Chinese People’s Liberation Army General Hospital Clinical Data Center. The research subjects of the stroke screening database were from a cluster sample of 16 provinces, municipalities and autonomous regions throughout China (including Beijing, Tianjin, Henan Province, Heilongjiang Province, Xinjiang Uygur Autonomous Region, and Sichuan Province). Hospitals, community health service centers, and township health centers throughout China were used as intake points. Every intake unit selected one project-screening site in an urban community and one in a rural township. At each screening site, all residents who were 40 or older (born before December 31, 1973) were registered as screening subjects. Residents who lived outside of the screening site for more than half a year were excluded. The sixth national population census was used to determine the ratio of the number of urban communities to the number of rural townships. The stroke screening database consists of 1,196,422 screening subjects. All survey groups used the same questionnaire, the “Assessment form of paroxysm of high risk group and stroke patient recurrence risk.” Trained and qualified investigators filled in the form following a face-to-face interview with each participant. The information collected included basic demographic information (age, sex, nationality, and district), stroke risk screening items (hypertension, diabetes mellitus, atrial fibrillation, dyslipidemia, obesity, smoking, stroke family history, etc.), and other preliminary screening items.
Data set characteristics
Our study considered whether or not each participant had each of the following 8 stroke risk factors: hypertension, atrial fibrillation, dyslipidemia, diabetes mellitus, smoking, lack of exercise, overweight, and family history of stroke.
The criteria for the 8 factors were:
Hypertension: blood pressure
Diabetes mellitus: diagnosed by a doctor with diabetes mellitus or taking drugs prescribed for treating diabetes.
Atrial fibrillation: medical history of atrial fibrillation.
Dyslipidemia: triglyceride
Overweight: body mass index (BMI)
Smoking: smoking one or more cigarettes per day for at least one year, including past history and current smoking.
Lack of exercise: exercising fewer than three times per week, with exercise time
Family history of stroke: stroke in direct or collateral relatives within three generations of the patient.
Among 1,196,422 research subjects, 957,325 were chosen randomly as the training set for association analysis. The training set included 15,835 stroke patients (1.65% of the training set) and 941,490 without stroke (98.35%).
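The proportions quoted above can be re-derived from the reported counts. The following is a minimal check, using only the figures given in the text; the variable names are illustrative.

```python
# Counts reported in the text.
all_subjects = 1_196_422
training_total = 957_325
stroke_cases = 15_835
non_stroke = 941_490

# The two training subgroups should add up to the training set.
assert stroke_cases + non_stroke == training_total

stroke_share = stroke_cases / training_total
non_stroke_share = non_stroke / training_total
training_fraction = training_total / all_subjects

print(f"stroke: {stroke_share:.2%}, no stroke: {non_stroke_share:.2%}")
print(f"training set as a fraction of all subjects: {training_fraction:.1%}")
```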
Association rules
In 1993, R. Agrawal of the International Business Machines Corporation (IBM) Almaden Research Center first presented the mining of association rules between item sets in a customer transaction database [8]. The Apriori algorithm that followed from this work has become the classic algorithm for association rules analysis. Researchers have conducted numerous follow-up studies on association rules mining, including algorithm optimization and the expansion of application areas. As an important topic in data mining, association rules mining has received extensive attention, and has been widely applied in physical activities, business affairs, financial areas, medicine, and other fields.
The association rule algorithm involves two steps: first, all high frequency item sets are identified; then, association rules are generated from these high frequency item sets [9]. High frequency indicates that the frequency with which an item set occurs has reached or exceeded a certain level; this frequency is the support degree [10]. It is defined as follows:
$$\mathrm{Support}(A \Rightarrow B) = P(A \cup B)$$
where $P(A \cup B)$ is the proportion of records that contain both A and B.
When the support degree of {A, B} is greater than or equal to the minimum support degree, then {A, B} is put in the high frequency item group.
The second step of the association algorithm is the generation of association rules. For the high frequency item groups obtained in the first step, if a rule meets the minimum confidence degree, then the rule is an association rule. The confidence degree is defined as follows:
$$\mathrm{Confidence}(A \Rightarrow B) = P(B \mid A) = \frac{P(A \cup B)}{P(A)}$$
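As a rough illustration of the two measures, the sketch below computes the support degree and confidence degree over a handful of records. The records and the helper names `support` and `confidence` are hypothetical and are not taken from the screening database.

```python
from typing import FrozenSet, Iterable, List


def support(records: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(wanted <= record for record in records) / len(records)


def confidence(records: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(records, both) / support(records, antecedent)


# Hypothetical toy records (one set of observed factors per person).
records = [
    frozenset({"hypertension", "diabetes", "stroke"}),
    frozenset({"hypertension", "diabetes"}),
    frozenset({"hypertension"}),
    frozenset({"smoking"}),
]

s = support(records, {"hypertension", "diabetes"})       # 2 of the 4 records
c = confidence(records, {"hypertension"}, {"diabetes"})  # 2 of the 3 hypertension records
print(s, c)
```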
Common association rules algorithms include the Apriori, Generalized Rule Induction (GRI), and Frequent Pattern-tree (FP-tree) algorithms. The Apriori algorithm is the classical algorithm for mining the frequent item sets of Boolean association rules [11]. It uses a breadth-first search strategy, which exploits the downward closure property of support, to count the support of item sets, together with a candidate generation function. The Apriori algorithm uses a “bottom up” approach, where frequent subsets are extended by one item at a time. This step is known as candidate generation, and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. The core of the Apriori algorithm is the recursion of frequent item sets in two phases. These association rules belong to the monolayer, single-dimension, Boolean class. The Apriori algorithm can be summarized as follows: first, determine all of the frequent item sets that satisfy the minimum support degree; then, from these frequent item sets, generate strong association rules that meet both the minimum support degree and the minimum confidence degree. The expected rules can be acquired from the above-mentioned frequent item sets [12]. Each rule has only one item as its consequent.
The Apriori algorithm, which is a basic algorithm of association rules, is one of the ten classic algorithms in the field of data mining. The Apriori algorithm can be used to mine potential relationships among data items in various fields. In our research, we introduce the Apriori algorithm to stroke datasets to discover possible associations between stroke risk factors.
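The two phases described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names `apriori` and `single_consequent_rules` and the toy records are hypothetical, and no attempt is made at efficiency.

```python
from itertools import combinations


def apriori(records, min_support):
    """Return {item set: support} for every item set meeting min_support."""
    n = len(records)

    def supp(itemset):
        return sum(itemset <= record for record in records) / n

    # Level 1: frequent single items.
    items = {item for record in records for item in record}
    level = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update({s: supp(s) for s in level})
        k += 1
        # Candidate generation: join frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Downward closure: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
        level = {c for c in candidates if supp(c) >= min_support}
    return frequent


def single_consequent_rules(frequent, min_confidence):
    """Yield (antecedent, consequent item, confidence) for rules with one consequent item."""
    for itemset, s in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            conf = s / frequent[antecedent]  # antecedent is frequent by downward closure
            if conf >= min_confidence:
                yield antecedent, item, conf


# Hypothetical toy records, not taken from the screening database.
records = [
    frozenset({"hypertension", "diabetes", "stroke"}),
    frozenset({"hypertension", "diabetes"}),
    frozenset({"hypertension", "stroke"}),
    frozenset({"smoking"}),
]
frequent = apriori(records, min_support=0.5)
found = {(a, c): conf for a, c, conf in single_consequent_rules(frequent, 0.6)}
for (a, c), conf in found.items():
    print(sorted(a), "=>", c, round(conf, 2))
```

In this toy run the three-item candidate is pruned without being counted, because one of its two-item subsets is not frequent; this is the downward closure property at work.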
Occasionally, the support degree and confidence degree are insufficient for filtering out uninteresting rules. In these cases, a correlation measurement can be added to the association rules framework to resolve the issue. This yields the correlation rule, which includes not only the support degree and confidence degree, but also the correlation measurement between item sets A and B.
Researchers have studied many assessment metrics, even before frequent pattern mining was written about extensively. Some of the previously discovered model evaluation variables are still used frequently.
As a simple correlation measurement, lift is defined as follows: if $P(A \cup B) = P(A)P(B)$, then the occurrence of item set A is independent of the occurrence of item set B; otherwise, A and B are dependent and correlated. The lift between A and B is
$$\mathrm{Lift}(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$$
If the result of the formula is less than one, then the appearance of A and B are negatively correlated, that is, when one is present the other is likely to be absent. If the result of the formula is greater than one, then A and B are positively correlated, that is if one is present, the other is likely to be present. If the result of the formula is equal to one, then A and B are independent, and there is no correlation between the two.
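The three cases above can be illustrated with a small sketch. The probabilities are hypothetical, and the helper names `lift` and `interpret` are introduced only for this example.

```python
def lift(p_a: float, p_b: float, p_ab: float) -> float:
    """Lift of A and B: joint probability divided by the product of the marginals."""
    return p_ab / (p_a * p_b)


def interpret(value: float, tol: float = 1e-9) -> str:
    """Map a lift value to the three cases described in the text."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"


# Hypothetical probabilities: P(A) = 0.4, P(B) = 0.5, with three joint probabilities.
for p_ab in (0.3, 0.2, 0.1):
    value = lift(0.4, 0.5, p_ab)
    print(p_ab, round(value, 2), interpret(value))
```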
For the association rules model of the stroke risk factors (“suffering hypertension or not” (dfHypertension
For the association rules models between risk factors, each of the above-mentioned 8 stroke risk factors is set as both a former item (antecedent) and a consequent. Likewise, the maximum number of former items is set to 8.
Results
All possible association rules between stroke and risk factors
In the initial experiment, in order to find all possible association rules, the minimum support degree was set to 0.0%, and the minimum confidence degree was set to 1.0%. After executing the association model, 256 association rules were obtained, including 70 rules whose confidence degree is greater than 50%.
Table 1 shows how these 256 association rules were organized. Among all the rules, when the number of former items is 8, the confidence degree is at its greatest, 86.03%. The maximum confidence degree of each association rule decreased as the number of former items decreased: when the number of former items was equal to 7, the maximum confidence degree was 85.66%; when the number of former items was equal to 6, the maximum confidence degree was 83.66%; when the number of former items was equal to 5, the maximum confidence degree was 78.48%; when the number of former items was equal to 4, the maximum confidence degree was 69.05%; when the number of former items was less than or equal to 3, the confidence degree of all rules was less than 50%, of which the maximum confidence degree was 42.87%.
Association rules between risk factors and stroke (consequent: suffering from stroke)
Most rules whose confidence degree is greater than 50% are rules in which the number of former items is 6 or 5. These two groups contain 25 rules and 29 rules, which together account for 77.14% (54 of 70) of the rules whose confidence degree is greater than 50%. The above experimental results indicate that when a person has more risk factors of stroke, he or she is at increased risk for stroke.
When the number of former items is fixed, the number of combinations formed by the 8 risk factors is fairly large. For instance, when the number of former items is equal to 5, the possible number of combinations is $\binom{8}{5} = 56$.
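The number of ways to choose a fixed number of former items from the 8 risk factors is a binomial coefficient, which can be tabulated as a quick check. This sketch only counts combinations of factors; it assumes nothing about which combinations actually occur in the data.

```python
from math import comb

n_factors = 8

# Number of distinct combinations of k risk factors chosen from 8.
counts = {k: comb(n_factors, k) for k in range(1, n_factors + 1)}

for k, count in counts.items():
    print(f"{k} former items: {count} combinations")
```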
To find the meaningful association rules, we divided the 256 association rules into 8 groups by the number of former items, from 1 to 8. The confidence degree indicates the probability that the consequent (suffering stroke) takes place in instances where the former items are present. Therefore, the rule with the highest confidence degree in each group is viewed as the meaningful rule. Altogether, 8 rules are shown in Table 2. These 8 rules mean that, when the number of former items is fixed, a person whose screening result matches one of the following rules has a higher risk of suffering stroke than others with the same number of risk factors.
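The selection step described above, keeping the highest-confidence rule within each group of rules that share the same number of former items, can be sketched as follows. The rules and confidence values are hypothetical placeholders (factors named "A", "B", "C"), not the rules reported in this study.

```python
def best_rule_per_size(rules):
    """Return {number of former items: (former items, confidence)} with the top rule per group."""
    best = {}
    for former_items, conf in rules:
        size = len(former_items)
        if size not in best or conf > best[size][1]:
            best[size] = (former_items, conf)
    return best


# Hypothetical rules: (former items, confidence for the consequent "stroke").
example_rules = [
    (("A",), 0.07),
    (("B",), 0.05),
    (("A", "B"), 0.20),
    (("A", "C"), 0.26),
    (("A", "B", "C"), 0.43),
]

best = best_rule_per_size(example_rules)
for size in sorted(best):
    print(size, best[size])
```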
Meaningful association rules between risk factors and stroke (consequent: suffering from stroke)
Meaningful association rules among risk factors (1) (consequent: hypertension)
When the screening results contain relatively more risk factors, the individual has a high risk of suffering from stroke. When the screening results contain relatively fewer factors, the individual has a lower risk. When the screening results contain 3–5 risk factors, the risk of suffering stroke is difficult to estimate by intuition; this is especially true for rules 7–8. For screening and prevention of stroke, rules 4–6 as listed in Table 2 are more meaningful. Rule 4: Among individuals who possess 5 risk factors, those who have atrial fibrillation, diabetes mellitus, a smoking history, an elevated body mass index, and a family history of stroke have a 78.48% probability of stroke. Rule 5: Among individuals possessing 4 risk factors, those who have atrial fibrillation, diabetes mellitus, a smoking history, and a family history of stroke have a high risk of stroke, with a probability of 69.05%. Rule 6: Among individuals with 3 risk factors, those who have a family history of stroke, atrial fibrillation, and diabetes mellitus have a 42.87% probability of stroke.
In the initial experiment on association rules among the risk factors themselves, the minimum support degree was set to 0.0%, and the minimum confidence degree was set to 1.0%. After executing the association model, 1,016 association rules were found, including 407 rules whose confidence degree was greater than 50%. The minimum support degree was then set to 0.3% and the minimum confidence degree to 50%, and the association model found 42 association rules. After that, the minimum support degree was set to 0.5% and the minimum confidence degree to 50%, and 25 association rules were found.
The 25 rules were split into two classes: one with hypertension as the consequent, as shown in Table 3, and the other with atrial fibrillation as the consequent, as shown in Table 4. Within each table, the rules are ordered by confidence degree from largest to smallest.
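The effect of tightening the thresholds can be illustrated with a small filter over candidate rules. This is a sketch only: the rules, their support and confidence values, and the helper `filter_rules` are invented for illustration and do not reproduce the rule counts reported above.

```python
def filter_rules(rules, min_support, min_confidence):
    """Keep rules whose support and confidence both meet the thresholds."""
    return [r for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]


# Hypothetical candidate rules; values are invented for illustration only.
candidate_rules = [
    {"former": ("A", "B"), "consequent": "C", "support": 0.012,  "confidence": 0.72},
    {"former": ("A",),     "consequent": "C", "support": 0.004,  "confidence": 0.61},
    {"former": ("B", "D"), "consequent": "C", "support": 0.006,  "confidence": 0.44},
    {"former": ("D",),     "consequent": "C", "support": 0.0005, "confidence": 0.55},
]

# Raising the minimum support at a fixed 50% minimum confidence shrinks the rule set.
kept_counts = {}
for min_support in (0.0, 0.003, 0.005):
    kept_counts[min_support] = len(filter_rules(candidate_rules, min_support, 0.50))
    print(f"min support {min_support:.1%}: {kept_counts[min_support]} rules kept")
```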
Meaningful association rules among risk factors (2) (consequent: atrial fibrillation)
The results displayed in Table 3 show that individuals who simultaneously have diabetes mellitus, dyslipidemia, and obvious overweight, have a 71.62% probability of having high blood pressure. The second highest probability of suffering from hypertension is 68.26%, which occurs in those who have dyslipidemia, obvious overweight and a family history of stroke.
The results in Table 4 show that individuals who have diabetes mellitus, obvious overweight, and hypertension are most likely to experience atrial fibrillation (63.28% probability). The group with the second largest probability of experiencing atrial fibrillation (60.94%) consists of those who have hypertension, obvious overweight, and a family history of stroke.
This population-based study of Chinese adults aged 40 years and over found that the most important stroke risk factor is atrial fibrillation, followed by diabetes mellitus, and family history of stroke.
In our training set, 1.65% of the respondents were stroke patients. According to the last association rule in Table 2, with atrial fibrillation as the former item and occurrence of stroke as the consequent, the confidence degree was 7.22%, which is about 4.4 times the prevalence rate in the whole training set. Research has shown that atrial fibrillation is one of the highest risk factors of stroke, and is especially common in the elderly. It has previously been shown that the occurrence rate of cerebral arterial thrombosis among atrial fibrillation patients is five to seven times higher than among persons who do not have atrial fibrillation [13]. This is similar to the results of our study. Some scholars have proposed that, because the hearts of atrial fibrillation patients may beat quickly and in a disorderly fashion, blood in the atrium cannot be pumped out entirely, and the resulting stasis of blood in the atrium leads to thrombus formation. After the thrombus breaks off and moves to the brain through the bloodstream, it blocks a blood vessel and eventually causes a cerebral arterial thrombosis [14].
Changes in insulin and plasma lipoprotein levels and disordered glucose metabolism caused by diabetes mellitus may contribute to the formation of arteriosclerosis and thrombi, which may explain why diabetes mellitus has been found to be one of the highest risk factors of stroke. One third of acute stroke patients suffered from diabetes mellitus; these patients tend to be young and female [15]. The results of this paper are in accordance with the above conclusion.
Table 5 shows further analysis of the association rules among the risk factors. Among the 12 association rules with hypertension as the consequent, dyslipidemia appeared as a former item 8 times, and diabetes mellitus and obvious overweight each appeared as a former item 6 times. With atrial fibrillation as the consequent, there were 13 interesting association rules, in which hypertension appeared as a former item 9 times, lack of exercise 7 times, and obvious overweight 6 times. The above results indicate that dyslipidemia, diabetes mellitus, and obvious overweight are the factors most associated with hypertension, while hypertension, lack of exercise, and obvious overweight are the factors most associated with atrial fibrillation.
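The comparison between the rule's confidence degree and the overall stroke prevalence is the same ratio as the lift of the rule, since lift can be written as confidence divided by the probability of the consequent. A minimal check using the two percentages quoted in the text:

```python
# Figures quoted in the text.
confidence_af_to_stroke = 0.0722   # confidence of "atrial fibrillation => stroke"
stroke_prevalence = 0.0165         # share of stroke patients in the training set

# Lift expressed as confidence divided by the probability of the consequent.
ratio = confidence_af_to_stroke / stroke_prevalence
print(f"ratio: {ratio:.2f}")
```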
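The kind of tally described above, counting how often each factor appears as a former item among rules that share a consequent, can be sketched with a counter. The three rules below are hypothetical and are not the rules from Tables 3 and 4.

```python
from collections import Counter


def former_item_counts(rules):
    """Count how many rules each factor appears in as a former item."""
    return Counter(item for former_items, _ in rules for item in former_items)


# Hypothetical rules: (former items, consequent).
example_rules = [
    (("dyslipidemia", "diabetes mellitus"), "hypertension"),
    (("dyslipidemia", "overweight"), "hypertension"),
    (("diabetes mellitus", "overweight", "dyslipidemia"), "hypertension"),
]

tally = former_item_counts(example_rules)
for factor, count in tally.most_common():
    print(factor, count)
```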
Summary of meaningful association rules among risk factors
A large number of medical studies have verified that effective interventions exist for hypertension, and that among the prime risk factors it is the easiest to intervene on and the one for which intervention is most effective. If blood pressure can be lowered to a reasonable level, the morbidity of stroke will drop by at least 35%–38% [16].
Prevention of and intervention on high risk factors can greatly reduce stroke morbidity. The most direct means of prevention is lifestyle change, such as eating a healthy diet with little oil and salt, taking regular physical exercise, and reducing smoking and drinking.
Conflict of interest
None to report.
