4.0 CHAPTER FOUR: DATA ANALYSIS

4.1 Introduction

Data analysis is the core of an evidence-based research study. This chapter applies the statistical and data analysis techniques needed to extract useful information from the data and shed more light on the topic under study. The choice of analysis nonetheless depends on several elements of the research study, such as the study’s objectives, the research questions, the data available, the way results are to be presented, and the general purpose of the study. Together, these factors determine the appropriate statistical techniques to apply. In practice, data analysis is not a linear process that starts at one point and ends at another; instead, it involves moving back and forth between steps, altering variables, and assessing the effects in order to identify feasible models and tests (Pallant, 2020).

Research data exist in two forms: primary data and secondary data. Primary data are first-hand information that has not been published in any form. Such data can be obtained from surveys, interviews, experiments, and observations, and they are considered more valuable because they have not been altered. However, obtaining first-hand information is expensive and time-consuming, since considerable resources are spent on conducting experiments, interviews, and the like. Secondary data, on the other hand, are data already available in books, journals, social media, and other records. In most cases, these data have already been synthesized and the useful information published in books and other sources. Secondary data are easier to obtain than primary data because no experiment or survey is needed.

In this research study, primary data from a survey by the National Institutes of Health, U.S. Department of Health and Human Services, will be used for data analysis. The survey aimed at collecting health information from U.S. residents concerning their economic, social, and health lives. These three perspectives form the basis of the data analysis (Ezzy, 2013). The dependent variable, the explanatory variables, and the confounding variables will be extracted from these three aspects of life. Social, economic, and health life interact naturally, creating associations and cause-and-effect relationships. A comprehensive analysis is therefore required to identify, measure, and quantify these associations and their outcomes.

4.2 Data Description

Data description involves an in-depth understanding of the data and its features. Data features are the variables that constitute the data. Variables in a data set can be continuous, categorical, nominal, or string variables (Ott & Longnecker, 2015). The dataset contains all of these variable types and is therefore diverse in composition. The size of the data, commonly referred to as the sample size, is the total number of rows in the data. In a survey such as this one, the sample size is the total number of respondents who took part. The dataset contains 3504 cases and 449 variables, that is, 3504 rows and 449 columns; this is the dimension of the data. Data analysis also involves exploring the variables to identify those essential to the study and to disregard the less useful ones. Not all variables in the data are useful, so a comprehensive understanding of the data and of the study’s objectives is required to choose the relevant variables.

4.2 Data Transformation

Data cleaning is the most time-consuming and tedious process in data analysis. Cleaning the data involves creating new variables, density transformation, coding and recoding the data, labeling the data, and handling missing data, among other tasks. All of these steps should be completed before the analysis itself begins. In most cases, generating a new variable by calculating total scores in SPSS is desirable, particularly when several variables measure the same attribute (Pallant, 2020). A population attribute can be measured by one or more parameters, and aggregating these parameters provides a more accurate estimate of the attribute. The dependent variable in this research study, knowledge of HPV, is measured by several variables in section L of the survey, labeled HPV awareness. This implies that several items jointly determine and measure the respondent’s awareness.

Here, we need to compute the total score of the awareness variables and derive a categorical variable from it. The categorical variable will have two categories: HPV aware and HPV unaware. Let us call the variable

The items are negatively coded, so the higher the score, the less knowledgeable the respondent. Since all the variables are coded in the same direction, reverse coding is not necessary; where both negatively and positively coded items are present, reverse coding is needed so that all items share one direction. The explanatory variables, age, education level, and socio-economic level, are each measured by only one variable, so total scores will not be calculated for them. The presence of confounding effects in a model cannot be ruled out and should always be accounted for. Two variables acting as confounders will be investigated for their effects on the model. In other words, how do these variables affect the association between HPV knowledge and the economic and social factors?

The social beliefs variable will be computed from the social beliefs items in section N of the survey by calculating total scores.
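The total-score step described in this section can be sketched in Python as follows. This is only an illustration: the study performs the step in SPSS, and the item values, the 1-to-4 response scale, the cut-off, and all function names are hypothetical assumptions rather than details taken from the survey.

```python
# Minimal sketch of total-score computation and recoding.
# Item values, the 1-4 scale, and the cut-off are hypothetical.

def reverse_code(value, low=1, high=4):
    """Flip an item so that its direction matches the other items."""
    return low + high - value

def total_score(items, reversed_positions=()):
    """Sum the item responses, reverse-coding the listed positions."""
    return sum(
        reverse_code(v) if i in reversed_positions else v
        for i, v in enumerate(items)
    )

def awareness_category(score, cutoff):
    """Negative coding: a higher total means less knowledge."""
    return "HPV unaware" if score > cutoff else "HPV aware"

# Three hypothetical respondents with four awareness items each.
responses = [[1, 1, 2, 1], [4, 3, 4, 4], [2, 2, 3, 2]]
scores = [total_score(r) for r in responses]
labels = [awareness_category(s, cutoff=10) for s in scores]
print(scores)
print(labels)
```

Because the awareness items are all coded in the same direction, `reversed_positions` is left empty here; it would only be needed for a scale that mixes negatively and positively coded items.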

4.3 Descriptive Statistics

Descriptive statistics involve computing summary measures that describe the data from a broader perspective than mere visual inspection. Means, medians, variances, correlations, measures of skewness, graphs, charts, and tables are some of the techniques used in descriptive statistics. Additionally, hypothesis formulation is grounded in the descriptive analysis.

4.3.1 Descriptive statistics for the dependent variable

Descriptive statistics assist the researcher in comprehensively understanding the data through data exploration. They can be divided into categories depending on the data type and the nature of the statistics being computed. For categorical data, descriptive statistics involve frequency tables, cross-tabulation tables, charts, and graphs. For continuous data, they involve means, variances, histograms, box plots, and similar summaries. All these statistics and charts aid in understanding the data at hand (Agresti, 2018).

Table 1.0 below presents a frequency table of the dependent variable; its distribution is also shown in Fig 1.0 and Fig 1.1. The two substantive categories are fairly balanced in size. 58.5% of the respondents indicated that they had heard of HPV; in other words, 58.5% of the respondents are aware of HPV. On the other hand, 40.1% indicated that they had never heard of HPV, and 1.4% of the responses were missing.

Table 1.0: Descriptive statistics for the dependent variable

L1. Have you ever heard of HPV? HPV stands for Human Papillomavirus. It is not HIV, HSV, or herpes.
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Missing data (Not Ascertained)	50	1.4	1.4	1.4
	Yes	2050	58.5	58.5	59.9
	No	1404	40.1	40.1	100.0
	Total	3504	100.0	100.0
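The percentage columns of a frequency table such as Table 1.0 follow directly from the counts. The sketch below recomputes them in Python from the frequencies reported above; the function name and layout are illustrative assumptions and not the SPSS procedure used in the study.

```python
# Rebuild the Percent and Cumulative Percent columns from raw counts.
# The counts are those reported in Table 1.0; the function is a sketch.

def frequency_table(counts):
    """Return (label, frequency, percent, cumulative percent) rows."""
    total = sum(counts.values())
    rows, running = [], 0
    for label, n in counts.items():
        running += n
        rows.append((
            label,
            n,
            round(100 * n / total, 1),
            round(100 * running / total, 1),
        ))
    return rows

counts = {
    "Missing data (Not Ascertained)": 50,
    "Yes": 2050,
    "No": 1404,
}
table = frequency_table(counts)
for row in table:
    print(row)
```

Since no cases are excluded, the Valid Percent column equals the Percent column in this table.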

Fig 1.0: A bar graph of the dependent variable

Fig 1.1: A bar graph of the dependent variable

4.3.2 Descriptive statistics for the explanatory variables

Fig 1.2: A bar graph of INCOMERANGES

Table 1.1: Descriptive statistics for Occupational status

O2. What is your current occupational status?
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Missing data (Not Ascertained)	43	1.2	1.2	1.2
	Multiple responses selected in error	66	1.9	1.9	3.1
	Employed	1696	48.4	48.4	51.5
	Unemployed	115	3.3	3.3	54.8
	Homemaker	161	4.6	4.6	59.4
	Student	55	1.6	1.6	61.0
	Retired	1113	31.8	31.8	92.7
	Disabled	233	6.6	6.6	99.4
	Other – Specify	22	.6	.6	100.0
	Total	3504	100.0	100.0

Table 1.2: Descriptive statistics for Education Level

EDUCB. What is the highest level of school you completed? 5 Levels (Derived from Education; see History Document for mo
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Missing Data (Not Ascertained)	51	1.5	1.5	1.5
	Less than High School	275	7.8	7.8	9.3
	High School Graduate	631	18.0	18.0	27.3
	Some College	1039	29.7	29.7	57.0
	Bachelor’s Degree	910	26.0	26.0	82.9
	Post-Baccalaureate Degree	598	17.1	17.1	100.0
	Total	3504	100.0	100.0

Table 1.3: Descriptive statistics for the source of information

A2. The most recent time you looked for information about health or medical topics, where did you go first?
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Missing data (Not Ascertained)	13	.4	.4	.4
	Missing data (Filter Missing)	10	.3	.3	.7
	Multiple responses selected in error	385	11.0	11.0	11.6
	Question answered in error (Commission Error)	64	1.8	1.8	13.5
	Inapplicable, coded 2 in SeekHealthInfo	644	18.4	18.4	31.8
	Books	88	2.5	2.5	34.4
	Brochures and pamphlets	87	2.5	2.5	36.8
	Cancer organization	11	.3	.3	37.2
	Family	64	1.8	1.8	39.0
	Friend/Co-worker	25	.7	.7	39.7
	Doctor or health care provider	390	11.1	11.1	50.8
	Internet	1664	47.5	47.5	98.3
	Library	9	.3	.3	98.6
	Magazines	18	.5	.5	99.1
	Newspapers	6	.2	.2	99.3
	Telephone information number	20	.6	.6	99.8
	Complementary, alternative, or unconventional practitioner	6	.2	.2	100.0
	Total	3504	100.0	100.0

Table 1.4: Descriptive statistics for beliefs about cancer

N2. How easy is it for you to imagine yourself developing cancer in the future?
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Missing data (Not Ascertained)	84	2.4	2.4	2.4
	Missing data (Filter Missing)	13	.4	.4	2.8
	Multiple responses selected in error	1	.0	.0	2.8
	Question answered in error (Commission Error)	249	7.1	7.1	9.9
	Inapplicable, coded 1 in EverHadCancer	344	9.8	9.8	19.7
	Extremely difficult	509	14.5	14.5	34.2
	Somewhat difficult	630	18.0	18.0	52.2
	Neither difficult nor easy	1157	33.0	33.0	85.2
	Somewhat easy	398	11.4	11.4	96.6
	Extremely easy	119	3.4	3.4	100.0
	Total	3504	100.0	100.0

Fig 1.3 A bar graph of ImagineCancer

Table 1.4.1: Descriptive statistics for beliefs about cancer

N1. How worried are you about getting cancer?
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Missing data (Not Ascertained)	52	1.5	1.5	1.5
	Missing data (Filter Missing)	13	.4	.4	1.9
	Multiple responses selected in error	2	.1	.1	1.9
	Question answered in error (Commission Error)	254	7.2	7.2	9.2
	Inapplicable, coded 1 in EverHadCancer	339	9.7	9.7	18.8
	Not at all	627	17.9	17.9	36.7
	Slightly	783	22.3	22.3	59.1
	Somewhat	830	23.7	23.7	82.8
	Moderately	418	11.9	11.9	94.7
	Extremely	186	5.3	5.3	100.0
	Total	3504	100.0	100.0

Fig 1.4 A bar graph of FreqWorryCancer

Table 1.5: Descriptive statistics for Age

Descriptive Statistics
	N	Minimum	Maximum	Mean	Std. Deviation
O1. What is your Age?	3417	18	97	57.02	16.729
Valid N (listwise)	3417
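The quantities in Table 1.5 (N, minimum, maximum, mean, and standard deviation) are computed on valid cases only, which is why N is smaller than the full sample. A minimal Python sketch of that calculation is shown below; the ages are hypothetical values, and representing a missing response as `None` is an assumption made for illustration.

```python
import statistics

def describe(values):
    """Summarise a continuous variable, ignoring missing values (None)."""
    valid = [v for v in values if v is not None]
    return {
        "N": len(valid),
        "Minimum": min(valid),
        "Maximum": max(valid),
        "Mean": statistics.mean(valid),
        # Sample standard deviation (n - 1 in the denominator).
        "Std. Deviation": statistics.stdev(valid),
    }

# Hypothetical ages; None marks a respondent who did not report an age.
ages = [18, 25, None, 40, 57, 63, None, 97]
summary = describe(ages)
print(summary)
```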

Fig 1.5: Histogram of variable Age

4.4 Chi-square test for independence

The Pearson chi-square test, commonly known as the chi-square test for independence, is used to investigate whether an association exists between two categorical variables. The chi-square test evaluates the following hypotheses:

H0: The categorical variables are independent (no association).

H1: The categorical variables are not independent (an association exists).

Several assumptions must hold to ensure the validity of the chi-square test of independence. These include that the variables must be categorical (nominal or ordinal) and that each variable should have two or more independent categories. To investigate the association between knowledge of HPV and education level, a chi-square test of independence was conducted. The resulting likelihood ratio statistic was 370.572, with a corresponding p-value below 0.001 (reported as .000). This implies that we reject the null hypothesis of no association; that is, there is a statistically significant association between knowledge of HPV and education level.

Table 1.6: Chi-square test for independence for the dependent variable and education level

Chi-Square Tests
	Value	df	Asymptotic Significance (2-sided)
Pearson Chi-Square	495.376^a	10	.000
Likelihood Ratio	370.572	10	.000
Linear-by-Linear Association	41.334	1	.000
N of Valid Cases	3504
a. 2 cells (11.1%) have an expected count less than 5. The minimum expected count is .73.
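The statistics reported in the Chi-Square Tests output can be sketched in Python as below. The example computes the Pearson chi-square, the likelihood ratio statistic, the degrees of freedom, and the number of cells with an expected count below 5. The 2-by-2 counts are hypothetical because the underlying cross-tabulation is not reproduced in this chapter, and the p-value is not computed here since that would require the chi-square distribution function.

```python
import math

def chi_square_independence(observed):
    """Pearson and likelihood-ratio statistics for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    pearson = 0.0
    likelihood_ratio = 0.0
    for o_row, e_row in zip(observed, expected):
        for o, e in zip(o_row, e_row):
            pearson += (o - e) ** 2 / e
            if o > 0:  # an empty cell contributes 0 to the ratio statistic
                likelihood_ratio += 2 * o * math.log(o / e)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    small_cells = sum(1 for row in expected for e in row if e < 5)
    return {
        "pearson": pearson,
        "likelihood_ratio": likelihood_ratio,
        "df": df,
        "cells_expected_below_5": small_cells,
    }

# Hypothetical counts: rows are two groups, columns are aware / unaware.
result = chi_square_independence([[10, 20], [30, 40]])
print(result)
```

For a table with 6 rows and 3 columns, the same formula gives (6 - 1)(3 - 1) = 10 degrees of freedom and 18 cells, which is consistent with the df of 10 and the "2 cells (11.1%)" note reported in Table 1.6.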

Table 1.7: Chi-square test for independence for the variable HEARHDPV and INCOMERANGES

Chi-Square Tests
	Value	df	Asymptotic Significance (2-sided)
Pearson Chi-Square	249.089^a	18	.000
Likelihood Ratio	237.943	18	.000
Linear-by-Linear Association	20.597	1	.000
N of Valid Cases	3504
a. 4 cells (13.3%) have an expected count less than 5. The minimum expected count is 2.53.

4.5 Model Building

4.5.1 Logistic Regression

In research projects, model building involves fitting statistical models to express the relationship between the variables in the data. The main goal of model building is prediction, where the dependent variable is predicted from the explanatory variables. In this case, a binary logistic regression model was fitted to model knowledge of HPV as a function of age, education level, socio-economic factors, social beliefs about cancer, and related concepts. The dependent variable in this research project is a categorical variable with two categories, and thus a logistic regression model was the most appropriate (Allison, 2012).

In order to investigate the effects of the socio-economic variables and social beliefs about cancer, three models were fitted, excluding the confounding factors one at a time and examining the effects. The confounding factors in this case are the socio-economic factors and social beliefs about cancer.

Table 1.8: Descriptive statistics for the logistic regression model

Model Summary
Step	-2 Log likelihood	Cox & Snell R Square	Nagelkerke R Square
1	3865.298^a	.535	.580
a. Estimation terminated at iteration number 20 because maximum iterations have been reached. The final solution cannot be found.

Table 1.8 above presents the model summary of a logistic regression fitted with all the explanatory variables. The model was statistically significant, with a Nagelkerke R square of 0.580. This indicates that the explanatory variables account for about 58% of the variation in HPV awareness. All the variables were statistically significant, with corresponding p-values below 0.001. Significant variables are those that contribute substantially to the overall model and cannot be omitted from it.
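The two R square measures in the model summary are derived from the -2 log likelihood of the fitted model and of an intercept-only (null) model. The Python sketch below shows the standard Cox & Snell and Nagelkerke formulas. The null-model value is not reported in this chapter, so the numbers used here are hypothetical and are not intended to reproduce Table 1.8.

```python
import math

def cox_snell_r2(dev_null, dev_model, n):
    """Cox & Snell R square from the -2 log likelihoods and sample size."""
    return 1 - math.exp((dev_model - dev_null) / n)

def nagelkerke_r2(dev_null, dev_model, n):
    """Cox & Snell R square rescaled so that its maximum is 1."""
    max_r2 = 1 - math.exp(-dev_null / n)
    return cox_snell_r2(dev_null, dev_model, n) / max_r2

# Hypothetical values: -2LL of the null model, -2LL of the fitted model, n.
cs = cox_snell_r2(1000.0, 800.0, 1000)
nk = nagelkerke_r2(1000.0, 800.0, 1000)
print(round(cs, 3), round(nk, 3))
```

The Nagelkerke value is always at least as large as the Cox & Snell value, which matches the ordering of the two columns in Table 1.8.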

Table 1.9: Model significance

Hosmer and Lemeshow Test
Step	Chi-square	df	Sig.
1	17.865	8	.022
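As background for Table 1.9, the sketch below shows one common way to compute the Hosmer and Lemeshow statistic in Python: cases are sorted by predicted probability, split into groups, and the observed and expected numbers of events are compared within each group. With 10 groups the test has 10 - 2 = 8 degrees of freedom, which agrees with the df in the table. The data are hypothetical and the grouping rule is an assumption; SPSS may form the groups slightly differently. Small significance values in this test indicate that the observed and predicted event rates disagree.

```python
def hosmer_lemeshow(outcomes, probabilities, groups=10):
    """Return the Hosmer-Lemeshow chi-square statistic and its df."""
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        mean_p = expected / size
        statistic += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return statistic, groups - 2

# Hypothetical predicted probabilities and 0/1 outcomes for 200 cases.
probabilities = [(i + 0.5) / 200 for i in range(200)]
outcomes = [1 if (i * 7) % 200 < i else 0 for i in range(200)]
statistic, df = hosmer_lemeshow(outcomes, probabilities)
print(round(statistic, 3), df)
```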

5.0 CHAPTER FIVE: RESULTS DISCUSSION AND INTERPRETATION

5.1 Introduction

In this chapter, we present the data analysis results, discussion, and findings. The hypotheses stated when formulating the objectives and research questions will also be investigated and answered. We focus mainly on the results from the logistic regression model, the associations between variables, and the descriptive statistics of the variables. Data analysis showed that there is a statistically significant association between HPV awareness and age, level of education, socio-economic status, source of information, and personal beliefs. The fitted logistic regression produced an R square value of 0.58, indicating that the model could account for about 58% of the variation in HPV awareness, as explained by both the mediating and explanatory variables.

Logistic regression presents the overall significance of the model. Additionally, the significance of an individual variable can be assessed through its coefficient and the corresponding significance value. By inspecting these coefficients, we can identify the most significant variables that predict HPV awareness. Less significant variables can also be identified and omitted from the final model, since they contribute little useful information. On the other hand, to investigate the association between two categorical variables, the chi-square test of independence is applied. Results from these tests will also be presented.

5.2 Results interpretation

Model evaluation involves investigating model accuracy, consistency, and validity. Statistical models used for prediction are prone to errors resulting from differences in the research data. For this reason, assessing the model accuracy is required. In logistic regression, the model accuracy and validity are assessed by the R square value and the significance of the predictor variables. The R square value was above 50%, and all the explanatory variables were statistically significant. This implies that the model was statistically valid and consistent.

The Hosmer and Lemeshow test, Table 1.9 in the data analysis chapter, assesses how well the model fits the data. A chi-square value of 17.865 with a significance level of 0.022 was observed. Because this test takes good fit as its null hypothesis, a significance level below 0.05 points to some lack of fit, so the model’s fit should be interpreted with caution. The validity of the model can also be assessed using the coefficients of the explanatory variables. These coefficients give the effect each variable has on the model. In other words, each coefficient represents the expected change in the log odds of the dependent variable when the explanatory variable changes by one unit. In the case of a categorical explanatory variable, the coefficient represents the change in the dependent variable with respect to the reference category.
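The interpretation of logistic regression coefficients can be illustrated with a short Python sketch. Exponentiating a coefficient gives an odds ratio, and the linear predictor is converted to a probability with the logistic function. The intercept and coefficients below are hypothetical, since the fitted coefficients are not reported in this chapter.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(coefficient)

def predicted_probability(intercept, coefficients, values):
    """Probability of the outcome from the logistic function."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1 / (1 + math.exp(-z))

# Hypothetical model: intercept -1.0, one coefficient of 0.5 for an
# education level coded 0, 1, 2, ...
low = predicted_probability(-1.0, [0.5], [0])
high = predicted_probability(-1.0, [0.5], [4])
print(round(odds_ratio(0.5), 3), round(low, 3), round(high, 3))
```

A positive coefficient gives an odds ratio above 1, so the predicted probability rises as the explanatory variable increases; a coefficient of 0 gives an odds ratio of exactly 1 and no change.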

5.3 Results discussion

The survey data were obtained from a sample of the USA population, and thus the analysis results reflect the state of HPV awareness in the USA. Results from the data analysis indicate that the majority of the sample, about 58.5%, is aware of HPV, while 40.1% is not. These numbers suggest that a substantial part of the community is not yet aware of HPV, which calls for HPV awareness campaigns from health ministries and the government. Controlling and combating HPV begins with public awareness of the virus, how it manifests, the available control measures, and treatment where it exists. These measures cannot be achieved if the community is not aware of HPV.

The Internet dominated as the first source of information on health and medical topics, accounting for about 47.5% of the respondents. This can be associated with the rapid growth of technology worldwide. Sources of health information exist in several forms and media; those presented to respondents included the Internet, health care providers, libraries, and magazines. A doctor or health care provider was the second most frequent source of information among the respondents, at 11.1%. Since most of the population relies on the Internet and health care providers for health information, the ministry of health should focus mainly on these two channels when communicating health information, in order to reach as many people as possible.

Social beliefs about cancer are viewed as a confounding factor affecting the association between cancer awareness and the explanatory variables in the research. 23.7% of the respondents are somewhat worried about getting cancer, 17.9% are not worried at all, while 5.3% are extremely worried. This indicates that a large share of the USA population is worried about getting cancer. This may be due to personal experience with the disease or to information about cancer and its effects obtained from several sources. Additionally, the largest group of respondents, about 33.0%, find it neither difficult nor easy to imagine themselves developing cancer in the future.

A statistically significant association was established between HPV awareness and the level of education completed, with a p-value below 0.05. Respondents with more education, as measured by the level of education completed, indicated that they had heard of HPV, implying that they were aware of it. A majority of the respondents who had completed 11 years of schooling or less indicated that they had not heard of HPV. The chi-square test thus associated HPV awareness with a higher level of education completed. Interestingly, a chi-square test on HPV awareness and occupational status showed that a higher share of the employed USA population is aware of HPV. HPV awareness is, however, lower among students and the unemployed population.

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.

Agresti, A. (2018). An introduction to categorical data analysis. John Wiley & Sons.

Pallant, J. (2020). SPSS survival manual: A step by step guide to data analysis using IBM SPSS. Routledge.

Kumar, R. (2019). Research methodology: A step-by-step guide for beginners. Sage Publications Limited.

Mackey, A., & Gass, S. M. (2015). Second language research: Methodology and design. Routledge.

Sperandei, S. (2014). Understanding logistic regression analysis. Biochemia Medica, 24(1), 12-18.

Allison, P. D. (2012). Logistic regression using SAS: Theory and application. SAS Institute.

Sharpe, D. (2015). Chi-square test is statistically significant: Now what? Practical Assessment, Research, and Evaluation, 20(1), 8.

Vaske, J. J. (2019). Survey research and analysis. Sagamore-Venture.

Bickel, P. J., & Lehmann, E. L.