Research Report

Student Name

University Name

Course Name

Instructor Name

Date

Introduction

Stack Overflow is a community of developers, programmers, and data scientists that facilitates sharing ideas and solving technology-related problems. Employers in the community typically advertise jobs on the Stack Overflow job boards to target qualified applicants. Besides job postings, members also post problems to the community and receive solutions and responses from other members. This research report focuses on the job posting problems that arise in the community from redundant job postings and the large number of platforms on which jobs are posted.

The dependent variable in this research report is the respondents’ use of job boards. The explanatory variables include coding as a hobby, age, contribution to open source, years of coding, employment status, undergraduate major, and manager. The analysis will be restricted to respondents from three countries: Australia, the Netherlands, and the Russian Federation. The survey data under consideration come from the 2019 survey, subset to contain only the three listed countries. The original data had around 89,000 observations and 85 variables. After subsetting, the data set has around 5,500 observations and 85 variables.

The majority of the variables under consideration are categorical, and thus categorical data analysis techniques and models will be prioritized. The chi-square test of independence will be employed to check whether the use of job boards differs across respondents’ characteristics; correlation coefficients will also be computed to identify the most strongly correlated variables; and charts, graphs, and frequency tables will be presented in the descriptive statistics section. Finally, a random forest classification model will be fitted to model the use of job boards with respect to respondents’ characteristics.

Problem Statement

Normally, a job on Stack Overflow needs to be posted on only one board to target suitable candidates. Nevertheless, this is not always the case: jobs are posted in numerous places, which attracts candidates both suitable and unsuitable for the posted jobs. Assessing the factors that influence the use of job boards, taking candidates’ characteristics into account, can lead to better targeting of candidates, reduce job posting redundancy, and probably decrease the number of unfit applicants.

Research Method

Research methods cover the entire process of structuring the research. This comprises the data source, the data types, how the data were collected, the analysis methods, and how the results are presented and conveyed. The research objectives and the structure of the variables normally indicate which research methods to employ in a study. Quantitative research methods deal with numerical measurements and try to answer the quantitative research questions laid down (Bryman, 2016). The research problem here also entails prediction through a random forest model. Thus, quantitative techniques will be used.

Research methods, therefore, depend on the objectives of the study, research question, and the structure of the data.

Research Questions

The research questions address the problems laid down in the research project. A research question should be well structured and constructed so that it comprehensively addresses the research aims. Normally, each is stated as a question that requires a specific approach.

Is there a significant relationship between the respondents’ employment status, contribution to open source, coding as a hobby, and the use of job boards?
Is there significant evidence that the respondents’ answers on employment, open source, and coding as a hobby are sufficient to predict their answers on the use of job boards?
Does the data provide enough evidence for predicting the use of job boards?

Sample

A sample is a representation of the entire population in a research study. Generally, analysis done on a sample is used to generalize the results to the entire population. A sample of around 5,500 respondents from three countries was used for the analysis. All the respondents participated in the Stack Overflow survey of 2019. The sample size is a very important aspect of the analysis because both the analysis results and the model accuracy depend on it. In general, model accuracy tends to improve as the sample size increases.

The sample size is also affected by the missing values in a dataset. Thus, missing values should be handled in a way that does not significantly reduce the sample size. Imputing the missing values is generally the recommended way of dealing with them. Casewise elimination, which deletes every row in a dataset that has a missing value, is also practical when missing values are few.
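The two options described above can be sketched in a few lines of Python. This is a minimal illustration only: the rows and their values are hypothetical, the column names merely echo variables mentioned in this report, and mode imputation is just one simple imputation strategy among many.

```python
from collections import Counter

# Hypothetical toy rows; None marks a missing answer.
rows = [
    {"Employment": "Employed full-time", "Hobbyist": "Yes", "SOJobs": "Yes"},
    {"Employment": None, "Hobbyist": "Yes", "SOJobs": "No"},
    {"Employment": "Student", "Hobbyist": None, "SOJobs": "Yes"},
    {"Employment": "Employed full-time", "Hobbyist": "No", "SOJobs": "No"},
]

def casewise_delete(data):
    """Keep only the rows in which every field is present."""
    return [row for row in data if all(v is not None for v in row.values())]

def impute_mode(data, column):
    """Replace missing values in one column with that column's most frequent value."""
    observed = [row[column] for row in data if row[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**row, column: mode if row[column] is None else row[column]}
            for row in data]

complete = casewise_delete(rows)          # drops the two incomplete rows
imputed = impute_mode(rows, "Employment")  # keeps all four rows

print(len(rows), len(complete), len(imputed))
```

On this toy data, casewise elimination halves the sample while imputation keeps every row, which is the trade-off the paragraph above describes.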

Analysis Method and Limitations

Frequency tables and bar graphs are the most used techniques for analyzing categorical data. A frequency table presents the frequency of each class in a categorical variable. A contingency table can also be used to analyze two categorical variables through their joint frequency distribution. The chi-square test of independence investigates whether there is a statistically significant association between two categorical variables. For instance, it can be used to assess whether respondents’ employment status is associated with their use of job boards.
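As a rough illustration of how the chi-square test of independence works on a contingency table, the sketch below computes the Pearson statistic and its degrees of freedom in plain Python. The counts are hypothetical and the function name is an assumption made for this example; it is not the code used for the report, whose output appears to come from R.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts: rows = uses job boards (yes / no),
# columns = codes as a hobby (yes / no).
toy_table = [[30, 20], [20, 30]]
stat, df = chi_square_statistic(toy_table)
print(round(stat, 2), df)
```

The sketch applies no continuity correction, so for a 2 x 2 table it can differ from R's `chisq.test`, which applies Yates' correction by default in that case.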

The categories of several variables are unevenly distributed, and this imbalance can considerably affect the accuracy of the analysis. The presence of missing values also reduces the sample size of the data, which might affect the accuracy of the overall results (Hennink et al., 2020).

Descriptive Statistics

Fig 1.0: A bar graph of variable Hobbyist

Fig 1.1: A bar graph of variable employment

Fig 1.2: A bar graph of variable Opensourcer

Fig 1.3: A bar graph of variable job boards

Results

The data analysis shows a significant association between respondents’ employment status and their use of job boards (Table 1.0, p < 2.2e-16). Contribution to open source (OpenSourcer) is also significantly associated with the use of job boards (Table 1.3, p < 2.2e-16). Coding as a hobby (Hobbyist) is not: the p-value from the chi-square test of independence is 0.0955, which is greater than 0.05 (Table 1.1).

Table 1.0: Chi-square test of independence on employment and use of job boards

Pearson’s Chi-squared test

data: table(Mydata$SOJobs, Mydata$Employment)

X-squared = 193.05, df = 10, p-value < 2.2e-16

Table 1.1: Chi-square test of independence on coding as a hobby and use of job boards

Pearson’s Chi-squared test

data: table(Mydata$SOJobs, Mydata$Hobbyist)

X-squared = 4.6969, df = 2, p-value = 0.09552

Table 1.3: Chi-square test of independence on OpenSourcer and use of job boards

Pearson’s Chi-squared test

data: table(Mydata$SOJobs, Mydata$OpenSourcer)

X-squared = 110.01, df = 6, p-value < 2.2e-16
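The p-value in Table 1.1 can be checked by hand, because with exactly 2 degrees of freedom the upper tail of the chi-square distribution has the closed form exp(-x/2). The short sketch below uses that identity; the function name is an assumption for illustration, and the shortcut applies only to df = 2, so it does not cover Tables 1.0 and 1.3.

```python
import math

def chi_square_p_value_df2(statistic):
    """Upper-tail p-value of a chi-square statistic with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)

# Statistic reported in Table 1.1 (SOJobs vs. Hobbyist, df = 2).
p_hobbyist = chi_square_p_value_df2(4.6969)
is_significant = p_hobbyist < 0.05  # conventional 5% significance level

print(round(p_hobbyist, 5), is_significant)
```

The result agrees with the reported p-value of 0.09552 and lies above 0.05, so the Hobbyist test does not reject independence at the 5% level.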

Discussion

Table 1.4: Random forest model

mtry   Accuracy    Kappa        AccuracySD     KappaSD
1      0.5468891   0.00000000   0.0005534559   0.000000000
2      0.5543406   0.02660861   0.0037644631   0.009004333
3      0.5604795   0.07642199   0.0096809944   0.022069815
4      0.5679279   0.11320100   0.0106714702   0.026607434
5      0.5556618   0.09858260   0.0142558528   0.023421394
6      0.5569730   0.10273021   0.0086065024   0.011295122
7      0.5508357   0.09678513   0.0061475960   0.022134218
8      0.5552224   0.10733549   0.0122102559   0.016416118
9      0.5552186   0.10920201   0.0102294750   0.013260776
10     0.5530256   0.10697816   0.0083220774   0.022922332

Fig 1.4: Model accuracy

The chi-square tests of independence showed a statistically significant association between the use of job boards and both employment and OpenSourcer, but not Hobbyist. A random forest classification model was fitted to assess whether these associations are strong enough to predict the use of job boards. The data were divided into training and validation sets in a ratio of 70% to 30%.
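A 70% to 30% split of this kind can be sketched as follows. This is a plain random split written for illustration; the function name, the seed, and the stand-in records are assumptions, and the report does not say whether its own split was stratified by the outcome.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the survey rows.
records = list(range(100))
train, validation = train_validation_split(records)

print(len(train), len(validation))
```

Fixing the seed keeps the split reproducible, which matters when accuracy figures from different runs are compared.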

When the model was applied to the validation data, an accuracy of about 54% was achieved (0.5449; 95% CI 0.4996 to 0.5896). In the tuning results, accuracy ranged from about 55% to 57% and was highest at mtry = 4 (Table 1.4). Although the accuracy is above 50%, it is not significantly higher than the no-information rate of 0.5265 (p = 0.221), and the kappa of 0.0773 indicates only slight agreement beyond chance. The respondents’ answers on employment, OpenSourcer, and Hobbyist therefore provide only weak evidence for predicting the use of job boards.

Prediction results

Accuracy : 0.5449

95% CI : (0.4996, 0.5896)

No Information Rate : 0.5265

P-Value [Acc > NIR] : 0.221

Kappa : 0.0773

Mcnemar’s Test P-Value : <2e-16
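To make the quantities in the prediction results easier to read, the sketch below computes accuracy, the no-information rate (the share of the most common actual class), and Cohen's kappa from a confusion matrix. The matrix is hypothetical, since the report does not include its own, and the function name is an assumption for illustration.

```python
def classification_summary(matrix):
    """Return (accuracy, no-information rate, kappa).

    matrix[i][j] is the number of cases predicted as class i
    whose actual class is j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    accuracy = correct / total

    predicted_totals = [sum(row) for row in matrix]
    actual_totals = [sum(col) for col in zip(*matrix)]

    # Accuracy of always guessing the most frequent actual class.
    no_information_rate = max(actual_totals) / total

    # Agreement expected by chance, given the marginal totals.
    expected = sum(p * a for p, a in zip(predicted_totals, actual_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, no_information_rate, kappa

# Hypothetical two-class confusion matrix (rows = predicted, columns = actual).
toy_matrix = [[40, 25], [20, 15]]
accuracy, nir, kappa = classification_summary(toy_matrix)
print(round(accuracy, 4), round(nir, 4), round(kappa, 4))
```

In this toy case the accuracy exceeds 50% yet stays below the no-information rate, which is why accuracy is best judged against that rate rather than against 50%.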

Recommendations for Future Research

For future research, fitting the random forest on a larger set of variables, for instance more than four, is recommended in order to identify which variables matter most in predicting the use of job boards. Additionally, other variables such as education, career field, and satisfaction could offer further insight into the respondents’ responses.

Secondly, analyzing additional countries would show whether the results vary by country of residence. Beyond the random forest model, other researchers and scholars are also encouraged to fit other statistical models, such as logistic regression or a neural network.

Conclusion

There is a significant relationship between respondents’ use of job boards and both their employment status and their contribution to open source (OpenSourcer), but not coding as a hobby (Hobbyist). The association is weak for prediction, however: the random forest classification model reached about 54% accuracy on the validation data, which is not significantly better than the no-information rate.

Most of the respondents were employed full time, a total of 1394 respondents. Interestingly, considerably more respondents coded as a hobby than did not; those who coded as a hobby totaled 1422 respondents.

References

Bickel, P. J., & Lehmann, E. L. (2012). Descriptive statistics for nonparametric models I. Introduction. In Selected Works of E. L. Lehmann (pp. 465-471). Springer, Boston, MA.

Bryman, A. (2016). Social research methods. Oxford University Press.

Hennink, M., Hutter, I., & Bailey, A. (2020). Qualitative research methods. SAGE Publications Limited.

Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.

Rodriguez-Galiano, V. F., Ghimire, B., Rogan, J., Chica-Olmo, M., & Rigol-Sanchez, J. P. (2012). An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS Journal of Photogrammetry and Remote Sensing, 67, 93-104.

Stack Overflow. (2019). Stack Overflow annual developer survey. https://insights.stackoverflow.com/survey/
