Ryan Dunn, Maria Suplee, and Camille Veselka
As gender becomes an increasingly relevant topic in society, it becomes necessary to examine how gender affects the way humans make choices and form preferences in daily life. While many hold assumptions and opinions about female versus male preferences, it is essential to determine which of these assumptions are accurate and which are rooted in stereotype and bias. We seek to determine how gender actually affects lifestyle and preferences by using statistical evidence rather than long-held assumptions. This pursuit will lead to a better understanding of how gender affects lifestyles and the choices humans make every day. With this information we will use a machine learning algorithm to predict gender from self-reported responses about lifestyle and preferences. This algorithm could benefit private enterprise (more targeted marketing), society (informing gender-based policy decisions), and data science (helping future researchers predict gender to fill in missing data, and supporting gender studies).
The data set we are using comes from a survey of 1010 Slovak young people (ages 16 to 30) conducted by a group of students enrolled in a statistics class at Comenius University in Bratislava [2]. The data was found on Kaggle and comes with a few important considerations. First, the data was self-reported. Because participants may not have been entirely honest, or may have answered questions randomly, we will look for outliers in the data. This will be essential when looking at the results of regressions, because the statistics of an OLS regression can be heavily influenced by outliers. It is also important to remember that, because we are looking at human-reported data, our adjusted R-squared may be lower than usual (as is common when modeling human behaviour) [4]. Second, the survey was conducted in Slovakia, and the translation to English may affect our interpretation of the results. Finally, the data is specific to Slovak people ages 16 to 30; however, because the existence of gender does not vary by country, we believe we can apply this data on a more universal scale.
You will need Python 3 installed, an internet connection, and the following libraries imported into your Python notebook:
# import packages we will need
!pip install ggplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import tree
from sklearn.model_selection import cross_val_score
import copy
import seaborn as sns
from ggplot import *
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.neighbors import NearestNeighbors
To get started, download the young people data set from Kaggle [1]. Unzip the data set and put the “responses.csv” file in the same folder as your Python notebook. Now read the CSV.
# read in the data
responses = pd.read_csv("responses.csv")
To tidy the data, replace every NaN (np.nan) with False.
# prepare some dataframes for later
young = responses.copy(deep=True)
# keep the respondents who did not report a gender, for the prediction step at the end
no_gender = responses.replace(np.nan, False)
no_gender = no_gender.loc[no_gender.Gender == False]
# mark missing answers with False; incomplete rows are filtered out per category later
responses = responses.replace(np.nan, False)
responses.head()
# display more information about responses
responses.describe()
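Replacing NaN with False keeps every row but mixes booleans into numeric and text columns. As an alternative sketch (not what the rest of this notebook does), the same bookkeeping can be written with pandas' own missing-value helpers. The toy frame below is made up for illustration and only stands in for responses.

```python
import numpy as np
import pandas as pd

# toy stand-in for the survey table (values are invented, not survey data)
toy = pd.DataFrame({
    "Gender": ["female", np.nan, "male"],
    "Age": [20.0, 25.0, np.nan],
})

# rows where gender was withheld (the role played by no_gender above)
missing_gender = toy[toy["Gender"].isna()]

# rows that reported a gender, whatever else is missing
with_gender = toy.dropna(subset=["Gender"])

# rows with no missing answers at all
complete = toy.dropna()
```

With this approach the missing values stay NaN, so later steps would filter with dropna rather than by comparing against False.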
Next organize the data into categories that mirror the categories described in the data description on Kaggle, including music, movies, hobbies, phobias, health, opinions, spending, and demographics. This will make using the data and interpreting the data simpler and more efficient.
# split responses into categories according to the original questionnaire
music = responses.filter(["Music", "Slow songs or fast songs", "Dance", "Folk", "Country", "Classical music", "Musical", "Pop", "Rock", "Metal or Hardrock", "Punk", "Hiphop, Rap", "Reggae, Ska", "Swing, Jazz", "Rock n roll", "Alternative", "Latino", "Techno, Trance", "Opera"], axis=1)
movies = responses.filter(["Movies", "Horror", "Thriller", "Comedy", "Romantic", "Sci-fi", "War", "Fantasy/Fairy tales", "Animated", "Documentary", "Western", "Action"])
hobbies = responses.filter(["History", "Psychology", "Politics", "Mathematics", "Physics", "Internet", "PC", "Economy Management", "Biology", "Chemistry", "Reading", "Geography", "Foreign languages", "Medicine", "Law", "Cars", "Art exhibitions", "Religion", "Countryside, outdoors", "Dancing", "Musical instruments", "Writing", "Passive sport", "Active sport", "Gardening", "Celebrities", "Shopping", "Science and technology", "Theatre", "Fun with friends", "Adrenaline sports", "Pets"])
phobias = responses.filter(["Flying", "Storm", "Darkness", "Heights", "Spiders", "Snakes", "Rats", "Ageing", "Dangerous dogs", "Fear of public speaking"])
health = responses.filter(["Smoking", "Alcohol", "Healthy eating"])
opinions = responses.filter(["Daily events", "Prioritising workload", "Writing notes", "Workaholism", "Thinking ahead", "Final judgement", "Reliability", "Keeping promises", "Loss of interest", "Friends versus money", "Funniness", "Fake", "Criminal damage", "Decision making", "Elections", "Self-criticism", "Judgment calls", "Hypochondria", "Empathy", "Eating to survive", "Giving", "Compassion to animals", "Borrowed stuff", "Loneliness", "Cheating in school", "Health", "Changing the past", "God", "Dreams", "Charity", "Number of friends", "Punctuality", "Lying", "Waiting", "New environment", "Mood swings", "Appearence and gestures", "Socializing", "Achievements", "Responding to a serious letter", "Children", "Assertiveness", "Getting angry", "Knowing the right people", "Public speaking", "Unpopularity", "Life struggles", "Happiness in life", "Energy levels", "Small - big dogs", "Personality", "Finding lost valuables", "Getting up", "Interests or hobbies", "Parents' advice", "Questionnaires or polls", "Internet usage"])
spending = responses.filter(["Finances", "Shopping centres", "Branded clothing", "Entertainment spending", "Spending on looks", "Spending on gadgets", "Spending on healthy eating"])
demographics = responses.filter(["Age", "Height", "Weight", "Number of siblings", "Gender", "Left - right handed", "Education", "Only child", "Village - town", "House - block of flats"])
First, to visualize the variation in the data, plot the gender and age distribution of responses using a violin plot.
# display information about responses
plotdata = responses.copy(deep=True)
plotdata["Gender"] = pd.factorize(plotdata['Gender'])[0]
print("Number of Responses: "+str(len(plotdata)))
print("Females: "+str(len(plotdata.loc[plotdata.Gender == 0])))
print("Males: "+str(len(plotdata.loc[plotdata.Gender == 1])))
print("Gender withheld: "+str(len(plotdata.loc[plotdata.Gender == 2])))
# create plot of spread of ages and genders of respondents
ggplot(aes(x='Gender', y='Age'), data=plotdata) +\
    geom_violin() +\
    labs(title="Gender and Age Distribution of Responses",x = "Gender",y = "Age")
0 - Female
1 - Male
2 - Gender withheld
From the violin plot we can see that most responses with a specified gender fall within the 16 to 30 year old range, and that men and women are about equally represented across it.
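The 0/1/2 coding used for the plot comes from pd.factorize, which numbers the values in the order they first appear in the column rather than alphabetically. The sketch below, on a made-up series, shows that behaviour next to an explicit mapping, which is one way to make the coding independent of row order. The series contents are assumptions for illustration only.

```python
import pandas as pd

# toy gender column in which "male" happens to appear first
genders = pd.Series(["male", "female", "female", "male"])

# factorize assigns codes by order of first appearance
codes, uniques = pd.factorize(genders)

# an explicit mapping fixes the coding regardless of row order
explicit = genders.map({"female": 0, "male": 1})
```

Here codes starts with 0 for "male" because it is the first value seen, whereas the explicit mapping always gives female 0 and male 1.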
Next, to interpret the data set and determine how each of the 150 columns relates to the others, it is essential to find a way to visualize these relationships.
To do this, we borrow a visualization method from another analysis of this data set by Jekaterina Kokatjuhha [3]. It displays the correlation coefficient between every column and a variable of interest, which here is gender. The function that creates this visualization is shown below.
# our imported function
def do_ploting(x, y, figsize):
    fig, ax = plt.subplots(figsize= figsize)
    ax.set_title("Correlation coefficient of the variables")
    sns.barplot(x=x, y=y, ax = ax)
    # the coefficients are drawn along the x axis (horizontal bars)
    ax.set_xlabel("Correlation coefficients")
def correlation_plot(var_of_interest, df_main, mapping, figsize = (10,30)):
    def calc_corr(var_of_interest, df, cols, figsize):
        lbls = []
        vals = []
        for col in cols:
            lbls.append(col)
            vals.append(np.corrcoef(df[col], df[var_of_interest])[0,1])
        corrs = pd.DataFrame({'features': lbls, 'corr_values': vals})
        corrs = corrs.sort_values(by='corr_values')
        do_ploting(corrs.corr_values, corrs['features'], figsize)
        return corrs
    # imputing the set: fill missing numeric answers with the column mean
    df = copy.deepcopy(df_main)
    df.replace(mapping, inplace = True)
    mean_values = df.mean(axis=0, numeric_only=True)
    df.fillna(mean_values, inplace=True)
    # correlating non-categorical variables
    cols_floats = [col for col in df.columns if df[col].dtype!='object']
    cols_floats.remove(var_of_interest)
    corrs_one = calc_corr(var_of_interest, df, cols_floats, figsize)
    # correlating categorical variables
    cols_cats = [col for col in df.columns if df[col].dtype=='object']
    if cols_cats:
        df_dummies = pd.get_dummies(df[cols_cats])
        cols_cats = df_dummies.columns
        df_dummies[var_of_interest] = df[var_of_interest]
        corrs_two = calc_corr(var_of_interest, df_dummies, cols_cats, (5,10))
    else:
        corrs_two = 0
    return [corrs_one, corrs_two]
To display the visualization call the function:
var_of_interest = 'Gender'
# create graphs of correlation coefficients for gender
mapping = {var_of_interest: {"female": 0, "male": 1 }}
young.dropna(subset=[var_of_interest], inplace=True)
# to be able to use hue parameter for better comparison in seaborn
young["all"] = ""
corrs_area = correlation_plot(var_of_interest, young, mapping)
plt.show()
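To make concrete what each bar in these plots measures, the sketch below computes the Pearson correlation between a 0/1 gender coding and one numeric answer, once with np.corrcoef (as in calc_corr) and once from the definition. The six toy answers are invented for illustration and are not taken from the survey.

```python
import numpy as np

# toy data: 0 = female, 1 = male, and a made-up 1-5 interest score
gender = np.array([0, 0, 0, 1, 1, 1])
interest = np.array([1, 2, 2, 4, 5, 4])

# the value that calc_corr stores for one column
r = np.corrcoef(interest, gender)[0, 1]

# the same quantity from the definition: covariance over the
# product of the two standard deviations
cov = np.mean((interest - interest.mean()) * (gender - gender.mean()))
r_manual = cov / (interest.std() * gender.std())
```

A value near +1 means high scores go with the group coded 1, a value near -1 means they go with the group coded 0, and a value near 0 means the answer carries little linear information about gender.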
After seeing which values seem to have the strongest correlation with gender, we want to see how powerful these relationships really are. To do this, we construct different decision trees and examine their accuracy in predicting a respondent's gender based on their answers to other questions.
A decision tree is essentially a graph in which each internal node tests a feature and each edge leaving that node corresponds to one of the feature's possible values (or a range of values). Each leaf holds the predicted value of the target the tree is built around. The tree predicts by using the features of the input sample to travel from the root down to a leaf. More information about decision trees can be found here.
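To make the idea of a split concrete, here is a minimal sketch of how a single node could choose a threshold on one feature by minimizing Gini impurity, which is the default criterion of sklearn's DecisionTreeClassifier. The helper names gini and best_threshold and the toy answers are our own assumptions for illustration; sklearn's actual implementation is far more elaborate.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best `value <= t` split."""
    best = (None, float("inf"))
    # try every distinct value except the largest as a cut point
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# toy data: invented 1-5 answers to one question, with the reported gender
shopping = [1, 2, 2, 3, 4, 5, 5, 4]
gender = ["male", "male", "male", "female", "female", "female", "female", "male"]

threshold, impurity = best_threshold(shopping, gender)
```

On this toy data the best cut is at 2: everyone answering 2 or less is male, and the remaining group is mostly female. A full tree repeats this search on each resulting group until the leaves are pure or a stopping rule applies.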
We chose to build trees that predict gender from the variables in the different categories of survey responses. To avoid being repetitive, we won't include every tree we tried, just a few of the more relevant ones. Some of our trees really aren't that great at predicting. Our tree based on the health responses yields the reported gender about half the time, according to the accuracy scores.
health['Gender'] = responses['Gender']
health = health.replace(np.nan, False)
# drop every row that is missing an answer in any column
for column in health.columns:
    health = health[health[column] != False]
features = health.columns[:3]
# change categorical variables to integer values
mapping1 = {'Smoking': {'never smoked': 0, 'tried smoking': 1, 'former smoker' : 2, 'current smoker' : 3}}
mapping2 = {'Alcohol': {'never': 0, 'social drinker': 1, 'drink a lot' : 2}}
health.replace(mapping1, inplace = True)
health.replace(mapping2, inplace = True)
y = pd.factorize(health['Gender'])[0]
health
# create a decision tree classifier
clf = tree.DecisionTreeClassifier()
# this will do a 10-fold cross-validation
scores_tc = cross_val_score(clf, health[features], y, cv=10)
scores_tc
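The cv=10 argument means the data is cut into ten folds, and the tree is trained ten times, each time tested on the one fold it did not see. The sketch below shows only the index bookkeeping behind that idea. The function name k_fold_indices is our own, and the folds here are plain contiguous slices; for a classifier with an integer cv, sklearn uses stratified folds, which additionally keep the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 23 toy samples split into 10 folds
folds = list(k_fold_indices(23, 10))
```

Each entry of scores_tc is the accuracy on one held-out fold, so scores_tc.mean() gives a single summary figure for a tree.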
Features from the spending category give us about the same degree of accuracy:
spending['Gender'] = responses['Gender']
spending = spending.replace(np.nan, False)
# drop every row that is missing an answer in any column
for column in spending.columns:
    spending = spending[spending[column] != False]
features = spending.columns[:7]
y = pd.factorize(spending['Gender'])[0]
spending
# create a decision tree classifier
clf = tree.DecisionTreeClassifier()
# this will do a 10-fold cross-validation
scores_tc = cross_val_score(clf, spending[features], y, cv=10)
scores_tc
Although some runs produce better results than others, these trees don't seem to predict accurately enough to be useful to us. However, other categories give much better results. Building a tree from the opinions category predicts gender correctly over 70% of the time.
opinions['Gender'] = responses['Gender']
opinions = opinions.replace(np.nan, False)
# drop every row that is missing an answer in any column
for column in opinions.columns:
    opinions = opinions[opinions[column] != False]
# the first 57 columns are the opinion questions; the last one is Gender
features = opinions.columns[:57]
# change categorical variables to integer values
mapping1 = {'Punctuality': {'i am often early': 0, 'i am always on time': 1, 'i am often running late' : 2}}
mapping2 = {'Lying': {'never': 0, 'only to avoid hurting someone': 1, 'sometimes' : 2, 'everytime it suits me' : 3}}
mapping3 = {'Internet usage': {'no time at all': 0, 'less than an hour a day': 1, 'few hours a day' : 2, 'most of the day' : 3}}
opinions.replace(mapping1, inplace = True)
opinions.replace(mapping2, inplace = True)
opinions.replace(mapping3, inplace = True)
y = pd.factorize(opinions['Gender'])[0]
# create a decision tree classifier
clf = tree.DecisionTreeClassifier()
# this will do a 10-fold cross-validation
scores_tc = cross_val_score(clf, opinions[features], y, cv=10)
scores_tc
Surprisingly, hobbies turn out to be the best predictors of gender in our dataset, giving us accuracy scores from 70% up to 85% per run.
hobbies['Gender'] = responses['Gender']
hobbies = hobbies.replace(np.nan, False)
# drop every row that is missing an answer in any column
for column in hobbies.columns:
    hobbies = hobbies[hobbies[column] != False]
# the first 32 columns are the hobby questions; the last one is Gender
features = hobbies.columns[:32]
y = pd.factorize(hobbies['Gender'])[0]
# create a decision tree classifier
clf = tree.DecisionTreeClassifier()
# this will do a 10-fold cross-validation
scores_tc = cross_val_score(clf, hobbies[features], y, cv=10)
scores_tc
This tree actually gives us something we can work with! Before moving forward, we want to see if we can make an even better decision tree built only from the features with the highest correlation coefficients. We pick our "best" 9 features and see what the tree yields.
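We chose these features by eye from the correlation plots. A hedged sketch of how the same ranking could be done programmatically is shown below, using a toy table in the same shape as the frames returned by correlation_plot (a features column and a corr_values column). The feature names A to E and their values are invented for illustration.

```python
import pandas as pd

# toy correlation table; names and values are made up
corrs = pd.DataFrame({
    "features": ["A", "B", "C", "D", "E"],
    "corr_values": [-0.45, 0.10, 0.30, -0.05, 0.38],
})

# rank by absolute value so strong negative correlations count too
ranked = corrs.assign(abs_corr=corrs["corr_values"].abs())
top3 = ranked.nlargest(3, "abs_corr")["features"].tolist()
```

Ranking on the absolute value matters here because a feature that is strongly associated with the group coded 0 has a large negative coefficient.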
best = pd.DataFrame()
best['Gender'] = responses['Gender']
best['Alcohol'] = health['Alcohol']
best['Romantic'] = movies['Romantic']
best['Shopping'] = hobbies['Shopping']
best['Reading'] = hobbies['Reading']
best['PC'] = hobbies['PC']
best['Cars'] = hobbies['Cars']
best['Internet'] = hobbies['Internet']
best['Internet usage'] = opinions['Internet usage']
best['Punctuality'] = opinions['Punctuality']
best = best.replace(np.nan, False)
# drop incomplete rows; note that `!= False` also drops answers encoded as 0 by the mappings above
for column in best.columns:
    best = best[best[column] != False]
# column 0 is Gender; columns 1 to 9 are the selected features
features = best.columns[1:10]
y = pd.factorize(best['Gender'])[0]
# create a decision tree classifier
clf = tree.DecisionTreeClassifier()
# this will do a 10-fold cross-validation
scores_tc = cross_val_score(clf, best[features], y, cv=10)
scores_tc
Unfortunately, this tree isn't any better at predicting gender than our trees based solely on hobbies or solely on opinions. Since we still think the features with the largest correlation coefficients should be useful, we try another strategy: building a regression model from these features.
responses = responses.rename(columns={'Life struggles': 'Struggles'})
# keep only respondents who answered the questions filtered on below
responses = responses.replace(np.nan, False)
responses = responses[responses['Gender'] != False]
responses = responses[responses['Struggles'] != False]
responses = responses[responses['Alcohol'] != False]
# encode gender numerically: female = 0, male = 1
responses['Gender'] = responses['Gender'].map({'female': 0, 'male': 1}).astype(int)
# make sure the 1-5 answers are treated as numbers rather than categories
responses['Struggles'] = responses['Struggles'].astype(float)
# create linear regression from features with high correlation coefficients
fit = smf.ols(formula = 'Gender ~ Struggles + Romantic + Shopping + Reading + Spiders + Alcohol + Internet', data = responses).fit()
fit.summary()
Now that we've made the model, let's look at a breakdown of the fit.
# create anova table from our regression model
model1 = sm.stats.anova_lm(fit, typ = 2)
model1
As we can see, the model is far from perfect, but the values still give us some insight. The p-values for most of the features are below .05, so for those features we can reject the null hypothesis that they have no linear relationship with gender once the other features are accounted for. This supports our hunch from examining the correlation coefficient graphs and running our decision trees. Our F-value isn't great, suggesting that the relationship isn't that strong, but the R-squared value, which measures how much of the variation in gender the fitted line explains, gives us something to work with. Objectively, our R-squared value is not excellent, but because we are working with human-reported data this is to be expected, and we consider it acceptable.
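As a reminder of what the R-squared and adjusted R-squared figures in the regression summary mean, the sketch below fits a one-feature least-squares line to a made-up 0/1 outcome and computes both by hand. The five data points are invented for illustration, and numpy's least-squares solver stands in for the statsmodels fit.

```python
import numpy as np

# toy data: one 1-5 answer and a 0/1 outcome (values are made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

# least-squares fit of y = b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of the variation in y explained by the fit
resid = y - X @ beta
ss_res = float(resid @ resid)
ss_tot = float(((y - y.mean()) ** 2).sum())
r2 = 1 - ss_res / ss_tot

# adjusted R-squared penalizes the number of predictors p
n, p = len(y), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Adjusted R-squared is never larger than R-squared, and the gap grows as more predictors are added relative to the number of rows, which is why it is the figure to watch when comparing models with different numbers of features.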
Because of the way this data was collected, it is possible that outliers exist. Outliers can interfere with the accuracy of models derived from machine learning algorithms and regressions. To combat this problem we use the k-nearest neighbours (KNN) algorithm. KNN essentially constructs a graph in which each row of the data is a vertex with an edge to each of its k nearest neighbours. Each edge has a weight given by the Euclidean distance between the vertex and that neighbour. More information on KNN can be found here.
Below we implement three metrics for deciding whether a row is an outlier: Mean Distance, Max Distance, and In Degree. Mean Distance classifies a row as an outlier if the average distance to its neighbours is above a given threshold. Max Distance classifies a row as an outlier if the largest of those distances is above a given threshold. In Degree classifies a row as an outlier if its in degree (i.e. the number of other vertices in the graph that have edges to the row) is below a given threshold. These are the three methods of outlier classification addressed in [5].
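The three metrics can be sketched on a handful of toy points using plain numpy. The five 2-D points and k = 2 are assumptions chosen so the result is easy to check by eye, and this brute-force version excludes each point from its own neighbour list. When sklearn's kneighbors is instead called on the same array the model was fitted on, each row normally comes back as its own first neighbour at distance 0, which lowers the mean distance and adds one to every in degree.

```python
import numpy as np

# toy 2-D points: four corners of a unit square and one far-away point
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
k = 2

# pairwise Euclidean distances; a point is not its own neighbour here
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)

# indices of, and distances to, the k nearest neighbours of each point
neighbours = np.argsort(dist, axis=1)[:, :k]
knn_dist = np.take_along_axis(dist, neighbours, axis=1)

mean_distance = knn_dist.mean(axis=1)   # large for isolated points
max_distance = knn_dist.max(axis=1)     # large for isolated points
# in degree: how many neighbour lists each point appears in
in_degree = np.bincount(neighbours.ravel(), minlength=len(points))
```

The far-away point stands out on all three metrics: its mean and max distances are several times larger than those of the square's corners, and no other point lists it as a neighbour, so its in degree is 0.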
First we use sklearn's NearestNeighbors class to find the 10 nearest neighbors of each response. This returns distances, the Euclidean distances to each neighbor, and indices, the row index of each neighbor.
# find nearest neighbors
knndata = responses.copy(deep=True)
knndata['Alcohol'] = pd.factorize(knndata['Alcohol'])[0]
# DataFrame.as_matrix was removed from pandas; select the columns and take .values instead
knndata_array = knndata[['Gender','Struggles','Romantic','Reading', 'Spiders', 'Alcohol', 'Internet']].values
nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree').fit(knndata_array)
distances, indices = nbrs.kneighbors(knndata_array)
Now we calculate each of the three outlier metrics. For mean distance and max distance we use a threshold of 1.5, so any row with a larger value is considered an outlier. For in degree we use a threshold of 2, so any row with an in degree of 2 or less is considered an outlier. These threshold values are somewhat arbitrary, but the plots below suggest that they exclude only a small share of the rows.
# calculate the mean distance from each row to its 10 nearest neighbors
mean_distance_arr = distances.mean(axis=1)
# keep the rows under the threshold, i.e. the non-outliers
keep = mean_distance_arr < 1.5
responses_mean_distance = responses.loc[keep]
# calculate the max distance
max_distance_arr = distances.max(axis=1)
keep = max_distance_arr < 1.5
responses_max_distance = responses.loc[keep]
responses.index = list(range(0,len(responses)))
# calculate the in degree: how often each row appears in a neighbor list
indegree_arr = np.bincount(indices.ravel(), minlength=len(responses))
keep = indegree_arr > 2
responses_indegree = responses.loc[keep]
# plot results of mean distance (the horizontal line marks the threshold)
mean_distance_arr.sort()
plt.bar(responses.index, mean_distance_arr)
plt.plot(responses.index, np.ones(len(responses))*1.5,'C2')
plt.show()
# redo ols regression
fit = smf.ols(formula = 'Gender ~ Struggles + Romantic + Shopping + Reading + Spiders + Alcohol + Internet', data = responses_mean_distance).fit()
fit.summary()
# plot results of max distance (line at the 1.5 threshold used for filtering)
max_distance_arr.sort()
plt.bar(responses.index, max_distance_arr)
plt.plot(responses.index, np.ones(len(responses))*1.5,'C2')
plt.show()
# redo ols regression
fit = smf.ols(formula = 'Gender ~ Struggles + Romantic + Shopping + Reading + Spiders + Alcohol + Internet', data = responses_max_distance).fit()
fit.summary()
# plot results of in degree
indegree_arr.sort()
plt.bar(responses.index, indegree_arr)
plt.plot(responses.index, np.ones(len(responses))*2,'C2')
plt.show()
# redo ols regression
fit = smf.ols(formula = 'Gender ~ Struggles + Romantic + Shopping + Reading + Spiders + Alcohol + Internet', data = responses_indegree).fit()
fit.summary()
For each metric, the code above plots the threshold as a horizontal line over a bar chart of the sorted per-row values, and then re-derives the OLS model on the data set with the outliers flagged by that metric removed.
We chose to continue our analysis with the results from the mean distance metric because it gave us good R-squared, F-statistic, and p-values. So now we retrain the hobbies tree classifier from before on the updated data set with outliers removed.
# re-train hobbies tree classifier
hobbies = responses_mean_distance.filter(["History", "Psychology", "Politics", "Mathematics", "Physics", "Internet", "PC", "Economy Management", "Biology", "Chemistry", "Reading", "Geography", "Foreign languages", "Medicine", "Law", "Cars", "Art exhibitions", "Religion", "Countryside, outdoors", "Dancing", "Musical instruments", "Writing", "Passive sport", "Active sport", "Gardening", "Celebrities", "Shopping", "Science and technology", "Theatre", "Fun with friends", "Adrenaline sports", "Pets"])
hobbies['Gender'] = responses_mean_distance['Gender']
# the first 32 columns are the hobby questions; the last one is Gender
features = hobbies.columns[:32]
y = pd.factorize(hobbies['Gender'])[0]
# create a decision tree classifier and fit it on all remaining rows
clf = tree.DecisionTreeClassifier()
clf.fit(hobbies[features],y)
# this will do a 10-fold cross-validation
scores_tc = cross_val_score(clf, hobbies[features], y, cv=10)
scores_tc
Finally we use this new classifier to attempt to assign a gender to participants who did not provide a gender response.
# predict genders of the 6 respondents who did not report gender
arr = clf.predict(no_gender[features])
arr
We now have the predicted genders of the six respondents who did not report gender. Here 0 corresponds to female and 1 to male. Although we have no way to confirm the genders of these respondents, the predictions seem reasonable when we examine other features that were not included in our model. Some of the features we examined were those with high correlation coefficients in relation to gender, such as 'Alcohol' and 'Life struggles'. Others, such as 'Weight' and 'Height', also seem to agree with the estimated genders, which matches common sense.
We are now able to predict gender from a person’s hobbies using a tree classifier. While the data is from Slovakia, we believe the model can be applied on a global scale because gender is not specific to a region. However, there are factors to consider when applying our model. Chief among them is that our data, and thus our model, may contain gender bias in the responses: because the data was self-reported, people may have answered under the influence of their own personal biases. We should also take into consideration cultural factors that may influence gender identity, and how these may affect the interests people form in different areas of life. Even with these considerations, it is interesting to examine the relationship between personal preference and gender identity in Slovak youth.