01 Apr 2017
Reproducing "What Programming Languages Are Used Most on Weekends?"
Tags | On Computer Technology
I have a confession to make: I don't feel that I am a very naturally creative person. It's not that my mind is totally empty when it comes to thinking of new ideas; it's just that I find myself entering a state of helplessness whenever I try too hard to come up with totally novel ideas that would make for good publishing material.
I am looking to transition to a more machine-learning-oriented role for my next job. I need practice, lots of it. But I lack ideas. So why not replicate what other people have done and get some hands-on practice in the process?
A few weeks ago, I read https://stackoverflow.blog/2017/02/07/what-programming-languages-weekends/. It was pretty interesting. Let's try to replicate its results here.
NOTE: The dataset may be updated from time to time, so you may not get the same results as Julia Silge and I did. In particular, there's one part below where we use the statsmodels glm function and ran into a PerfectSeparationError. You may or may not get that.
As much as possible, I am trying to avoid loading all the data into memory because my machine is not very powerful. So there are certain operations that could probably be done easily with existing libraries, but I have avoided doing that.
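For reference, pandas can stream a CSV in chunks rather than loading it whole; here is a minimal sketch of that approach (not what I actually use below):
import pandas as pd

# Count rows by streaming questions.csv in 1-million-row chunks,
# so only one chunk is in memory at any time.
n_rows = 0
for chunk in pd.read_csv("questions.csv", chunksize=10**6):
    n_rows += len(chunk)
print(n_rows)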
The stacklite.zip file on Kaggle is 446MB. After decompressing, there are 2 files: questions.csv (862MB) and question_tags.csv (844MB). While it is possible to load the entire dataset into memory, there's no need to. I thought of loading the data into MySQL, but on second thought there was no need for that either.
It is pretty obvious that questions.csv stores questions data. The first 6 lines (including the header) look like this:
Id,CreationDate,ClosedDate,DeletionDate,Score,OwnerUserId,AnswerCount
1,2008-07-31T21:26:37Z,NA,2011-03-28T00:53:47Z,1,NA,0
4,2008-07-31T21:42:52Z,NA,NA,458,8,13
6,2008-07-31T22:08:08Z,NA,NA,207,9,5
8,2008-07-31T23:33:19Z,2013-06-03T04:00:25Z,2015-02-11T08:26:40Z,42,NA,8
9,2008-07-31T23:40:59Z,NA,NA,1410,1,58
Pretty self-explanatory.
The contents of question_tags.csv, however, stumped me a bit:
Id,Tag
1,data
4,c#
4,winforms
4,type-conversion
4,decimal
4,opacity
6,html
6,css
6,css3
6,internet-explorer-7
Initially, I thought it was a list of unique tags. But after checking out Julia Silge's kernel on Kaggle and upon closer inspection, I realized that the Id column is the question id and the Tag column is a tag for that question.
from collections import Counter, defaultdict
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
_ASKED_ON_WEEKDAY = 0
_ASKED_ON_WEEKEND = 1
def _is_weekday(date_string):
return datetime.datetime.strptime(
date_string,
"%Y-%m-%dT%H:%M:%SZ"
).weekday() < 5
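As a quick sanity check of the helper, using the first creation date from the sample above (2008-07-31 fell on a Thursday, so this should print True):
print(_is_weekday("2008-07-31T21:26:37Z"))  # True: a Thursday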
This is actually something we came back to do after getting a few steps in.
For questions.csv, we check that there are no duplicate question ids. For question_tags.csv, we check that there are no duplicate tags for each question.
question_ids = set()
questions_csv_clean = True
with open("questions.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, _, _, _, _, _, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id in question_ids:
print("Duplicate question id: {}".format(question_id))
questions_csv_clean = False
else:
question_ids.add(question_id)
print("questions.csv clean? {}".format(questions_csv_clean))
del question_ids
question_tags = {}
question_tags_csv_clean = True
nr_errors = 0
with open("question_tags.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, tag = line.strip().split(",")
question_id = int(str_question_id)
if question_id not in question_tags:
question_tags[question_id] = set([tag])
elif tag in question_tags[question_id]:
print("Duplicate tag `{}` for question id {}".format(
tag, question_id
))
question_tags_csv_clean = False
nr_errors += 1
if nr_errors >= 10:
break
else:
question_tags[question_id].add(tag)
print("question_tags.csv clean? {}".format(question_tags_csv_clean))
Uh oh. The question_tags.csv file isn't clean. There are a lot more duplicate tags than what I've shown here, but going through the entire data set causes my system to run out of memory, so to save space I'm not gonna show them all here.

Before we clean up the data, let us see if the question ids in question_tags.csv come in non-decreasing order. It will simplify the clean-up work if that's the case.
previous_question_id = -10**9
question_ids_non_decreasing = True
with open("question_tags.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id < previous_question_id:
print("question ids do not come in non-decreasing order. In particular, {} comes after {}".format(
question_id, previous_question_id
))
question_ids_non_decreasing = False
break
else:
previous_question_id = question_id
print("Are question ids in question_tags.csv non-decreasing? {}".format(
question_ids_non_decreasing
))
Nice. This vastly simplifies the clean-up work in that we only need to store all tags for the current question and upon encountering a larger question id, we write the tags for the current question to the new csv file. Rinse and repeat.
question_tags.csv

NOTE: The code in the following cell creates a question_tags_clean.csv file on your system.
with open("question_tags_clean.csv", "w") as out_f:
out_f.write("Id,Tag\n")
with open("question_tags.csv", "r") as in_f:
# skip header
in_f.readline()
current_question_id = None
current_question_tags = set()
for line in in_f:
str_question_id, tag = line.strip().split(",")
question_id = int(str_question_id)
if question_id != current_question_id:
# flush previous question's tags to file
for t in current_question_tags:
out_f.write("{},{}\n".format(current_question_id, t))
current_question_id = question_id
current_question_tags = set([tag,])
else:
current_question_tags.add(tag)
if current_question_id is not None:
for tag in current_question_tags:
out_f.write("{},{}\n".format(current_question_id, tag))
Let us first determine how many unique tags there are.
unique_tags = set()
with open("question_tags_clean.csv", "r") as f:
# Skip header line
f.readline()
for line in f:
_, tag = line.strip().split(",")
unique_tags.add(tag)
len(unique_tags)
Now, let us extract all the tags, along with their associated questions, then only keep those tags with at least 20,000 questions.
tags_to_questions = defaultdict(set)
with open("question_tags_clean.csv", "r") as f:
# Skip header line
f.readline()
for line in f:
question_id, tag = line.strip().split(",")
tags_to_questions[tag.strip()].add(int(question_id))
over20k_tags_to_questions = {}
for tag, questions_set in tags_to_questions.items():
if len(questions_set) >= 20000:
over20k_tags_to_questions[tag] = questions_set
del tags_to_questions
print("# tags with >= 20,000 questions: {}".format(len(over20k_tags_to_questions)))
Let's dump this subset of data to a file. We will shut down this Jupyter notebook and load the smaller data set back in.
with open("over20k_tags_to_questions.txt", "w") as f:
for tag, questions_set in over20k_tags_to_questions.items():
f.write("{}:{}\n".format(tag, ",".join(map(str, questions_set))))
Let's load the subset of data back in:
over20k_tags_to_questions = {}
with open("over20k_tags_to_questions.txt", "r") as f:
for line in f:
tag, questions = line.split(":")
over20k_tags_to_questions[tag] = set(map(int, questions.split(",")))
nr_questions_on_weekdays = 0
nr_questions_on_weekends = 0
with open("questions.csv", "r") as f:
# Skip header line
f.readline()
for line in f:
_, creation_date, _, _, _, _, _ = line.strip().split(",")
if creation_date != "NA":
if _is_weekday(creation_date):
nr_questions_on_weekdays += 1
else:
nr_questions_on_weekends += 1
print("# questions on weekdays: {}".format(nr_questions_on_weekdays))
print("# questions on weekends: {}".format(nr_questions_on_weekends))
Those are quite different numbers from those in Julia Silge's post:
Overall, this includes 10,451,274 questions on weekdays and 2,132,073 questions on weekends.
Probably the data set was updated some time after the post was written.
Something struck me for the next step. What does the author mean by relative frequency? Relative to what? Turns out she included the definition:
Instead, let’s explore which tags made up a larger share of weekend questions than they did of weekday questions, and vice versa.
So we're gonna go through all the tags with at least 20,000 questions, find out how many of the questions for each tag were posted on weekdays and weekends, then do a comparison.
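To make the definition concrete, here is a toy calculation with made-up numbers (not taken from the actual data set):
# Hypothetical tag: 3,000 of 2,000,000 weekend questions overall,
# and 10,000 of 10,000,000 weekday questions overall.
weekend_share = 3000 / 2000000    # 0.0015
weekday_share = 10000 / 10000000  # 0.0010
print(weekend_share / weekday_share)  # 1.5 => 1.5x as prominent on weekends as on weekdays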
# Get all question ids for tags with at least 20,000 questions
qns_with_some_tag_over_20k_qns = set()
for questions_set in over20k_tags_to_questions.values():
qns_with_some_tag_over_20k_qns.update(questions_set)
# For questions we are interested in, get whether they are asked during
# weekday or weekend
qoi_to_wday_wend = dict.fromkeys(
qns_with_some_tag_over_20k_qns,
-1
)
with open("questions.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, creation_date, _, _, _, _, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id in qoi_to_wday_wend and creation_date != "NA":
if datetime.datetime.strptime(
creation_date,
"%Y-%m-%dT%H:%M:%SZ"
).weekday() < 5:
qoi_to_wday_wend[question_id] = _ASKED_ON_WEEKDAY
else:
qoi_to_wday_wend[question_id] = _ASKED_ON_WEEKEND
# Now go through all the tags with over 20,000 questions and count
# the # of questions asked during weekdays vs. weekends
tag_to_nr_qns_on_wday = Counter()
tag_to_nr_qns_on_wend = Counter()
for tag, questions_set in over20k_tags_to_questions.items():
tag_to_nr_qns_on_wday[tag] = tag_to_nr_qns_on_wend[tag] = 0
for question_id in questions_set:
if qoi_to_wday_wend[question_id] == _ASKED_ON_WEEKDAY:
tag_to_nr_qns_on_wday[tag] += 1
elif qoi_to_wday_wend[question_id] == _ASKED_ON_WEEKEND:
tag_to_nr_qns_on_wend[tag] += 1
# Get top 20 tags with higher relative frequency of questions asked during weekdays
_INF = 10**9
wday_rel_freq_list = []
wend_rel_freq_list = []
total_nr_wday_qns = sum(tag_to_nr_qns_on_wday.values())
total_nr_wend_qns = sum(tag_to_nr_qns_on_wend.values())
for tag in over20k_tags_to_questions:
wday_nr_qns = tag_to_nr_qns_on_wday[tag]
wend_nr_qns = tag_to_nr_qns_on_wend[tag]
wday_freq = wday_nr_qns / total_nr_wday_qns
wend_freq = wend_nr_qns / total_nr_wend_qns
if wend_freq == 0:
wday_rel_freq_list.append((tag, _INF,))
else:
wday_rel_freq_list.append((tag, wday_freq / wend_freq,))
if wday_freq == 0:
wend_rel_freq_list.append((tag, _INF,))
else:
wend_rel_freq_list.append((tag, wend_freq / wday_freq,))
wday_rel_freq_list.sort(key=lambda t: t[1])
print(wday_rel_freq_list[-20:])
wend_rel_freq_list.sort(key=lambda t: t[1])
print(wend_rel_freq_list[-20:])
wday_df = pd.DataFrame(
list(
map(
lambda t: [t[0], float(t[1]) * 100],
wday_rel_freq_list[-20:][::-1]
)
),
columns=["tag", "relative_frequency"],
)
wend_df = pd.DataFrame(
list(
map(
lambda t: [t[0], float(t[1]) * 100],
wend_rel_freq_list[-20:][::-1]
)
),
columns=["tag", "relative_frequency"],
)
fig, (ax1, ax2,) = plt.subplots(ncols=2, figsize=(15,15,))
sns.barplot(
x="relative_frequency",
y="tag", data=wday_df,
color=sns.xkcd_rgb["coral"],
ax=ax1,
)
sns.barplot(
x="relative_frequency",
y="tag",
data=wend_df,
color=sns.xkcd_rgb["teal"],
ax=ax2,
)
plt.subplots_adjust(wspace=0.2)
plt.xticks([0, 50, 100, 150, 200, 250, 300, 350, 400,])
plt.subplots_adjust(top=0.9)
fig.suptitle(
"Which tags have the biggest weekday/weekend differences?\nFor tags with more than 20,000 questions",
size=20,
)
ax1.set_title("Weekdays")
ax1.set_xlabel("")
ax1.set_ylabel("")
ax2.set_title("Weekends")
ax2.set_xlabel("")
ax2.set_ylabel("")
fig.text(0.5, 0.09, "Relative frequency", ha="center", size=18,);
The next part deals with tags with biggest decrease in weekend activity. Here, decrease means the change between years 2016 and 2008 (ignoring trends in between those years).
qns_with_some_tag_over_20k_qns__to__tags = defaultdict(set)
for tag, questions_set in over20k_tags_to_questions.items():
for question_id in questions_set:
qns_with_some_tag_over_20k_qns__to__tags[question_id].add(tag)
def _add_to_relevant_count(
tags_activity_dict,
question_id,
creation_date
):
for tag in qns_with_some_tag_over_20k_qns__to__tags[question_id]:
if tag not in tags_activity_dict:
tags_activity_dict[tag] = (0, 0,)
wday_cnt, wend_cnt = tags_activity_dict[tag]
if _is_weekday(creation_date):
tags_activity_dict[tag] = (wday_cnt + 1, wend_cnt,)
else:
tags_activity_dict[tag] = (wday_cnt, wend_cnt + 1,)
tags_with_over20k_qns_2008_activity = {}
tags_with_over20k_qns_2016_activity = {}
with open("questions.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, creation_date, _, _, _, _, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id in qns_with_some_tag_over_20k_qns__to__tags:
str_year_of_creation = creation_date[:4]
if str_year_of_creation == "2008":
_add_to_relevant_count(
tags_with_over20k_qns_2008_activity,
question_id,
creation_date
)
elif str_year_of_creation == "2016":
_add_to_relevant_count(
tags_with_over20k_qns_2016_activity,
question_id,
creation_date
)
And... in the above step, my computer ran out of memory. Lol. Time to spin up an EC2 instance.
tags_wend_over_wday_ratio = []
tags_with_zero_count = 0
for tag in tags_with_over20k_qns_2008_activity:
if tag in tags_with_over20k_qns_2016_activity:
wday_2008, wend_2008 = tags_with_over20k_qns_2008_activity[tag]
wday_2016, wend_2016 = tags_with_over20k_qns_2016_activity[tag]
# Let's ignore tags that have 0 counts, because one of the ratios will be
# infinity
if wday_2008 != 0 and wend_2008 != 0 and wday_2016 != 0 and wend_2016 != 0:
tags_wend_over_wday_ratio.append(
(tag, wend_2008 / wday_2008, wend_2016 / wday_2016,)
)
tags_wend_over_wday_ratio.sort(
key=lambda t: t[1] - t[2]
)
print(tags_wend_over_wday_ratio[-10:])
print(sorted(tags_wend_over_wday_ratio, key=lambda t: t[2] - t[1])[-10:])
#print(tags_with_over20k_qns_2008_activity)
print(tags_with_over20k_qns_2008_activity["scala"])
print(tags_with_over20k_qns_2016_activity["scala"])
print(sorted(tags_wend_over_wday_ratio, key=lambda t: t[2])[-10:])
Looks very different from what's in Julia Silge's post. In fact, upon reading the R code in her Kaggle kernel, I found that our approach is completely different. Let's redo everything.
Turns out that in Julia Silge's kernel, she first keeps only the questions whose deletion date is NA, i.e. she filters out every question that has been deleted. Let's do that and drop the unnecessary columns.
questions = pd.read_csv("questions.csv")
questions = questions.loc[questions["DeletionDate"].isnull(), :]
questions.pop("DeletionDate")
questions.pop("ClosedDate")
questions.pop("Score")
questions.pop("OwnerUserId")
questions.pop("AnswerCount")
questions.head()
Now we load all the tags for the questions, but only for those questions which have not been deleted. We do this by using the merge method of pandas.DataFrame.
question_tags_clean = pd.read_csv("question_tags_clean.csv")
question_tags_clean = question_tags_clean.merge(
questions,
on="Id",
how="inner",
)
question_tags_clean.pop("CreationDate");
We are only interested in tags which have over 20,000 questions. Time to do some filtering.
tags_with_counts = question_tags_clean.groupby("Tag").count()
tags_with_counts = tags_with_counts.reset_index()
tags_with_counts.columns = ["Tag", "Count"]
tags_with_counts = tags_with_counts[tags_with_counts.Count > 20000]
question_tags_clean = question_tags_clean.merge(
tags_with_counts,
on="Tag",
how="inner",
)
question_tags_clean.pop("Count")
del tags_with_counts
Let's obtain the set of questions with some tag that has over 20,000 questions.
questions_with_tag_over_20k = questions.merge(
pd.DataFrame({"Id": question_tags_clean["Id"].unique(),}),
on="Id",
how="inner",
)
questions_with_tag_over_20k["Weekday"] = \
questions_with_tag_over_20k["CreationDate"].apply(_is_weekday)
tag_wday_counts = question_tags_clean.merge(
questions_with_tag_over_20k,
on="Id",
how="inner"
).groupby(["Tag", "Weekday"]).count()
tag_wday_counts.head()
Let's get rid of the hierarchical index formed by groupby.
tag_wday_counts = tag_wday_counts.reset_index()
tag_wday_counts.head()
The CreationDate column is redundant. Let's remove it.
tag_wday_counts.pop("CreationDate")
tag_wday_counts.columns = ["Tag", "Weekday", "Count"]
tag_wday_counts.head()
We want something like spread in tidyr. I saw on https://chrisalbon.com/python/pandas_long_to_wide.html that we can use pivot.
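For reference, here is a toy illustration (made-up rows) of what pivot does, going from long to wide:
toy_long = pd.DataFrame({
    "Tag": ["c#", "c#", "html", "html"],
    "Weekday": [False, True, False, True],
    "Count": [10, 90, 20, 80],
})
toy_wide = toy_long.pivot(index="Tag", columns="Weekday", values="Count")
# toy_wide now has one row per Tag and one column per Weekday value:
# Weekday  False  True
# Tag
# c#          10    90
# html        20    80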
tag_wday_counts = tag_wday_counts.pivot(
index="Tag",
columns="Weekday",
values="Count"
)
tag_wday_counts.head()
Let's remove the hierarchical index using reset_index.
tag_wday_counts = tag_wday_counts.reset_index()
tag_wday_counts.head()
Let's rename the columns:
tag_wday_counts.columns=["Tag", "Weekend", "Weekday"]
tag_wday_counts.head()
To obtain the rate of questions asked during weekends and weekdays for each tag, we have to calculate the total number of questions asked during weekdays and weekends.
questions["Weekday"] = questions["CreationDate"].apply(_is_weekday)
x = questions.groupby("Weekday").count()
x.head()
nr_wday_qns = x.loc[True, "Id"]
nr_wend_qns = x.loc[False, "Id"]
x = tag_wday_counts.copy()
x["Weekday"] /= nr_wday_qns
x["Weekend"] /= nr_wend_qns
x[x["Weekday"] == 0].shape
x["WeekendOverWeekday"] = x["Weekend"] / x["Weekday"]
x = x.sort_values(by="WeekendOverWeekday", ascending=False,)
x[:16]
These are the exact same tags as those in the original post, and in the same order. Now let's try getting at the highest weekday/weekend activity tags.
x[-16:][::-1]
Barring the incorrect numbers in the WeekendOverWeekday column, which can be corrected by taking reciprocals, the tags are the same as those in the original post and in the same order.
Now let's plot the bar charts.
highest_wend_plot_df = x[:16].loc[:, ["Tag", "WeekendOverWeekday"]]
highest_wend_plot_df["WeekendOverWeekday"] *= 100
highest_wday_plot_df = x[-16:].loc[:, ["Tag", "WeekendOverWeekday"]][::-1]
highest_wday_plot_df["WeekendOverWeekday"] = 1 / highest_wday_plot_df["WeekendOverWeekday"] * 100
fig, (ax1, ax2,) = plt.subplots(ncols=2, figsize=(12,12,))
sns.barplot(
x="WeekendOverWeekday",
y="Tag",
data=highest_wday_plot_df,
color=sns.xkcd_rgb["coral"],
ax=ax1,
)
sns.barplot(
x="WeekendOverWeekday",
y="Tag",
data=highest_wend_plot_df,
color=sns.xkcd_rgb["teal"],
ax=ax2,
)
plt.subplots_adjust(wspace=0.2)
plt.xticks([0, 50, 100, 150, 200, 250, 300,])
plt.subplots_adjust(top=0.9)
fig.suptitle(
"Which tags have the biggest weekday/weekend differences?\nFor tags with more than 20,000 questions",
size=20,
)
ax1.set_title("Weekdays")
ax1.set_xlabel("")
ax1.set_ylabel("")
ax2.set_title("Weekends")
ax2.set_xlabel("")
ax2.set_ylabel("")
fig.text(0.5, 0.07, "Relative frequency", ha="center", size=18,);
Nice. So that's one mystery solved. Now comes the harder part.
For the next part, we will need to compute, for each tag with over 20,000 questions, the number of its questions asked on weekdays and on weekends in each year. And for each year, we also need the total number of questions asked during weekends and weekdays. Let's do the latter first.
# Add year information to `questions`
questions["Year"] = questions["CreationDate"].apply(
lambda datestring: datestring[:4]
)
year_wday_counts = questions.copy()
year_wday_counts = year_wday_counts.groupby(
["Year", "Weekday"]
).count()
year_wday_counts = year_wday_counts.reset_index()
year_wday_counts.pop("Id")
year_wday_counts.rename(
columns={
"CreationDate": "Total",
},
inplace=True,
)
year_wday_counts.head()
Seems correct enough to me. Moving on.
Earlier we computed:
- questions_with_tag_over_20k: the set of all questions with some tag that has over 20,000 questions
- tag_wday_counts: the set of all tags where each tag has over 20,000 questions, along with the breakdown of the number of questions asked during weekdays and weekends
- question_tags_clean: the cleaned data originally from question_tags.csv
Let's take a look at them.
questions_with_tag_over_20k.head()
tag_wday_counts.head()
question_tags_clean.head()
Time to add year information to questions_with_tag_over_20k.
questions_with_tag_over_20k["Year"] = questions_with_tag_over_20k["CreationDate"].apply(
lambda datestring: datestring[:4]
)
Now let's break down the counts for each tag by year and by weekday/weekend.
tag_over_20k_year_wend_counts = questions_with_tag_over_20k.merge(
question_tags_clean,
how="left",
on="Id",
).groupby(["Year", "Tag", "Weekday"]).count()
tag_over_20k_year_wend_counts.head()
tag_over_20k_year_wend_counts.pop("CreationDate")
tag_over_20k_year_wend_counts.rename(
columns={"Id": "WeekendTotal"},
inplace=True,
)
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts.reset_index()
tag_over_20k_year_wend_counts.head()
My pandas skills aren't so awesome, so I'll extract the weekday counts from this dataframe, drop the rows we extracted, then do a merge to insert the weekday counts.
weekday_counts = tag_over_20k_year_wend_counts.loc[
tag_over_20k_year_wend_counts["Weekday"] == True
]
weekday_counts.head()
# Drop weekday counts
tag_over_20k_year_wend_counts.drop(
tag_over_20k_year_wend_counts[
tag_over_20k_year_wend_counts["Weekday"] == True
].index,
inplace=True,
)
tag_over_20k_year_wend_counts.head()
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts.merge(
weekday_counts,
how="inner",
on=["Year", "Tag",],
suffixes=("_l", "_r",),
)
tag_over_20k_year_wend_counts.head()
tag_over_20k_year_wend_counts.pop("Weekday_l")
tag_over_20k_year_wend_counts.pop("Weekday_r")
tag_over_20k_year_wend_counts.rename(
columns={
"WeekendTotal_l": "WeekendTotal",
"WeekendTotal_r": "WeekdayTotal",
},
inplace=True,
)
tag_over_20k_year_wend_counts.head()
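As an aside, the split-and-merge above could probably be done in one step with pivot_table, applied to the frame as it looked right after the reset_index (columns Year, Tag, Weekday and the count column). A sketch, where long_counts stands for a hypothetical copy of that earlier frame:
# `long_counts` is hypothetical: a copy of tag_over_20k_year_wend_counts taken
# right after the reset_index above, with columns Year, Tag, Weekday, WeekendTotal.
wide_counts = (
    long_counts
    .pivot_table(
        index=["Year", "Tag"],
        columns="Weekday",
        values="WeekendTotal",
        aggfunc="sum",
    )
    .reset_index()
)
# Columns are now Year, Tag, False (weekend counts), True (weekday counts).
wide_counts.columns = ["Year", "Tag", "WeekendTotal", "WeekdayTotal"]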
Nice. Now we almost have what we want in tag_over_20k_year_wend_counts and year_wday_counts.

But we need to drop tags whose combined total count in any year of occurrence is 20 or below.
A small experiment below shows that the Year column is not of integer type.
tag_over_20k_year_wend_counts[
(tag_over_20k_year_wend_counts["Year"] == 2008) &
(tag_over_20k_year_wend_counts["Tag"] == ".htaccess")
]
tag_over_20k_year_wend_counts.dtypes
Let's convert Year to integer.
tag_over_20k_year_wend_counts["Year"] = tag_over_20k_year_wend_counts["Year"].astype(int)
tag_over_20k_year_wend_counts[
(tag_over_20k_year_wend_counts["Year"] == 2008) &
(tag_over_20k_year_wend_counts["Tag"] == ".htaccess")
]
Success!
tags_to_remove = set()
for tag in tag_over_20k_year_wend_counts["Tag"].unique():
for year in range(2008, 2016 + 1):
tag_year_df = tag_over_20k_year_wend_counts[
(tag_over_20k_year_wend_counts["Year"] == year) &
(tag_over_20k_year_wend_counts["Tag"] == tag)
]
if tag_year_df.shape[0] == 1:
if tag_year_df.iloc[0]["WeekendTotal"] + tag_year_df.iloc[0]["WeekdayTotal"] <= 20:
tags_to_remove.add(tag)
break
print(tags_to_remove)
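As an aside, the same set could probably be computed without the nested loop by taking the minimum yearly total per tag; a sketch (keeping the 2008-2016 restriction used in the loop above):
# Minimum combined weekend + weekday count per tag across the years 2008-2016.
yearly_totals = (
    tag_over_20k_year_wend_counts
    .loc[tag_over_20k_year_wend_counts["Year"].between(2008, 2016)]
    .assign(YearTotal=lambda d: d["WeekendTotal"] + d["WeekdayTotal"])
    .groupby("Tag")["YearTotal"]
    .min()
)
tags_to_remove_alt = set(yearly_totals[yearly_totals <= 20].index)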
Now we remove those tags whose total number of posts in any year is <= 20:
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts[
~tag_over_20k_year_wend_counts["Tag"].isin(tags_to_remove)
]
We'll also delete data for the year 2017.
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts[
tag_over_20k_year_wend_counts["Year"] != 2017
]
For the next part, we'll be extracting the Year, WeekendTotal and WeekdayTotal data for each tag, then perform the "modeling" (which is totally non-obvious and which we'll go through in some detail later on) to obtain the tags whose weekend proportion has changed the most over the years. To do this, we have to make use of the statsmodels package.
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.sandbox.stats.multicomp as sm_multicomp
unique_tags = tag_over_20k_year_wend_counts["Tag"].unique()
trend_data_for_each_tag = {}
for tag in unique_tags:
trend_data_for_each_tag[tag] = tag_over_20k_year_wend_counts[
tag_over_20k_year_wend_counts["Tag"] == tag
]
So this is part of the "modeling" Julia Silge mentioned in one sentence of her blog post. This is almost the equivalent of the glm(cbind(nn, YearTagTotal) ~ Year, ., family="binomial") in her R kernel, except that I believe there is a bug and the 2nd element in the cbind should be the number of questions with that tag asked on weekdays for that year, not the total number of questions with that tag asked in the year.
According to the R documentation for glm:
A typical predictor has the form ‘response ~ terms’ where
‘response’ is the (numeric) response vector and ‘terms’ is a
series of terms which specifies a linear predictor for ‘response’.
For ‘binomial’ and ‘quasibinomial’ families the response can also
be specified as a ‘factor’ (when the first level denotes failure
and all others success) or as a two-column matrix with the columns
giving the numbers of successes and failures. A terms
specification of the form ‘first + second’ indicates all the terms
in ‘first’ together with all the terms in ‘second’ with any
duplicates removed.
We got a PerfectSeparationError during the model fitting.
trend_models = {}
for tag, df in trend_data_for_each_tag.items():
try:
trend_models[tag] = smf.glm(
formula="WeekendTotal + WeekdayTotal ~ Year",
data=df,
family=sm.families.Binomial(),
).fit()
except statsmodels.tools.sm_exceptions.PerfectSeparationError:
print(tag)
trend_data_for_each_tag["angular2"]
I do not know the exact reason for this. In fact, I do not fully know what glm is doing in this case. It seems to be doing logistic regression, but for regression rather than classification. I couldn't find any similar functionality in scikit-learn, so I had to use the statsmodels library.
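For what it's worth, here is a toy sketch (with made-up counts) of what the two-column response does: it fits a logistic curve to the proportion of a tag's questions asked on weekends, weighted by the yearly totals, so the sign of the Year coefficient says whether the weekend share is rising or falling.
# Made-up counts for a single hypothetical tag.
toy = pd.DataFrame({
    "Year": [2008, 2010, 2012, 2014, 2016],
    "WeekendTotal": [120, 300, 500, 650, 700],
    "WeekdayTotal": [800, 2200, 4100, 6000, 7500],
})
# (successes, failures) ~ Year models logit(weekend share) = b0 + b1 * Year.
toy_model = smf.glm(
    "WeekendTotal + WeekdayTotal ~ Year",
    data=toy,
    family=sm.families.Binomial(),
).fit()
print(toy_model.params["Year"])  # negative here: the weekend share is shrinking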
The next part extracts the p-value for Year from each model, adjusts the p-values for multiple comparisons, and then retains only those tags whose adjusted p-value is less than 0.1.
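For context, multipletests returns a tuple whose second element is the array of adjusted p-values (that is what the [1] below indexes); a toy illustration with made-up p-values:
# multipletests returns (reject, pvals_corrected, alphacSidak, alphacBonf).
reject, pvals_corrected, _, _ = sm_multicomp.multipletests([0.001, 0.04, 0.20])
print(pvals_corrected)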
tags_with_year_pvalues = []
for tag, model in trend_models.items():
tags_with_year_pvalues.append((tag, model.pvalues["Year"],))
adjusted_pvalues = sm_multicomp.multipletests(
list(map(lambda t: t[1], tags_with_year_pvalues))
)[1]
non_fluke_tags = list(
map(
lambda t: t[0],
filter(
lambda t: t[1] < 0.1,
list(
zip(
map(lambda t: t[0], tags_with_year_pvalues),
adjusted_pvalues
)
)
)
)
)
Now we extract the 8 tags whose weekend activity relative to weekday activity has decreased the most.
tag_activity_trend = []
for tag, model in trend_models.items():
if tag in non_fluke_tags:
tag_activity_trend.append((tag, model.params["Year"],))
tag_activity_trend.sort(key=lambda t: t[1])
tags_most_decreased_wend_to_wday_activity = tag_activity_trend[:8]
tags_most_increased_wend_to_wday_activity = tag_activity_trend[-8:]
tags_most_decreased_wend_to_wday_activity
tags_most_increased_wend_to_wday_activity
For the top 8 decreases, Julia Silge's analysis contains the tags azure and ruby-on-rails-4, whereas ours doesn't. But we have the tags go and extjs. Still pretty similar.
For the top 8 increases, Julia Silge's analysis contains the tags android-layout and unity3d, whereas ours doesn't. But we have the tags jquery-mobile and swing. Pretty similar too.
Let's plot the actual data for these tags.
# Turns out the `Year` column on `year_wday_counts` is not int.
# Let's convert it
year_wday_counts["Year"] = year_wday_counts["Year"].astype(int)
year_wday_qn_counts = year_wday_counts[year_wday_counts["Weekday"]].copy()
year_wday_qn_counts.pop("Weekday")
year_wend_qn_counts = year_wday_counts[year_wday_counts["Weekday"] != True].copy()
year_wend_qn_counts.pop("Weekday");
fig, ax = plt.subplots(figsize=(12,12,))
trend_plot_colors = [
sns.xkcd_rgb["salmon"],
sns.xkcd_rgb["tan"],
sns.xkcd_rgb["dark lime"],
sns.xkcd_rgb["greenish teal"],
sns.xkcd_rgb["brown red"],
sns.xkcd_rgb["azure"],
sns.xkcd_rgb["light purple"],
sns.xkcd_rgb["pink"],
]
idx = 0
for tag, _ in tags_most_decreased_wend_to_wday_activity:
df = trend_data_for_each_tag[tag].copy()
df = df.merge(
year_wend_qn_counts,
on="Year",
how="left",
)
df["WeekendTotal"] /= df["Total"]
df.pop("Total")
df = df.merge(
year_wday_qn_counts,
on="Year",
how="left",
)
df["WeekdayTotal"] /= df["Total"]
df.pop("Total")
df["Ratio"] = df["WeekendTotal"] / df["WeekdayTotal"]
ax.plot(
df["Year"],
df["Ratio"],
label=tag,
linewidth=5,
color=trend_plot_colors[idx],
)
idx += 1
leg = ax.legend(fontsize=15,)
leg.set_title("Tag", prop={"size": 18,})
ax.set_ylabel("Relative weekend/weekday use", fontsize=15, labelpad=20,)
ax.set_title(
"Which tags' weekend activity has decreased the most?",
fontsize=18,
fontweight="bold",
)
ttl = ax.title
ttl.set_position([0.5, 1.02])
Of note in this plot:
Time to do the same plot but for tags whose weekend activity has increased the most.
fig, ax = plt.subplots(figsize=(12,12,))
idx = 0
for tag, _ in tags_most_increased_wend_to_wday_activity:
df = trend_data_for_each_tag[tag].copy()
df = df.merge(
year_wend_qn_counts,
on="Year",
how="left",
)
df["WeekendTotal"] /= df["Total"]
df.pop("Total")
df = df.merge(
year_wday_qn_counts,
on="Year",
how="left",
)
df["WeekdayTotal"] /= df["Total"]
df.pop("Total")
df["Ratio"] = df["WeekendTotal"] / df["WeekdayTotal"]
ax.plot(
df["Year"],
df["Ratio"],
label=tag,
linewidth=5,
color=trend_plot_colors[idx],
)
idx += 1
leg = ax.legend(fontsize=15,)
leg.set_title("Tag", prop={"size": 18,})
ax.set_ylabel("Relative weekend/weekday use", fontsize=15, labelpad=20,)
ax.set_title(
"Which tags' weekend activity has increased the most?",
fontsize=18,
fontweight="bold",
)
ttl = ax.title
ttl.set_position([0.5, 1.02])
Of note in this plot:
- Many of the tags here look mobile related: jquery-mobile, listview, android-fragments, button, gridview.
- swing is seeing increasing weekend usage. Could it also be mobile related?
- selenium experienced a decline from 2008 to 2012, then saw some resurgence from 2012 to 2016.
- gridview and selenium, despite being on this plot, are still weekday-dominant technologies.

Time for a final plot that shows all the tags with over 20,000 questions and their relative weekend/weekday activity, overall.
tag_wend_wday_rel_activity = tag_wday_counts.copy()
tag_wend_wday_rel_activity["Total"] = \
tag_wend_wday_rel_activity["Weekday"] + tag_wend_wday_rel_activity["Weekend"]
tag_wend_wday_rel_activity["Weekend"] /= nr_wend_qns
tag_wend_wday_rel_activity["Weekday"] /= nr_wday_qns
tag_wend_wday_rel_activity["WeekendOverWeekday"] = \
tag_wend_wday_rel_activity["Weekend"] / tag_wend_wday_rel_activity["Weekday"]
fig, ax = plt.subplots(figsize=(15, 15,))
ax.scatter(
tag_wend_wday_rel_activity["Total"],
tag_wend_wday_rel_activity["WeekendOverWeekday"],
)
for tag, coord in zip(
tag_wend_wday_rel_activity["Tag"],
zip(
tag_wend_wday_rel_activity["Total"],
tag_wend_wday_rel_activity["WeekendOverWeekday"]
)
):
ax.annotate(tag, coord)
plt.axhline(y=1, linewidth=4, color="red", linestyle="dashed",)
ax.set_xscale("log")
ax.set_xlabel("Total # of questions", fontsize=15, labelpad=20,)
ax.set_ylabel("Relative use on weekends vs. weekdays", fontsize=15,)
ax.set_xticks([1e4, 1e5, 1e6])
ax.set_xticklabels(["1e+04", "1e+05", "1e+06",])
ax.set_yticks([0.5, 1, 2])
ax.set_yticklabels(["1/2 on Weekends", "Same", "2x on Weekends"])
fig.suptitle(
"Which tags have the biggest weekend/weekday differences?",
fontsize=20,
fontweight="bold",
x=0.44,
)
ttl = ax.set_title(
"For tags with more than 20,000 questions",
loc="left",
fontsize=15,
)
ttl.set_y(1.01)
plt.subplots_adjust(top=0.93)
Imho, not a very useful plot because of the clustering between 10,000 and 100,000 questions. This is probably best for identifying "outliers": we can easily spot the very popular technologies on the right side and the technologies which are very weekend or weekday dominant.
This was pretty instructive. Initially we didn't take into account that we should exclude deleted questions. Moreover, we tried too hard not to load everything into memory but ironically ended up using even more memory. Lol. I am very thankful for AWS. The relatively small size of this data set also means that an m4.xlarge instance with 16GB of memory is sufficient for our purposes, and we end up with a pretty small bill.
A few other takeaways and open questions:
- pandas' pivot turned out to be the equivalent of spread in R's tidyr package.
- The downside is that there is almost no machine learning involved, which is the area I want to get the most hands-on experience in.
- What model is glm(cbind(nn, YearTagTotal) ~ Year, data, family="binomial") using? And what do the p-values represent?
- I kept having to make .head() calls or print the dataframes out to see their schema. Is there a better way to manage these dataframes?

Disclaimer: Opinions expressed on this blog are solely my own and do not express the views or opinions of my employer(s), past or present.