01 Apr 2017
Reproducing "What Programming Languages Are Used Most on Weekends?"
Tags | On Computer Technology
I have a confession to make: I don't feel that I am a very naturally creative person. It's not that my mind is totally empty when it comes to thinking of new ideas; it's just that I find myself entering a state of helplessness whenever I try too hard to come up with totally novel ideas that would make for good publishing material.
I am looking to transition to a more machine-learning-oriented role for my next job. I need practice, lots of it. But I lack ideas. So why not replicate what other people have done and get some hands-on practice in the process?
A few weeks ago, I read https://stackoverflow.blog/2017/02/07/what-programming-languages-weekends/. It was pretty interesting. Let's try to replicate its results here.
NOTE: The dataset may be updated from time to time, so you may not get the same results as Julia Silge and I did. In particular, there's one part below where we use the statsmodels glm function and ran into a PerfectSeparationError. You may or may not get that.
As much as possible, I am trying to avoid loading all the data into memory because my machine is not very powerful. So there are certain operations that could probably be done easily with existing libraries, but I have avoided doing that.
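For reference, pandas can stream a CSV in chunks rather than loading it whole; here is a minimal sketch of that approach (not what I actually use below):
import pandas as pd

# Count rows by streaming questions.csv in 1-million-row chunks,
# so only one chunk is in memory at any time.
n_rows = 0
for chunk in pd.read_csv("questions.csv", chunksize=10**6):
    n_rows += len(chunk)
print(n_rows)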
The stacklite.zip file on Kaggle is 446MB. After decompressing, there are 2 files: questions.csv (862MB) and question_tags.csv (844MB). While it is possible to load the entire dataset into memory, there's no need to. I thought of loading the data into MySQL, but on second thought there was no need for that either.
It is pretty obvious that questions.csv stores questions data. The first 6 lines (including the header) look like this:
Id,CreationDate,ClosedDate,DeletionDate,Score,OwnerUserId,AnswerCount
1,2008-07-31T21:26:37Z,NA,2011-03-28T00:53:47Z,1,NA,0
4,2008-07-31T21:42:52Z,NA,NA,458,8,13
6,2008-07-31T22:08:08Z,NA,NA,207,9,5
8,2008-07-31T23:33:19Z,2013-06-03T04:00:25Z,2015-02-11T08:26:40Z,42,NA,8
9,2008-07-31T23:40:59Z,NA,NA,1410,1,58
Pretty self-explanatory.
The contents of question_tags.csv, however, stumped me a bit:
Id,Tag
1,data
4,c#
4,winforms
4,type-conversion
4,decimal
4,opacity
6,html
6,css
6,css3
6,internet-explorer-7
Initially, I thought it was a list of unique tags. But after checking out Julia Silge's kernel on Kaggle and upon closer inspection, I realized that the Id column is the question id and the Tag column is a tag for that question.
from collections import Counter, defaultdict
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
_ASKED_ON_WEEKDAY = 0
_ASKED_ON_WEEKEND = 1
def _is_weekday(date_string):
return datetime.datetime.strptime(
date_string,
"%Y-%m-%dT%H:%M:%SZ"
).weekday() < 5
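As a quick sanity check of the helper, using the first creation date from the sample above (2008-07-31 fell on a Thursday, so this should print True):
print(_is_weekday("2008-07-31T21:26:37Z"))  # True: a Thursday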
This is actually something we came back to do after getting a few steps in.
For questions.csv, we check that there are no duplicate question ids. For question_tags.csv, we check that there are no duplicate tags for each question.
question_ids = set()
questions_csv_clean = True
with open("questions.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, _, _, _, _, _, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id in question_ids:
print("Duplicate question id: {}".format(question_id))
questions_csv_clean = False
else:
question_ids.add(question_id)
print("questions.csv clean? {}".format(questions_csv_clean))
del question_ids
question_tags = {}
question_tags_csv_clean = True
nr_errors = 0
with open("question_tags.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, tag = line.strip().split(",")
question_id = int(str_question_id)
if question_id not in question_tags:
question_tags[question_id] = set([tag])
elif tag in question_tags[question_id]:
print("Duplicate tag `{}` for question id {}".format(
tag, question_id
))
question_tags_csv_clean = False
nr_errors += 1
if nr_errors >= 10:
break
else:
question_tags[question_id].add(tag)
print("question_tags.csv clean? {}".format(question_tags_csv_clean))
Uh oh. The question_tags.csv file isn't clean. There are a lot more duplicate tags than what I've shown here, but going through the entire data set causes my system to run out of memory, so to save space I'm not gonna show them all here.

Before we clean up the data, let us see if the question ids in question_tags.csv come in non-decreasing order. It will simplify the clean-up work if that's the case.
previous_question_id = -10**9
question_ids_non_decreasing = True
with open("question_tags.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id < previous_question_id:
print("question ids do not come in non-decreasing order. In particular, {} comes after {}".format(
question_id, previous_question_id
))
question_ids_non_decreasing = False
break
else:
previous_question_id = question_id
print("Are question ids in question_tags.csv non-decreasing? {}".format(
question_ids_non_decreasing
))
Nice. This vastly simplifies the clean-up work in that we only need to store all tags for the current question and upon encountering a larger question id, we write the tags for the current question to the new csv file. Rinse and repeat.
question_tags.csv

NOTE: The code in the following cell creates a question_tags_clean.csv file on your system.
with open("question_tags_clean.csv", "w") as out_f:
out_f.write("Id,Tag\n")
with open("question_tags.csv", "r") as in_f:
# skip header
in_f.readline()
current_question_id = None
current_question_tags = set()
for line in in_f:
str_question_id, tag = line.strip().split(",")
question_id = int(str_question_id)
if question_id != current_question_id:
# flush previous question's tags to file
for t in current_question_tags:
out_f.write("{},{}\n".format(current_question_id, t))
current_question_id = question_id
current_question_tags = set([tag,])
else:
current_question_tags.add(tag)
if current_question_id is not None:
for tag in current_question_tags:
out_f.write("{},{}\n".format(current_question_id, tag))
Let us first determine how many unique tags there are.
unique_tags = set()
with open("question_tags_clean.csv", "r") as f:
# Skip header line
f.readline()
for line in f:
_, tag = line.strip().split(",")
unique_tags.add(tag)
len(unique_tags)
Now, let us extract all the tags, along with their associated questions, then only keep those tags with at least 20,000 questions.
tags_to_questions = defaultdict(set)
with open("question_tags_clean.csv", "r") as f:
# Skip header line
f.readline()
for line in f:
question_id, tag = line.strip().split(",")
tags_to_questions[tag.strip()].add(int(question_id))
over20k_tags_to_questions = {}
for tag, questions_set in tags_to_questions.items():
if len(questions_set) >= 20000:
over20k_tags_to_questions[tag] = questions_set
del tags_to_questions
print("# tags with >= 20,000 questions: {}".format(len(over20k_tags_to_questions)))
Let's dump this subset of data to a file. We will shut down this Jupyter notebook and load the smaller data set back in.
with open("over20k_tags_to_questions.txt", "w") as f:
for tag, questions_set in over20k_tags_to_questions.items():
f.write("{}:{}\n".format(tag, ",".join(map(str, questions_set))))
Let's load the subset of data back in:
over20k_tags_to_questions = {}
with open("over20k_tags_to_questions.txt", "r") as f:
for line in f:
tag, questions = line.split(":")
over20k_tags_to_questions[tag] = set(map(int, questions.split(",")))
nr_questions_on_weekdays = 0
nr_questions_on_weekends = 0
with open("questions.csv", "r") as f:
# Skip header line
f.readline()
for line in f:
_, creation_date, _, _, _, _, _ = line.strip().split(",")
if creation_date != "NA":
if _is_weekday(creation_date):
nr_questions_on_weekdays += 1
else:
nr_questions_on_weekends += 1
print("# questions on weekdays: {}".format(nr_questions_on_weekdays))
print("# questions on weekends: {}".format(nr_questions_on_weekends))
Those are quite different numbers from those in Julia Silge's post:
Overall, this includes 10,451,274 questions on weekdays and 2,132,073 questions on weekends.
Probably the data set was updated some time after the post was written.
Something struck me for the next step. What does the author mean by relative frequency? Relative to what? Turns out she included the definition:
Instead, let’s explore which tags made up a larger share of weekend questions than they did of weekday questions, and vice versa.
So we're gonna go through all the tags with at least 20,000 questions, find out how many of the questions for each tag were posted on weekdays and weekends, then do a comparison.
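To make the definition concrete, here is a toy calculation with made-up numbers (not taken from the actual data set):
# Hypothetical tag: 3,000 of 2,000,000 weekend questions overall,
# and 10,000 of 10,000,000 weekday questions overall.
weekend_share = 3000 / 2000000    # 0.0015
weekday_share = 10000 / 10000000  # 0.0010
print(weekend_share / weekday_share)  # 1.5 => 1.5x as prominent on weekends as on weekdays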
# Get all question ids for tags with at least 20,000 questions
qns_with_some_tag_over_20k_qns = set()
for questions_set in over20k_tags_to_questions.values():
qns_with_some_tag_over_20k_qns.update(questions_set)
# For questions we are interested in, get whether they are asked during
# weekday or weekend
qoi_to_wday_wend = dict.fromkeys(
qns_with_some_tag_over_20k_qns,
-1
)
with open("questions.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, creation_date, _, _, _, _, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id in qoi_to_wday_wend and creation_date != "NA":
if datetime.datetime.strptime(
creation_date,
"%Y-%m-%dT%H:%M:%SZ"
).weekday() < 5:
qoi_to_wday_wend[question_id] = _ASKED_ON_WEEKDAY
else:
qoi_to_wday_wend[question_id] = _ASKED_ON_WEEKEND
# Now go through all the tags with over 20,000 questions and count
# the # of questions asked during weekdays vs. weekends
tag_to_nr_qns_on_wday = Counter()
tag_to_nr_qns_on_wend = Counter()
for tag, questions_set in over20k_tags_to_questions.items():
tag_to_nr_qns_on_wday[tag] = tag_to_nr_qns_on_wend[tag] = 0
for question_id in questions_set:
if qoi_to_wday_wend[question_id] == _ASKED_ON_WEEKDAY:
tag_to_nr_qns_on_wday[tag] += 1
elif qoi_to_wday_wend[question_id] == _ASKED_ON_WEEKEND:
tag_to_nr_qns_on_wend[tag] += 1
# Get top 20 tags with higher relative frequency of questions asked during weekdays
_INF = 10**9
wday_rel_freq_list = []
wend_rel_freq_list = []
total_nr_wday_qns = sum(tag_to_nr_qns_on_wday.values())
total_nr_wend_qns = sum(tag_to_nr_qns_on_wend.values())
for tag in over20k_tags_to_questions:
wday_nr_qns = tag_to_nr_qns_on_wday[tag]
wend_nr_qns = tag_to_nr_qns_on_wend[tag]
wday_freq = wday_nr_qns / total_nr_wday_qns
wend_freq = wend_nr_qns / total_nr_wend_qns
if wend_freq == 0:
wday_rel_freq_list.append((tag, _INF,))
else:
wday_rel_freq_list.append((tag, wday_freq / wend_freq,))
if wday_freq == 0:
wend_rel_freq_list.append((tag, _INF,))
else:
wend_rel_freq_list.append((tag, wend_freq / wday_freq,))
wday_rel_freq_list.sort(key=lambda t: t[1])
print(wday_rel_freq_list[-20:])
wend_rel_freq_list.sort(key=lambda t: t[1])
print(wend_rel_freq_list[-20:])
wday_df = pd.DataFrame(
list(
map(
lambda t: [t[0], float(t[1]) * 100],
wday_rel_freq_list[-20:][::-1]
)
),
columns=["tag", "relative_frequency"],
)
wend_df = pd.DataFrame(
list(
map(
lambda t: [t[0], float(t[1]) * 100],
wend_rel_freq_list[-20:][::-1]
)
),
columns=["tag", "relative_frequency"],
)
fig, (ax1, ax2,) = plt.subplots(ncols=2, figsize=(15,15,))
sns.barplot(
x="relative_frequency",
y="tag", data=wday_df,
color=sns.xkcd_rgb["coral"],
ax=ax1,
)
sns.barplot(
x="relative_frequency",
y="tag",
data=wend_df,
color=sns.xkcd_rgb["teal"],
ax=ax2,
)
plt.subplots_adjust(wspace=0.2)
plt.xticks([0, 50, 100, 150, 200, 250, 300, 350, 400,])
plt.subplots_adjust(top=0.9)
fig.suptitle(
"Which tags have the biggest weekday/weekend differences?\nFor tags with more than 20,000 questions",
size=20,
)
ax1.set_title("Weekdays")
ax1.set_xlabel("")
ax1.set_ylabel("")
ax2.set_title("Weekends")
ax2.set_xlabel("")
ax2.set_ylabel("")
fig.text(0.5, 0.09, "Relative frequency", ha="center", size=18,);
The next part deals with tags with biggest decrease in weekend activity. Here, decrease means the change between years 2016 and 2008 (ignoring trends in between those years).
qns_with_some_tag_over_20k_qns__to__tags = defaultdict(set)
for tag, questions_set in over20k_tags_to_questions.items():
for question_id in questions_set:
qns_with_some_tag_over_20k_qns__to__tags[question_id].add(tag)
def _add_to_relevant_count(
tags_activity_dict,
question_id,
creation_date
):
for tag in qns_with_some_tag_over_20k_qns__to__tags[question_id]:
if tag not in tags_activity_dict:
tags_activity_dict[tag] = (0, 0,)
wday_cnt, wend_cnt = tags_activity_dict[tag]
if _is_weekday(creation_date):
tags_activity_dict[tag] = (wday_cnt + 1, wend_cnt,)
else:
tags_activity_dict[tag] = (wday_cnt, wend_cnt + 1,)
tags_with_over20k_qns_2008_activity = {}
tags_with_over20k_qns_2016_activity = {}
with open("questions.csv", "r") as f:
# skip header
f.readline()
for line in f:
str_question_id, creation_date, _, _, _, _, _ = line.strip().split(",")
question_id = int(str_question_id)
if question_id in qns_with_some_tag_over_20k_qns__to__tags:
str_year_of_creation = creation_date[:4]
if str_year_of_creation == "2008":
_add_to_relevant_count(
tags_with_over20k_qns_2008_activity,
question_id,
creation_date
)
elif str_year_of_creation == "2016":
_add_to_relevant_count(
tags_with_over20k_qns_2016_activity,
question_id,
creation_date
)
And... in the above step, my computer ran out of memory. Lol. Time to spin up an EC2 instance.
tags_wend_over_wday_ratio = []
tags_with_zero_count = 0
for tag in tags_with_over20k_qns_2008_activity:
if tag in tags_with_over20k_qns_2016_activity:
wday_2008, wend_2008 = tags_with_over20k_qns_2008_activity[tag]
wday_2016, wend_2016 = tags_with_over20k_qns_2016_activity[tag]
# Let's ignore tags that have 0 counts, because one of the ratios will be
# infinity
if wday_2008 != 0 and wend_2008 != 0 and wday_2016 != 0 and wend_2016 != 0:
tags_wend_over_wday_ratio.append(
(tag, wend_2008 / wday_2008, wend_2016 / wday_2016,)
)
tags_wend_over_wday_ratio.sort(
key=lambda t: t[1] - t[2]
)
print(tags_wend_over_wday_ratio[-10:])
print(sorted(tags_wend_over_wday_ratio, key=lambda t: t[2] - t[1])[-10:])
#print(tags_with_over20k_qns_2008_activity)
print(tags_with_over20k_qns_2008_activity["scala"])
print(tags_with_over20k_qns_2016_activity["scala"])
print(sorted(tags_wend_over_wday_ratio, key=lambda t: t[2])[-10:])
Looks very different from what's in Julia Silge's post. In fact, upon reading the R code in her Kaggle kernel, I found that our approach is completely different. Let's redo everything.
Turns out that in Julia Silge's kernel, she first keeps only the questions whose deletion date is NA, i.e. she filters out every question that has been deleted. Let's do that and drop the unnecessary columns.
questions = pd.read_csv("questions.csv")
questions = questions.loc[questions["DeletionDate"].isnull(), :]
questions.pop("DeletionDate")
questions.pop("ClosedDate")
questions.pop("Score")
questions.pop("OwnerUserId")
questions.pop("AnswerCount")
questions.head()
Now we load all the tags for the questions, but only for those questions which have not been deleted. We do this by using the merge method of pandas.DataFrame.
question_tags_clean = pd.read_csv("question_tags_clean.csv")
question_tags_clean = question_tags_clean.merge(
questions,
on="Id",
how="inner",
)
question_tags_clean.pop("CreationDate");
We are only interested in tags which have over 20,000 questions. Time to do some filtering.
tags_with_counts = question_tags_clean.groupby("Tag").count()
tags_with_counts = tags_with_counts.reset_index()
tags_with_counts.columns = ["Tag", "Count"]
tags_with_counts = tags_with_counts[tags_with_counts.Count > 20000]
question_tags_clean = question_tags_clean.merge(
tags_with_counts,
on="Tag",
how="inner",
)
question_tags_clean.pop("Count")
del tags_with_counts
Let's obtain the set of questions with some tag that has over 20,000 questions.
questions_with_tag_over_20k = questions.merge(
pd.DataFrame({"Id": question_tags_clean["Id"].unique(),}),
on="Id",
how="inner",
)
questions_with_tag_over_20k["Weekday"] = \
questions_with_tag_over_20k["CreationDate"].apply(_is_weekday)
tag_wday_counts = question_tags_clean.merge(
questions_with_tag_over_20k,
on="Id",
how="inner"
).groupby(["Tag", "Weekday"]).count()
tag_wday_counts.head()
Let's get rid of the hierarchical index formed by groupby.
tag_wday_counts = tag_wday_counts.reset_index()
tag_wday_counts.head()
The CreationDate column is redundant. Let's remove it.
tag_wday_counts.pop("CreationDate")
tag_wday_counts.columns = ["Tag", "Weekday", "Count"]
tag_wday_counts.head()
We want something like spread in tidyr. I saw on https://chrisalbon.com/python/pandas_long_to_wide.html that we can use pivot.
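For reference, here is a toy illustration (made-up rows) of what pivot does, going from long to wide:
toy_long = pd.DataFrame({
    "Tag": ["c#", "c#", "html", "html"],
    "Weekday": [False, True, False, True],
    "Count": [10, 90, 20, 80],
})
toy_wide = toy_long.pivot(index="Tag", columns="Weekday", values="Count")
# toy_wide now has one row per Tag and one column per Weekday value:
# Weekday  False  True
# Tag
# c#          10    90
# html        20    80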
tag_wday_counts = tag_wday_counts.pivot(
index="Tag",
columns="Weekday",
values="Count"
)
tag_wday_counts.head()
Let's remove the hierarchical index using reset_index.
tag_wday_counts = tag_wday_counts.reset_index()
tag_wday_counts.head()
Let's rename the columns:
tag_wday_counts.columns=["Tag", "Weekend", "Weekday"]
tag_wday_counts.head()
To obtain the rate of questions asked during weekends and weekdays for each tag, we have to calculate the total number of questions asked during weekdays and weekends.
questions["Weekday"] = questions["CreationDate"].apply(_is_weekday)
x = questions.groupby("Weekday").count()
x.head()
nr_wday_qns = x.loc[True, "Id"]
nr_wend_qns = x.loc[False, "Id"]
x = tag_wday_counts.copy()
x["Weekday"] /= nr_wday_qns
x["Weekend"] /= nr_wend_qns
x[x["Weekday"] == 0].shape
x["WeekendOverWeekday"] = x["Weekend"] / x["Weekday"]
x = x.sort_values(by="WeekendOverWeekday", ascending=False,)
x[:16]
These are the exact same tags as those in the original post, and in the same order. Now let's try getting at the highest weekday/weekend activity tags.
x[-16:][::-1]
Barring the incorrect numbers in the WeekendOverWeekday column, which can be corrected by taking reciprocals, the tags are the same as those in the original post and in the same order.
Now let's plot the bar charts.
highest_wend_plot_df = x[:16].loc[:, ["Tag", "WeekendOverWeekday"]]
highest_wend_plot_df["WeekendOverWeekday"] *= 100
highest_wday_plot_df = x[-16:].loc[:, ["Tag", "WeekendOverWeekday"]][::-1]
highest_wday_plot_df["WeekendOverWeekday"] = 1 / highest_wday_plot_df["WeekendOverWeekday"] * 100
fig, (ax1, ax2,) = plt.subplots(ncols=2, figsize=(12,12,))
sns.barplot(
x="WeekendOverWeekday",
y="Tag",
data=highest_wday_plot_df,
color=sns.xkcd_rgb["coral"],
ax=ax1,
)
sns.barplot(
x="WeekendOverWeekday",
y="Tag",
data=highest_wend_plot_df,
color=sns.xkcd_rgb["teal"],
ax=ax2,
)
plt.subplots_adjust(wspace=0.2)
plt.xticks([0, 50, 100, 150, 200, 250, 300,])
plt.subplots_adjust(top=0.9)
fig.suptitle(
"Which tags have the biggest weekday/weekend differences?\nFor tags with more than 20,000 questions",
size=20,
)
ax1.set_title("Weekdays")
ax1.set_xlabel("")
ax1.set_ylabel("")
ax2.set_title("Weekends")
ax2.set_xlabel("")
ax2.set_ylabel("")
fig.text(0.5, 0.07, "Relative frequency", ha="center", size=18,);
Nice. So that's one mystery solved. Now comes the harder part.
For the next part, we will need to compute, for each tag with over 20,000 questions, the number of its questions asked on weekdays and on weekends in each year. And for each year, we also need the total number of questions asked during weekends and weekdays. Let's do the latter first.
# Add year information to `questions`
questions["Year"] = questions["CreationDate"].apply(
lambda datestring: datestring[:4]
)
year_wday_counts = questions.copy()
year_wday_counts = year_wday_counts.groupby(
["Year", "Weekday"]
).count()
year_wday_counts = year_wday_counts.reset_index()
year_wday_counts.pop("Id")
year_wday_counts.rename(
columns={
"CreationDate": "Total",
},
inplace=True,
)
year_wday_counts.head()
Seems correct enough to me. Moving on.
Earlier we computed:
- questions_with_tag_over_20k: the set of all questions with some tag that has over 20,000 questions
- tag_wday_counts: the set of all tags where each tag has over 20,000 questions, along with the breakdown of the number of questions asked during weekdays and weekends
- question_tags_clean: the cleaned data originally from question_tags.csv
Let's take a look at them.
questions_with_tag_over_20k.head()
tag_wday_counts.head()
question_tags_clean.head()
Time to add year information to questions_with_tag_over_20k.
questions_with_tag_over_20k["Year"] = questions_with_tag_over_20k["CreationDate"].apply(
lambda datestring: datestring[:4]
)
Now let's break down the counts for each tag by year and by weekday/weekend.
tag_over_20k_year_wend_counts = questions_with_tag_over_20k.merge(
question_tags_clean,
how="left",
on="Id",
).groupby(["Year", "Tag", "Weekday"]).count()
tag_over_20k_year_wend_counts.head()
tag_over_20k_year_wend_counts.pop("CreationDate")
tag_over_20k_year_wend_counts.rename(
columns={"Id": "WeekendTotal"},
inplace=True,
)
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts.reset_index()
tag_over_20k_year_wend_counts.head()
My pandas skills aren't so awesome, so I'll extract the weekday counts from this dataframe, drop the rows we extracted, then do a merge to insert the weekday counts.
weekday_counts = tag_over_20k_year_wend_counts.loc[
tag_over_20k_year_wend_counts["Weekday"] == True
]
weekday_counts.head()
# Drop weekday counts
tag_over_20k_year_wend_counts.drop(
tag_over_20k_year_wend_counts[
tag_over_20k_year_wend_counts["Weekday"] == True
].index,
inplace=True,
)
tag_over_20k_year_wend_counts.head()
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts.merge(
weekday_counts,
how="inner",
on=["Year", "Tag",],
suffixes=("_l", "_r",),
)
tag_over_20k_year_wend_counts.head()
tag_over_20k_year_wend_counts.pop("Weekday_l")
tag_over_20k_year_wend_counts.pop("Weekday_r")
tag_over_20k_year_wend_counts.rename(
columns={
"WeekendTotal_l": "WeekendTotal",
"WeekendTotal_r": "WeekdayTotal",
},
inplace=True,
)
tag_over_20k_year_wend_counts.head()
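As an aside, the split-and-merge above could probably be done in one step with pivot_table, applied to the frame as it looked right after the reset_index (columns Year, Tag, Weekday and the count column). A sketch, where long_counts stands for a hypothetical copy of that earlier frame:
# `long_counts` is hypothetical: a copy of tag_over_20k_year_wend_counts taken
# right after the reset_index above, with columns Year, Tag, Weekday, WeekendTotal.
wide_counts = (
    long_counts
    .pivot_table(
        index=["Year", "Tag"],
        columns="Weekday",
        values="WeekendTotal",
        aggfunc="sum",
    )
    .reset_index()
)
# Columns are now Year, Tag, False (weekend counts), True (weekday counts).
wide_counts.columns = ["Year", "Tag", "WeekendTotal", "WeekdayTotal"]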
Nice. Now we almost have what we want in tag_over_20k_year_wend_counts and year_wday_counts.

But we need to drop tags whose combined total count in any year of occurrence is 20 or below.
A small experiment below shows that the Year column is not of integer type.
tag_over_20k_year_wend_counts[
(tag_over_20k_year_wend_counts["Year"] == 2008) &
(tag_over_20k_year_wend_counts["Tag"] == ".htaccess")
]
tag_over_20k_year_wend_counts.dtypes
Let's convert Year to integer.
tag_over_20k_year_wend_counts["Year"] = tag_over_20k_year_wend_counts["Year"].astype(int)
tag_over_20k_year_wend_counts[
(tag_over_20k_year_wend_counts["Year"] == 2008) &
(tag_over_20k_year_wend_counts["Tag"] == ".htaccess")
]
Success!
tags_to_remove = set()
for tag in tag_over_20k_year_wend_counts["Tag"].unique():
for year in range(2008, 2016 + 1):
tag_year_df = tag_over_20k_year_wend_counts[
(tag_over_20k_year_wend_counts["Year"] == year) &
(tag_over_20k_year_wend_counts["Tag"] == tag)
]
if tag_year_df.shape[0] == 1:
if tag_year_df.iloc[0]["WeekendTotal"] + tag_year_df.iloc[0]["WeekdayTotal"] <= 20:
tags_to_remove.add(tag)
break
print(tags_to_remove)
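As an aside, the same set could probably be computed without the nested loop by taking the minimum yearly total per tag; a sketch (keeping the 2008-2016 restriction used in the loop above):
# Minimum combined weekend + weekday count per tag across the years 2008-2016.
yearly_totals = (
    tag_over_20k_year_wend_counts
    .loc[tag_over_20k_year_wend_counts["Year"].between(2008, 2016)]
    .assign(YearTotal=lambda d: d["WeekendTotal"] + d["WeekdayTotal"])
    .groupby("Tag")["YearTotal"]
    .min()
)
tags_to_remove_alt = set(yearly_totals[yearly_totals <= 20].index)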
Now we remove those tags whose total number of posts in any year is <= 20:
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts[
~tag_over_20k_year_wend_counts["Tag"].isin(tags_to_remove)
]
We'll also delete data for the year 2017.
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts[
tag_over_20k_year_wend_counts["Year"] != 2017
]
For the next part, we'll be extracting the Year, WeekendTotal and WeekdayTotal data for each tag, then perform the "modeling" (which is totally non-obvious and which we'll go through in some detail later on) to obtain the tags whose weekend proportion has changed the most over the years. To do this, we have to make use of the statsmodels package.
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.sandbox.stats.multicomp as sm_multicomp
unique_tags = tag_over_20k_year_wend_counts["Tag"].unique()
trend_data_for_each_tag = {}
for tag in unique_tags:
trend_data_for_each_tag[tag] = tag_over_20k_year_wend_counts[
tag_over_20k_year_wend_counts["Tag"] == tag
]
So this is part of the "modeling" Julia Silge mentioned in one sentence of her blog post. This is almost the equivalent of the glm(cbind(nn, YearTagTotal) ~ Year, ., family="binomial") in her R kernel, except that I believe there is a bug and the 2nd element in the cbind should be the number of questions with that tag asked on weekdays for that year, not the total number of questions with that tag asked in the year.
According to the R documentation for glm:
A typical predictor has the form ‘response ~ terms’ where
‘response’ is the (numeric) response vector and ‘terms’ is a
series of terms which specifies a linear predictor for ‘response’.
For ‘binomial’ and ‘quasibinomial’ families the response can also
be specified as a ‘factor’ (when the first level denotes failure
and all others success) or as a two-column matrix with the columns
giving the numbers of successes and failures. A terms
specification of the form ‘first + second’ indicates all the terms
in ‘first’ together with all the terms in ‘second’ with any
duplicates removed.
We got a PerfectSeparationError during the model fitting.
trend_models = {}
for tag, df in trend_data_for_each_tag.items():
try:
trend_models[tag] = smf.glm(
formula="WeekendTotal + WeekdayTotal ~ Year",
data=df,
family=sm.families.Binomial(),
).fit()
except statsmodels.tools.sm_exceptions.PerfectSeparationError:
print(tag)
trend_data_for_each_tag["angular2"]
I do not know the exact reason for this. In fact, I do not fully know what glm is doing in this case. It seems to be doing logistic regression, but for regression rather than classification. I couldn't find any similar functionality in scikit-learn, so I had to use the statsmodels library.
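For what it's worth, here is a toy sketch (with made-up counts) of what the two-column response does: it fits a logistic curve to the proportion of a tag's questions asked on weekends, weighted by the yearly totals, so the sign of the Year coefficient says whether the weekend share is rising or falling.
# Made-up counts for a single hypothetical tag.
toy = pd.DataFrame({
    "Year": [2008, 2010, 2012, 2014, 2016],
    "WeekendTotal": [120, 300, 500, 650, 700],
    "WeekdayTotal": [800, 2200, 4100, 6000, 7500],
})
# (successes, failures) ~ Year models logit(weekend share) = b0 + b1 * Year.
toy_model = smf.glm(
    "WeekendTotal + WeekdayTotal ~ Year",
    data=toy,
    family=sm.families.Binomial(),
).fit()
print(toy_model.params["Year"])  # negative here: the weekend share is shrinking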
The next part extracts the p-value for Year from each model, adjusts the p-values for multiple comparisons, and then retains only those tags whose adjusted p-value is less than 0.1.
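For context, multipletests returns a tuple whose second element is the array of adjusted p-values (that is what the [1] below indexes); a toy illustration with made-up p-values:
# multipletests returns (reject, pvals_corrected, alphacSidak, alphacBonf).
reject, pvals_corrected, _, _ = sm_multicomp.multipletests([0.001, 0.04, 0.20])
print(pvals_corrected)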
tags_with_year_pvalues = []
for tag, model in trend_models.items():
tags_with_year_pvalues.append((tag, model.pvalues["Year"],))
adjusted_pvalues = sm_multicomp.multipletests(
list(map(lambda t: t[1], tags_with_year_pvalues))
)[1]
non_fluke_tags = list(
map(
lambda t: t[0],
filter(
lambda t: t[1] < 0.1,
list(
zip(
map(lambda t: t[0], tags_with_year_pvalues),
adjusted_pvalues
)
)
)
)
)
Now we extract the 8 tags whose weekend activity relative to weekday activity has decreased the most.
tag_activity_trend = []
for tag, model in trend_models.items():
if tag in non_fluke_tags:
tag_activity_trend.append((tag, model.params["Year"],))
tag_activity_trend.sort(key=lambda t: t[1])
tags_most_decreased_wend_to_wday_activity = tag_activity_trend[:8]
tags_most_increased_wend_to_wday_activity = tag_activity_trend[-8:]
tags_most_decreased_wend_to_wday_activity
tags_most_increased_wend_to_wday_activity
For the top 8 decreases, Julia Silge's analysis contains the tags azure and ruby-on-rails-4, whereas ours doesn't. But we have the tags go and extjs. Still pretty similar.
For the top 8 increases, Julia Silge's analysis contains the tags android-layout and unity3d, whereas ours doesn't. But we have the tags jquery-mobile and swing. Pretty similar too.
Let's plot the actual data for these tags.
# Turns out the `Year` column on `year_wday_counts` is not int.
# Let's convert it
year_wday_counts["Year"] = year_wday_counts["Year"].astype(int)
year_wday_qn_counts = year_wday_counts[year_wday_counts["Weekday"]].copy()
year_wday_qn_counts.pop("Weekday")
year_wend_qn_counts = year_wday_counts[year_wday_counts["Weekday"] != True].copy()
year_wend_qn_counts.pop("Weekday");
fig, ax = plt.subplots(figsize=(12,12,))
trend_plot_colors = [
sns.xkcd_rgb["salmon"],
sns.xkcd_rgb["tan"],
sns.xkcd_rgb["dark lime"],
sns.xkcd_rgb["greenish teal"],
sns.xkcd_rgb["brown red"],
sns.xkcd_rgb["azure"],
sns.xkcd_rgb["light purple"],
sns.xkcd_rgb["pink"],
]
idx = 0
for tag, _ in tags_most_decreased_wend_to_wday_activity:
df = trend_data_for_each_tag[tag].copy()
df = df.merge(
year_wend_qn_counts,
on="Year",
how="left",
)
df["WeekendTotal"] /= df["Total"]
df.pop("Total")
df = df.merge(
year_wday_qn_counts,
on="Year",
how="left",
)
df["WeekdayTotal"] /= df["Total"]
df.pop("Total")
df["Ratio"] = df["WeekendTotal"] / df["WeekdayTotal"]
ax.plot(
df["Year"],
df["Ratio"],
label=tag,
linewidth=5,
color=trend_plot_colors[idx],
)
idx += 1
leg = ax.legend(fontsize=15,)
leg.set_title("Tag", prop={"size": 18,})
ax.set_ylabel("Relative weekend/weekday use", fontsize=15, labelpad=20,)
ax.set_title(
"Which tags' weekend activity has decreased the most?",
fontsize=18,
fontweight="bold",
)
ttl = ax.title
ttl.set_position([0.5, 1.02])
Of note in this plot:
Time to do the same plot but for tags whose weekend activity has increased the most.
fig, ax = plt.subplots(figsize=(12,12,))
idx = 0
for tag, _ in tags_most_increased_wend_to_wday_activity:
df = trend_data_for_each_tag[tag].copy()
df = df.merge(
year_wend_qn_counts,
on="Year",
how="left",
)
df["WeekendTotal"] /= df["Total"]
df.pop("Total")
df = df.merge(
year_wday_qn_counts,
on="Year",
how="left",
)
df["WeekdayTotal"] /= df["Total"]
df.pop("Total")
df["Ratio"] = df["WeekendTotal"] / df["WeekdayTotal"]
ax.plot(
df["Year"],
df["Ratio"],
label=tag,
linewidth=5,
color=trend_plot_colors[idx],
)
idx += 1
leg = ax.legend(fontsize=15,)
leg.set_title("Tag", prop={"size": 18,})
ax.set_ylabel("Relative weekend/weekday use", fontsize=15, labelpad=20,)
ax.set_title(
"Which tags' weekend activity has increased the most?",
fontsize=18,
fontweight="bold",
)
ttl = ax.title
ttl.set_position([0.5, 1.02])
Of note in this plot:
- Many of the tags here look mobile related: jquery-mobile, listview, android-fragments, button, gridview.
- swing is seeing increasing weekend usage. Could it also be mobile related?
- selenium experienced a decline from 2008 to 2012, then saw some resurgence from 2012 to 2016.
- gridview and selenium, despite being on this plot, are still weekday-dominant technologies.

Time for a final plot that shows all the tags with over 20,000 questions and their relative weekend/weekday activity, overall.
tag_wend_wday_rel_activity = tag_wday_counts.copy()
tag_wend_wday_rel_activity["Total"] = \
tag_wend_wday_rel_activity["Weekday"] + tag_wend_wday_rel_activity["Weekend"]
tag_wend_wday_rel_activity["Weekend"] /= nr_wend_qns
tag_wend_wday_rel_activity["Weekday"] /= nr_wday_qns
tag_wend_wday_rel_activity["WeekendOverWeekday"] = \
tag_wend_wday_rel_activity["Weekend"] / tag_wend_wday_rel_activity["Weekday"]
fig, ax = plt.subplots(figsize=(15, 15,))
ax.scatter(
tag_wend_wday_rel_activity["Total"],
tag_wend_wday_rel_activity["WeekendOverWeekday"],
)
for tag, coord in zip(
tag_wend_wday_rel_activity["Tag"],
zip(
tag_wend_wday_rel_activity["Total"],
tag_wend_wday_rel_activity["WeekendOverWeekday"]
)
):
ax.annotate(tag, coord)
plt.axhline(y=1, linewidth=4, color="red", linestyle="dashed",)
ax.set_xscale("log")
ax.set_xlabel("Total # of questions", fontsize=15, labelpad=20,)
ax.set_ylabel("Relative use on weekends vs. weekdays", fontsize=15,)
ax.set_xticks([1e4, 1e5, 1e6])
ax.set_xticklabels(["1e+04", "1e+05", "1e+06",])
ax.set_yticks([0.5, 1, 2])
ax.set_yticklabels(["1/2 on Weekends", "Same", "2x on Weekends"])
fig.suptitle(
"Which tags have the biggest weekend/weekday differences?",
fontsize=20,
fontweight="bold",
x=0.44,
)
ttl = ax.set_title(
"For tags with more than 20,000 questions",
loc="left",
fontsize=15,
)
ttl.set_y(1.01)
plt.subplots_adjust(top=0.93)
Imho, not a very useful plot because of the clustering between 10,000 and 100,000 questions. This is probably best for identifying "outliers": we can easily spot the very popular technologies on the right side and the technologies which are very weekend or weekday dominant.
This was pretty instructive. Initially we didn't take into account that we should exclude deleted questions. Moreover, we tried too hard not to load everything into memory but ironically ended up using even more memory. Lol. I am very thankful for AWS. The relatively small size of this data set also means that an m4.xlarge instance with 16GB of memory is sufficient for our purposes, and we end up with a pretty small bill.
A few other takeaways and open questions:
- pandas' pivot turned out to be the equivalent of spread in R's tidyr package.
- The downside is that there is almost no machine learning involved, which is the area I want to get the most hands-on experience in.
- What model is glm(cbind(nn, YearTagTotal) ~ Year, data, family="binomial") using? And what do the p-values represent?
- I kept having to make .head() calls or print the dataframes out to see their schema. Is there a better way to manage these dataframes?

Disclaimer: Opinions expressed on this blog are solely my own and do not express the views or opinions of my employer(s), past or present.