Yan Han's blog

On Computer Technology

01 Apr 2017

Reproducing "What Programming Languages Are Used Most on Weekends?"

I have a confession to make. I don't feel that I am a naturally creative person. I don't mean that my mind is totally empty when it comes to new ideas; it's just that I find myself entering a state of helplessness whenever I try too hard to come up with totally novel ideas that would make for good publishing material.

I am looking to transition to a more machine learning oriented role for my next job. I need practice, lots of it. I lack ideas. Why not replicate what other people have done and get some hands-on practice in the process?

A few weeks ago, I read https://stackoverflow.blog/2017/02/07/what-programming-languages-weekends/. It was pretty interesting. Let's try to replicate its results here.

Source code

NOTE: The dataset may be updated from time to time, so you may not get the same results as Julia Silge and me. In particular, there is one part below where the statsmodels glm function raised a PerfectSeparationError. You may or may not get that.

A little note

As much as possible, I am trying to avoid loading all the data into memory because my machine is not very powerful. So there are certain operations that could probably be done easily using libraries, but I have avoided doing so.
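As an aside, when pandas is needed on the bigger files, they can be streamed in chunks instead of being loaded whole. A minimal sketch, not used in the cells below (which mostly iterate over the files line by line):

import pandas as pd

# Count the rows of questions.csv without holding the whole ~862MB file in memory.
nr_rows = 0
for chunk in pd.read_csv("questions.csv", chunksize=1000000):
    nr_rows += len(chunk)
print(nr_rows)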

About the data

The stacklite.zip file on Kaggle is 446MB. After decompressing, there are 2 files, questions.csv and question_tags.csv. questions.csv is 862MB and question_tags.csv is 844MB. While it is possible to load the entire dataset into memory, there's no need to. I thought of loading the data into MySQL, but decided that was unnecessary as well.

It is pretty obvious that questions.csv stores questions data. The first 6 lines (including the header) look like this:

Id,CreationDate,ClosedDate,DeletionDate,Score,OwnerUserId,AnswerCount
1,2008-07-31T21:26:37Z,NA,2011-03-28T00:53:47Z,1,NA,0
4,2008-07-31T21:42:52Z,NA,NA,458,8,13
6,2008-07-31T22:08:08Z,NA,NA,207,9,5
8,2008-07-31T23:33:19Z,2013-06-03T04:00:25Z,2015-02-11T08:26:40Z,42,NA,8
9,2008-07-31T23:40:59Z,NA,NA,1410,1,58

Pretty self-explanatory.

The contents of question_tags.csv, however, stumped me a bit:

Id,Tag
1,data
4,c#
4,winforms
4,type-conversion
4,decimal
4,opacity
6,html
6,css
6,css3
6,internet-explorer-7

Initially, I thought it was a list of unique tags. But after checking out Julia Silge's kernel on Kaggle and upon closer inspection, I realized the Id column represents the question id and the Tag column is a tag for that question.
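A quick way to confirm this reading of the file (a small sketch; pandas is only imported in the next section):

import pandas as pd

# Read just the sample rows shown above and group by Id: a single question id
# maps to multiple tags, so Id is a question id rather than a tag id.
sample = pd.read_csv("question_tags.csv", nrows=10)
print(sample.groupby("Id")["Tag"].apply(list))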

Library imports, globals and helper functions

In [1]:
from collections import Counter, defaultdict

import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

# Sentinel values recording whether a question was asked on a weekday or a weekend
_ASKED_ON_WEEKDAY = 0
_ASKED_ON_WEEKEND = 1

def _is_weekday(date_string):
    # CreationDate strings look like "2008-07-31T21:26:37Z";
    # weekday() < 5 means Monday (0) through Friday (4).
    return datetime.datetime.strptime(
        date_string,
        "%Y-%m-%dT%H:%M:%SZ"
    ).weekday() < 5
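As a quick sanity check of the date format and the weekday() < 5 cutoff (the second timestamp is a made-up example):

# 2008-07-31 was a Thursday (weekday() == 3), so this counts as a weekday.
print(_is_weekday("2008-07-31T21:26:37Z"))   # True
# A made-up Saturday timestamp (weekday() == 5) does not.
print(_is_weekday("2008-08-02T12:00:00Z"))   # False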

Check whether data is clean

This is actually something we came back to do after getting a few steps in.

For questions.csv, we check that there are no duplicate question ids.

For question_tags.csv, we check that there are no duplicate tags for each question.

In [5]:
question_ids = set()
questions_csv_clean = True
with open("questions.csv", "r") as f:
    # skip header
    f.readline()
    for line in f:
        str_question_id, _, _, _, _, _, _ = line.strip().split(",")
        question_id = int(str_question_id)
        if question_id in question_ids:
            print("Duplicate question id: {}".format(question_id))
            questions_csv_clean = False
        else:
            question_ids.add(question_id)
print("questions.csv clean? {}".format(questions_csv_clean))
questions.csv clean? True
In [6]:
del question_ids
In [3]:
question_tags = {}
question_tags_csv_clean = True
nr_errors = 0
with open("question_tags.csv", "r") as f:
    # skip header
    f.readline()
    for line in f:
        str_question_id, tag = line.strip().split(",")
        question_id = int(str_question_id)
        if question_id not in question_tags:
            question_tags[question_id] = set([tag])
        elif tag in question_tags[question_id]:
            print("Duplicate tag `{}` for question id {}".format(
                tag, question_id
            ))
            question_tags_csv_clean = False
            nr_errors += 1
            if nr_errors >= 10:
                break
        else:
            question_tags[question_id].add(tag)
print("question_tags.csv clean? {}".format(question_tags_csv_clean))
Duplicate tag `security` for question id 9800
Duplicate tag `localization` for question id 14158
Duplicate tag `xslt` for question id 20863
Duplicate tag `winapi` for question id 54086
Duplicate tag `xslt` for question id 56567
Duplicate tag `fun` for question id 56582
Duplicate tag `web-applications` for question id 56959
Duplicate tag `shell` for question id 72428
Duplicate tag `interview-questions` for question id 85753
Duplicate tag `security` for question id 87393
question_tags.csv clean? False

Uh oh. The question_tags.csv file isn't clean. There are a lot more duplicate tags than the ones shown here, but going through the entire data set causes my system to run out of memory, so I'm only showing the first ten.

Before we clean up the data, let us see if the question ids in question_tags.csv come in non-decreasing order. It will simplify the clean-up work if that's the case.

In [3]:
previous_question_id = -10**9
question_ids_non_decreasing = True
with open("question_tags.csv", "r") as f:
    # skip header
    f.readline()
    for line in f:
        str_question_id, _ = line.strip().split(",")
        question_id = int(str_question_id)
        if question_id < previous_question_id:
            print("question ids do not come in non-decreasing order. In particular, {} comes after {}".format(
                question_id, previous_question_id
            ))
            question_ids_non_decreasing = False
            break
        else:
            previous_question_id = question_id
print("Are question ids in question_tags.csv non-decreasing? {}".format(
    question_ids_non_decreasing
))
Are question ids in question_tags.csv non-decreasing? True

Nice. This vastly simplifies the clean-up work: we only need to store the tags for the current question and, upon encountering a larger question id, write the current question's tags to the new CSV file. Rinse and repeat.

Cleaning up question_tags.csv

NOTE: The code in the following cell creates a question_tags_clean.csv file on your system.

In [2]:
with open("question_tags_clean.csv", "w") as out_f:
    out_f.write("Id,Tag\n")
    with open("question_tags.csv", "r") as in_f:
        # skip header
        in_f.readline()
        current_question_id = None
        current_question_tags = set()
        for line in in_f:
            str_question_id, tag = line.strip().split(",")
            question_id = int(str_question_id)
            if question_id != current_question_id:
                # flush previous question's tags to file
                for t in current_question_tags:
                    out_f.write("{},{}\n".format(current_question_id, t))
                current_question_id = question_id
                current_question_tags = set([tag,])
            else:
                current_question_tags.add(tag)
    if current_question_id is not None:
        for tag in current_question_tags:
            out_f.write("{},{}\n".format(current_question_id, tag))

Diving in

Let us first determine how many unique tags there are.

In [4]:
unique_tags = set()
with open("question_tags_clean.csv", "r") as f:
    # Skip header line
    f.readline()
    for line in f:
        _, tag = line.strip().split(",")
        unique_tags.add(tag)
len(unique_tags)
Out[4]:
58256

Now, let us extract all the tags, along with their associated questions, then keep only those tags with at least 20,000 questions.

In [3]:
tags_to_questions = defaultdict(set)
with open("question_tags_clean.csv", "r") as f:
    # Skip header line
    f.readline()
    for line in f:
        question_id, tag = line.strip().split(",")
        tags_to_questions[tag.strip()].add(int(question_id))
over20k_tags_to_questions = {}
for tag, questions_set in tags_to_questions.items():
    if len(questions_set) >= 20000:
        over20k_tags_to_questions[tag] = questions_set
del tags_to_questions
print("# tags with >= 20,000 questions: {}".format(len(over20k_tags_to_questions)))
# tags with >= 20,000 questions: 319

Let's dump this subset of data to a file. We will shut down this Jupyter notebook and load that smaller data set back in.

In [4]:
with open("over20k_tags_to_questions.txt", "w") as f:
    for tag, questions_set in over20k_tags_to_questions.items():
        f.write("{}:{}\n".format(tag, ",".join(map(str, questions_set))))

Let's load the subset of data back in:

In [5]:
over20k_tags_to_questions = {}
with open("over20k_tags_to_questions.txt", "r") as f:
    for line in f:
        tag, questions = line.split(":")
        over20k_tags_to_questions[tag] = set(map(int, questions.split(",")))

How many questions on weekdays? How many questions on weekends?

In [8]:
nr_questions_on_weekdays = 0
nr_questions_on_weekends = 0
with open("questions.csv", "r") as f:
    # Skip header line
    f.readline()
    for line in f:
        _, creation_date, _, _, _, _, _ = line.strip().split(",")
        if creation_date != "NA":
            if _is_weekday(creation_date):
                nr_questions_on_weekdays += 1
            else:
                nr_questions_on_weekends += 1
In [9]:
print("# questions on weekdays: {}".format(nr_questions_on_weekdays))
print("# questions on weekends: {}".format(nr_questions_on_weekends))
# questions on weekdays: 14227389
# questions on weekends: 2976435

Those are quite different numbers from those in Julia Silge's post:

Overall, this includes 10,451,274 questions on weekdays and 2,132,073 questions on weekends.

Probably the data set was updated some time after the post was written.

Something struck me about the next step: what does the author mean by relative frequency? Relative to what? Turns out she included the definition:

Instead, let’s explore which tags made up a larger share of weekend questions than they did of weekday questions, and vice versa.

So we're gonna go through all the tags with at least 20,000 questions, find out how many of the questions for each tag were posted on weekdays and weekends, then do a comparison.
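To make the definition concrete, here is a tiny worked example with made-up per-tag counts and the overall weekday/weekend totals computed above (the cells below normalise by the totals over the >= 20,000-question tags instead, but the idea is the same):

# Suppose a tag has 3,000 weekday questions and 1,000 weekend questions.
wday_share = 3000 / 14227389    # its share of all weekday questions
wend_share = 1000 / 2976435     # its share of all weekend questions
print(wend_share / wday_share)  # ~1.59, i.e. the tag leans towards weekends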

In [6]:
# Get all question ids for tags with at least 20,000 questions
qns_with_some_tag_over_20k_qns = set()
for questions_set in over20k_tags_to_questions.values():
    qns_with_some_tag_over_20k_qns.update(questions_set)
In [4]:
# For questions we are interested in, get whether they are asked during
# weekday or weekend
qoi_to_wday_wend = dict.fromkeys(
    qns_with_some_tag_over_20k_qns,
    -1
)
with open("questions.csv", "r") as f:
    # skip header
    f.readline()
    for line in f:
        str_question_id, creation_date, _, _, _, _, _ = line.strip().split(",")
        question_id = int(str_question_id)
        if question_id in qoi_to_wday_wend and creation_date != "NA":
            if datetime.datetime.strptime(
                creation_date,
                "%Y-%m-%dT%H:%M:%SZ"
            ).weekday() < 5:
                qoi_to_wday_wend[question_id] = _ASKED_ON_WEEKDAY
            else:
                qoi_to_wday_wend[question_id] = _ASKED_ON_WEEKEND


# Now go through all the tags with over 20,000 questions and count
# the # of questions asked during weekdays vs. weekends
tag_to_nr_qns_on_wday = Counter()
tag_to_nr_qns_on_wend = Counter()
for tag, questions_set in over20k_tags_to_questions.items():
    tag_to_nr_qns_on_wday[tag] = tag_to_nr_qns_on_wend[tag] = 0
    for question_id in questions_set:
        if qoi_to_wday_wend[question_id] == _ASKED_ON_WEEKDAY:
            tag_to_nr_qns_on_wday[tag] += 1
        elif qoi_to_wday_wend[question_id] == _ASKED_ON_WEEKEND:
            tag_to_nr_qns_on_wend[tag] += 1
In [5]:
# Get top 20 tags with higher relative frequency of questions asked during weekdays

_INF = 10**9
wday_rel_freq_list = []
wend_rel_freq_list = []
total_nr_wday_qns = sum(tag_to_nr_qns_on_wday.values())
total_nr_wend_qns = sum(tag_to_nr_qns_on_wend.values())
for tag in over20k_tags_to_questions:
    wday_nr_qns = tag_to_nr_qns_on_wday[tag]
    wend_nr_qns = tag_to_nr_qns_on_wend[tag]
    wday_freq = wday_nr_qns / total_nr_wday_qns
    wend_freq = wend_nr_qns / total_nr_wend_qns
    if wend_freq == 0:
        wday_rel_freq_list.append((tag, _INF,))
    else:
        wday_rel_freq_list.append((tag, wday_freq / wend_freq,))
    if wday_freq == 0:
        wend_rel_freq_list.append((tag, _INF,))
    else:
        wend_rel_freq_list.append((tag, wend_freq / wday_freq,))

wday_rel_freq_list.sort(key=lambda t: t[1])
print(wday_rel_freq_list[-20:])

wend_rel_freq_list.sort(key=lambda t: t[1])
print(wend_rel_freq_list[-20:])
[('selenium', 1.6212588979148064), ('wcf', 1.6553375679897844), ('excel', 1.7083207272949648), ('internet-explorer', 1.7140582549884504), ('oracle', 1.7265489637766018), ('excel-vba', 1.7414659008716018), ('svn', 1.7510314654119836), ('sql-server-2008', 1.7575099770444054), ('selenium-webdriver', 1.7661248332256643), ('iis', 1.774721760837079), ('vba', 1.8236177815794705), ('xslt', 1.8283513564212965), ('soap', 1.8400984372730709), ('extjs', 1.8717772104551975), ('powershell', 2.0238479773037628), ('tsql', 2.074704473127169), ('sql-server-2005', 2.1723285999710455), ('jenkins', 2.4758333202566383), ('sharepoint', 2.8651044383099826), ('reporting-services', 3.6477980122694946)]
[('parse.com', 1.3503814514822798), ('math', 1.3551432364197442), ('class', 1.3594760064146303), ('arraylist', 1.3600894357955087), ('firebase', 1.362822521786504), ('recursion', 1.3714568879068183), ('graphics', 1.377539748907886), ('random', 1.3820943836369635), ('methods', 1.3824713337624932), ('data-structures', 1.3913036081744454), ('python-3.x', 1.3983445990946155), ('swing', 1.400912744220479), ('google-chrome-extension', 1.4026948786028015), ('c', 1.4271826837540982), ('mysqli', 1.4483936702618714), ('pointers', 1.5043601718360466), ('algorithm', 1.5074228566575227), ('opengl', 1.604622444469804), ('assembly', 1.663642477327844), ('haskell', 1.683808410440941)]
In [6]:
wday_df = pd.DataFrame(
    list(
        map(
            lambda t: [t[0], float(t[1]) * 100],
            wday_rel_freq_list[-20:][::-1]
        )
    ),
    columns=["tag", "relative_frequency"],
)
wend_df = pd.DataFrame(
    list(
        map(
            lambda t: [t[0], float(t[1]) * 100],
            wend_rel_freq_list[-20:][::-1]
        )
    ),
    columns=["tag", "relative_frequency"],
)
fig, (ax1, ax2,) = plt.subplots(ncols=2, figsize=(15,15,))
sns.barplot(
    x="relative_frequency",
    y="tag", data=wday_df,
    color=sns.xkcd_rgb["coral"],
    ax=ax1,
)
sns.barplot(
    x="relative_frequency",
    y="tag",
    data=wend_df,
    color=sns.xkcd_rgb["teal"],
    ax=ax2,
)
plt.subplots_adjust(wspace=0.2)
plt.xticks([0, 50, 100, 150, 200, 250, 300, 350, 400,])
plt.subplots_adjust(top=0.9)
fig.suptitle(
    "Which tags have the biggest weekday/weekend differences?\nFor tags with more than 20,000 questions",
    size=20,
)
ax1.set_title("Weekdays")
ax1.set_xlabel("")
ax1.set_ylabel("")
ax2.set_title("Weekends")
ax2.set_xlabel("")
ax2.set_ylabel("")
fig.text(0.5, 0.09, "Relative frequency", ha="center", size=18,);

The next part deals with the tags with the biggest decrease in weekend activity. Here, decrease means the change from 2008 to 2016 (ignoring trends in between).

In [7]:
qns_with_some_tag_over_20k_qns__to__tags = defaultdict(set)
for tag, questions_set in over20k_tags_to_questions.items():
    for question_id in questions_set:
        qns_with_some_tag_over_20k_qns__to__tags[question_id].add(tag)

def _add_to_relevant_count(
        tags_activity_dict,
        question_id,
        creation_date
    ):
    for tag in qns_with_some_tag_over_20k_qns__to__tags[question_id]:
        if tag not in tags_activity_dict:
            tags_activity_dict[tag] = (0, 0,)
        wday_cnt, wend_cnt = tags_activity_dict[tag]
        if _is_weekday(creation_date):
            tags_activity_dict[tag] = (wday_cnt + 1, wend_cnt,)
        else:
            tags_activity_dict[tag] = (wday_cnt, wend_cnt + 1,)

tags_with_over20k_qns_2008_activity = {}
tags_with_over20k_qns_2016_activity = {}
with open("questions.csv", "r") as f:
    # skip header
    f.readline()
    for line in f:
        str_question_id, creation_date, _, _, _, _, _ = line.strip().split(",")
        question_id = int(str_question_id)
        if question_id in qns_with_some_tag_over_20k_qns__to__tags:
            str_year_of_creation = creation_date[:4]
            if str_year_of_creation == "2008":
                _add_to_relevant_count(
                    tags_with_over20k_qns_2008_activity,
                    question_id,
                    creation_date
                )
            elif str_year_of_creation == "2016":
                _add_to_relevant_count(
                    tags_with_over20k_qns_2016_activity,
                    question_id,
                    creation_date
                )

And... in the above step, my computer ran out of memory. Lol. Time to spin up an EC2 instance.

In [30]:
tags_wend_over_wday_ratio = []
tags_with_zero_count = 0
for tag in tags_with_over20k_qns_2008_activity:
    if tag in tags_with_over20k_qns_2016_activity:
        wday_2008, wend_2008 = tags_with_over20k_qns_2008_activity[tag]
        wday_2016, wend_2016 = tags_with_over20k_qns_2016_activity[tag]
        # Let's ignore tags that have 0 counts, because one of the ratios will be
        # infinity
        if wday_2008 != 0 and wend_2008 != 0 and wday_2016 != 0 and wend_2016 != 0:
            tags_wend_over_wday_ratio.append(
                (tag, wend_2008 / wday_2008, wend_2016 / wday_2016,)
            )
tags_wend_over_wday_ratio.sort(
    key=lambda t: t[1] - t[2]
)
print(tags_wend_over_wday_ratio[-10:])
print(sorted(tags_wend_over_wday_ratio, key=lambda t: t[2] - t[1])[-10:])
#print(tags_with_over20k_qns_2008_activity)
print(tags_with_over20k_qns_2008_activity["scala"])
print(tags_with_over20k_qns_2016_activity["scala"])
print(sorted(tags_wend_over_wday_ratio, key=lambda t: t[2])[-10:])
[('canvas', 0.4444444444444444, 0.25225225225225223), ('grails', 0.3442622950819672, 0.14285714285714285), ('numpy', 0.5, 0.24088171798659244), ('magento', 0.5, 0.13784037558685447), ('twitter', 0.625, 0.24564134495641346), ('html5', 0.6666666666666666, 0.22732272069464543), ('multidimensional-array', 0.75, 0.27277716794731066), ('web', 1.0, 0.26509023024268824), ('python-2.7', 1.0, 0.2331143105614069), ('razor', 1.0, 0.15828285488543586)]
[('file-io', 0.12857142857142856, 0.28909090909090907), ('inheritance', 0.09836065573770492, 0.26060452120904243), ('nginx', 0.058823529411764705, 0.22553614543388723), ('opengl', 0.18032786885245902, 0.3578835416112736), ('login', 0.04838709677419355, 0.24481910783280647), ('uitableview', 0.04, 0.25097198015819816), ('swing', 0.12690355329949238, 0.34034754276405105), ('methods', 0.1, 0.31420137750081994), ('javafx', 0.07142857142857142, 0.3006620377845955), ('arraylist', 0.043478260869565216, 0.30917874396135264)]
(21, 6)
(15005, 2988)
[('random', 0.16326530612244897, 0.31997571341833636), ('c', 0.21157495256166983, 0.3249356802608452), ('data-structures', 0.20625, 0.3303324099722992), ('algorithm', 0.18421052631578946, 0.34018177652180553), ('swing', 0.12690355329949238, 0.34034754276405105), ('pointers', 0.25, 0.34140249759846303), ('graphics', 0.20765027322404372, 0.3479816044966786), ('opengl', 0.18032786885245902, 0.3578835416112736), ('haskell', 0.22077922077922077, 0.36882393876130826), ('assembly', 0.32051282051282054, 0.3778652465848576)]

Looks very different from what's in Julia Silge's post. In fact, upon reading the R code in her Kaggle kernel, I found that our approach is completely different. Let's redo everything.

Redo

Turns out that in Julia Silge's kernel, she first removes all the questions that have been deleted, i.e., she keeps only the questions whose DeletionDate is NA. Let's do that and drop the unnecessary columns.

In [2]:
questions = pd.read_csv("questions.csv")
questions = questions.loc[questions["DeletionDate"].isnull(), :]
questions.pop("DeletionDate")
questions.pop("ClosedDate")
questions.pop("Score")
questions.pop("OwnerUserId")
questions.pop("AnswerCount")
questions.head()
Out[2]:
Id CreationDate
1 4 2008-07-31T21:42:52Z
2 6 2008-07-31T22:08:08Z
4 9 2008-07-31T23:40:59Z
5 11 2008-07-31T23:55:37Z
6 13 2008-08-01T00:42:38Z

Now we load all the tags for the questions, but only for those questions which have not been deleted. We do this by using the merge method of pandas.DataFrame.

In [3]:
question_tags_clean = pd.read_csv("question_tags_clean.csv")
question_tags_clean = question_tags_clean.merge(
    questions,
    on="Id",
    how="inner",
)
question_tags_clean.pop("CreationDate");

We are only interested in tags which have over 20,000 questions. Time to do some filtering.

In [4]:
tags_with_counts = question_tags_clean.groupby("Tag").count()
tags_with_counts = tags_with_counts.reset_index()
tags_with_counts.columns = ["Tag", "Count"]
tags_with_counts = tags_with_counts[tags_with_counts.Count > 20000]
In [5]:
question_tags_clean = question_tags_clean.merge(
    tags_with_counts,
    on="Tag",
    how="inner",
)
question_tags_clean.pop("Count")

del tags_with_counts

Let's obtain the set of questions with some tag that has over 20,000 questions.

In [6]:
questions_with_tag_over_20k = questions.merge(
    pd.DataFrame({"Id": question_tags_clean["Id"].unique(),}),
    on="Id",
    how="inner",
)
In [7]:
questions_with_tag_over_20k["Weekday"] = \
    questions_with_tag_over_20k["CreationDate"].apply(_is_weekday)
In [8]:
tag_wday_counts = question_tags_clean.merge(
    questions_with_tag_over_20k,
    on="Id",
    how="inner"
).groupby(["Tag", "Weekday"]).count()
In [9]:
tag_wday_counts.head()
Out[9]:
Id CreationDate
Tag Weekday
.htaccess False 10609 10609
True 44636 44636
.net False 32613 32613
True 214170 214170
actionscript-3 False 7179 7179

Let's get rid of the hierarchical index formed by groupby.

In [10]:
tag_wday_counts = tag_wday_counts.reset_index()
tag_wday_counts.head()
Out[10]:
Tag Weekday Id CreationDate
0 .htaccess False 10609 10609
1 .htaccess True 44636 44636
2 .net False 32613 32613
3 .net True 214170 214170
4 actionscript-3 False 7179 7179

The CreationDate column is redundant. Let's remove it.

In [11]:
tag_wday_counts.pop("CreationDate")
tag_wday_counts.columns = ["Tag", "Weekday", "Count"]
tag_wday_counts.head()
Out[11]:
Tag Weekday Count
0 .htaccess False 10609
1 .htaccess True 44636
2 .net False 32613
3 .net True 214170
4 actionscript-3 False 7179

We want something like spread in tidyr. I saw on https://chrisalbon.com/python/pandas_long_to_wide.html that we can use pivot.
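To see what pivot does, here is a toy sketch with made-up data, analogous to spread (long format in, wide format out):

# Long format: one row per (Tag, Weekday) pair with its count...
toy = pd.DataFrame({
    "Tag": ["python", "python", "sql", "sql"],
    "Weekday": [False, True, False, True],
    "Count": [10, 40, 5, 45],
})
# ...pivoted to wide format: one row per Tag, one column per Weekday value.
print(toy.pivot(index="Tag", columns="Weekday", values="Count"))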

In [12]:
tag_wday_counts = tag_wday_counts.pivot(
    index="Tag",
    columns="Weekday",
    values="Count"
)
In [13]:
tag_wday_counts.head()
Out[13]:
Weekday False True
Tag
.htaccess 10609 44636
.net 32613 214170
actionscript-3 7179 33219
activerecord 4093 18690
ajax 28791 133425

Let's remove the hierarchical index using reset_index.

In [14]:
tag_wday_counts = tag_wday_counts.reset_index()
tag_wday_counts.head()
Out[14]:
Weekday Tag False True
0 .htaccess 10609 44636
1 .net 32613 214170
2 actionscript-3 7179 33219
3 activerecord 4093 18690
4 ajax 28791 133425

Let's rename the columns:

In [15]:
tag_wday_counts.columns=["Tag", "Weekend", "Weekday"]
tag_wday_counts.head()
Out[15]:
Tag Weekend Weekday
0 .htaccess 10609 44636
1 .net 32613 214170
2 actionscript-3 7179 33219
3 activerecord 4093 18690
4 ajax 28791 133425

To obtain the rate of questions asked during weekends and weekdays for each tag, we have to calculate the total number of questions asked during weekdays and weekends.

In [16]:
questions["Weekday"] = questions["CreationDate"].apply(_is_weekday)
x = questions.groupby("Weekday").count()
x.head()
Out[16]:
Id CreationDate
Weekday
False 2250067 2250067
True 10990496 10990496
In [17]:
nr_wday_qns = x.loc[True, "Id"]
nr_wend_qns = x.loc[False, "Id"]
In [18]:
x = tag_wday_counts.copy()
x["Weekday"] /= nr_wday_qns
x["Weekend"] /= nr_wend_qns
In [19]:
x[x["Weekday"] == 0].shape
Out[19]:
(0, 3)
In [20]:
x["WeekendOverWeekday"] = x["Weekend"] / x["Weekday"]
In [21]:
x = x.sort_values(by="WeekendOverWeekday", ascending=False,)
x[:16]
Out[21]:
Tag Weekend Weekday WeekendOverWeekday
98 haskell 0.003667 0.002094 1.751218
26 assembly 0.002727 0.001587 1.718643
161 opengl 0.003114 0.001893 1.644761
171 pointers 0.003611 0.002361 1.529601
5 algorithm 0.007548 0.005048 1.495337
35 c 0.024511 0.016914 1.449149
177 python-3.x 0.005104 0.003557 1.434771
217 swing 0.006523 0.004567 1.428346
180 random 0.002008 0.001410 1.424809
183 recursion 0.002406 0.001713 1.404361
19 arraylist 0.002059 0.001486 1.385408
44 class 0.004584 0.003317 1.381949
237 vector 0.001995 0.001449 1.376647
137 math 0.002527 0.001871 1.350148
99 heroku 0.002220 0.001647 1.348040
39 c++11 0.003423 0.002540 1.348003

These are exactly the same tags as in the original post, and in the same order. Now let's look at the tags with the highest weekday-to-weekend activity.

In [22]:
x[-16:][::-1]
Out[22]:
Tag Weekend Weekday WeekendOverWeekday
199 sharepoint 0.000656 0.001825 0.359364
119 jenkins 0.000696 0.001690 0.412106
223 tsql 0.001760 0.003560 0.494494
174 powershell 0.001753 0.003463 0.506278
202 soap 0.000938 0.001722 0.544416
236 vba 0.003418 0.006181 0.552990
76 extjs 0.000967 0.001748 0.553204
209 sql-server-2008 0.002265 0.003982 0.568767
257 xslt 0.001255 0.002195 0.571494
106 iis 0.001230 0.002144 0.573637
163 oracle 0.003731 0.006498 0.574247
73 excel-vba 0.002684 0.004599 0.583745
72 excel 0.005340 0.009138 0.584368
215 svn 0.001171 0.001968 0.595177
111 internet-explorer 0.001700 0.002817 0.603327
208 sql-server 0.009528 0.015332 0.621409

Apart from the WeekendOverWeekday column (these rows are sorted by weekday dominance, so take the reciprocal of each value to get the weekday-over-weekend ratios the original post reports), the tags are the same as those in the original post and in the same order.

Now let's plot the bar charts.

In [23]:
highest_wend_plot_df = x[:16].loc[:, ["Tag", "WeekendOverWeekday"]]
highest_wend_plot_df["WeekendOverWeekday"] *= 100

highest_wday_plot_df = x[-16:].loc[:, ["Tag", "WeekendOverWeekday"]][::-1]
highest_wday_plot_df["WeekendOverWeekday"] = 1 / highest_wday_plot_df["WeekendOverWeekday"] * 100
In [24]:
fig, (ax1, ax2,) = plt.subplots(ncols=2, figsize=(12,12,))
sns.barplot(
    x="WeekendOverWeekday",
    y="Tag",
    data=highest_wday_plot_df,
    color=sns.xkcd_rgb["coral"],
    ax=ax1,
)
sns.barplot(
    x="WeekendOverWeekday",
    y="Tag",
    data=highest_wend_plot_df,
    color=sns.xkcd_rgb["teal"],
    ax=ax2,
)
plt.subplots_adjust(wspace=0.2)
plt.xticks([0, 50, 100, 150, 200, 250, 300,])
plt.subplots_adjust(top=0.9)
fig.suptitle(
    "Which tags have the biggest weekday/weekend differences?\nFor tags with more than 20,000 questions",
    size=20,
)
ax1.set_title("Weekdays")
ax1.set_xlabel("")
ax1.set_ylabel("")
ax2.set_title("Weekends")
ax2.set_xlabel("")
ax2.set_ylabel("")
fig.text(0.5, 0.07, "Relative frequency", ha="center", size=18,);

Nice. So that's one mystery solved. Now comes the harder part.

For the next part, we will need to compute, for each tag with over 20,000 questions:

  1. The total number of questions with that tag (for each year)
  2. The total number of questions with that tag that were asked during weekends (for each year)

And for each year, we need to compute the number of questions asked during weekends and weekdays. Let's do this first.

In [25]:
# Add year information to `questions`
questions["Year"] = questions["CreationDate"].apply(
    lambda datestring: datestring[:4]
)
In [26]:
year_wday_counts = questions.copy()
year_wday_counts = year_wday_counts.groupby(
    ["Year", "Weekday"]
).count()
year_wday_counts = year_wday_counts.reset_index()
year_wday_counts.pop("Id")
year_wday_counts.rename(
    columns={
        "CreationDate": "Total",
    },
    inplace=True,
)
year_wday_counts.head()
Out[26]:
Year Weekday Total
0 2008 False 7506
1 2008 True 51254
2 2009 False 51778
3 2009 True 293509
4 2010 False 114376

Seems correct enough to me. Moving on.

Earlier we computed:

questions_with_tag_over_20k: the set of all questions with some tag that has over 20,000 questions

tag_wday_counts: the set of all tags where each tag has over 20,000 questions along with the breakdown of the number of questions asked during weekdays and weekends

question_tags_clean: the cleaned data originally from question_tags.csv

Let's take a look at them.

In [27]:
questions_with_tag_over_20k.head()
Out[27]:
Id CreationDate Weekday
0 4 2008-07-31T21:42:52Z True
1 6 2008-07-31T22:08:08Z True
2 9 2008-07-31T23:40:59Z True
3 11 2008-07-31T23:55:37Z True
4 13 2008-08-01T00:42:38Z True
In [28]:
tag_wday_counts.head()
Out[28]:
Tag Weekend Weekday
0 .htaccess 10609 44636
1 .net 32613 214170
2 actionscript-3 7179 33219
3 activerecord 4093 18690
4 ajax 28791 133425
In [29]:
question_tags_clean.head()
Out[29]:
Id Tag
0 4 winforms
1 482 winforms
2 1037 winforms
3 2770 winforms
4 2804 winforms

Time to add year information to questions_with_tag_over_20k.

In [30]:
questions_with_tag_over_20k["Year"] = questions_with_tag_over_20k["CreationDate"].apply(
    lambda datestring: datestring[:4]
)

Now let's break down the tags by year and by weekday/weekend counts.

In [31]:
tag_over_20k_year_wend_counts = questions_with_tag_over_20k.merge(
    question_tags_clean,
    how="left",
    on="Id",
).groupby(["Year", "Tag", "Weekday"]).count()
In [32]:
tag_over_20k_year_wend_counts.head()
Out[32]:
Id CreationDate
Year Tag Weekday
2008 .htaccess False 5 5
True 49 49
.net False 719 719
True 5202 5202
actionscript-3 False 32 32
In [33]:
tag_over_20k_year_wend_counts.pop("CreationDate")
tag_over_20k_year_wend_counts.rename(
    columns={"Id": "WeekendTotal"},
    inplace=True,
)
In [34]:
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts.reset_index()
In [35]:
tag_over_20k_year_wend_counts.head()
Out[35]:
Year Tag Weekday WeekendTotal
0 2008 .htaccess False 5
1 2008 .htaccess True 49
2 2008 .net False 719
3 2008 .net True 5202
4 2008 actionscript-3 False 32

My pandas skills aren't so awesome, so I'll extract the weekday counts from this dataframe, drop those rows, then do a merge to reattach the weekday counts. (A simpler alternative is sketched after the output below.)

In [36]:
weekday_counts = tag_over_20k_year_wend_counts.loc[
    tag_over_20k_year_wend_counts["Weekday"] == True
]
In [37]:
weekday_counts.head()
Out[37]:
Year Tag Weekday WeekendTotal
1 2008 .htaccess True 49
3 2008 .net True 5202
5 2008 actionscript-3 True 276
7 2008 activerecord True 77
9 2008 ajax True 525
In [38]:
# Drop weekday counts
tag_over_20k_year_wend_counts.drop(
    tag_over_20k_year_wend_counts[
        tag_over_20k_year_wend_counts["Weekday"] == True
    ].index,
    inplace=True,
)
In [39]:
tag_over_20k_year_wend_counts.head()
Out[39]:
Year Tag Weekday WeekendTotal
0 2008 .htaccess False 5
2 2008 .net False 719
4 2008 actionscript-3 False 32
6 2008 activerecord False 7
8 2008 ajax False 74
In [40]:
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts.merge(
    weekday_counts,
    how="inner",
    on=["Year", "Tag",],
    suffixes=("_l", "_r",),
)
tag_over_20k_year_wend_counts.head()
Out[40]:
Year Tag Weekday_l WeekendTotal_l Weekday_r WeekendTotal_r
0 2008 .htaccess False 5 True 49
1 2008 .net False 719 True 5202
2 2008 actionscript-3 False 32 True 276
3 2008 activerecord False 7 True 77
4 2008 ajax False 74 True 525
In [41]:
tag_over_20k_year_wend_counts.pop("Weekday_l")
tag_over_20k_year_wend_counts.pop("Weekday_r")
tag_over_20k_year_wend_counts.rename(
    columns={
        "WeekendTotal_l": "WeekendTotal",
        "WeekendTotal_r": "WeekdayTotal",
    },
    inplace=True,
)
tag_over_20k_year_wend_counts.head()
Out[41]:
Year Tag WeekendTotal WeekdayTotal
0 2008 .htaccess 5 49
1 2008 .net 719 5202
2 2008 actionscript-3 32 276
3 2008 activerecord 7 77
4 2008 ajax 74 525
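In hindsight, the extract/drop/merge dance above could probably be replaced by a single unstack of the Weekday level of the grouped frame. A minimal sketch, not what was actually run here:

# Group, then turn the Weekday index level into columns in one step.
alt = (
    questions_with_tag_over_20k.merge(question_tags_clean, how="left", on="Id")
    .groupby(["Year", "Tag", "Weekday"])["Id"].count()
    .unstack("Weekday")
    .reset_index()
    .rename(columns={False: "WeekendTotal", True: "WeekdayTotal"})
)
# Caveat: (Year, Tag) pairs missing one of the two counts become NaN here,
# whereas the inner merge above drops them.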

Nice. Now we almost have what we want in tag_over_20k_year_wend_counts and year_wday_counts.

But we need to drop tags whose combined (weekend + weekday) count in any year of occurrence is 20 or below.

A small experiment below shows that the Year column is not of integer type.

In [42]:
tag_over_20k_year_wend_counts[
    (tag_over_20k_year_wend_counts["Year"] == 2008) &
    (tag_over_20k_year_wend_counts["Tag"] == ".htaccess")
]
Out[42]:
Year Tag WeekendTotal WeekdayTotal
In [43]:
tag_over_20k_year_wend_counts.dtypes
Out[43]:
Year            object
Tag             object
WeekendTotal     int64
WeekdayTotal     int64
dtype: object

Let's convert Year to integer.

In [44]:
tag_over_20k_year_wend_counts["Year"] = tag_over_20k_year_wend_counts["Year"].astype(int)
In [45]:
tag_over_20k_year_wend_counts[
    (tag_over_20k_year_wend_counts["Year"] == 2008) &
    (tag_over_20k_year_wend_counts["Tag"] == ".htaccess")
]
Out[45]:
Year Tag WeekendTotal WeekdayTotal
0 2008 .htaccess 5 49

Success!

In [46]:
tags_to_remove = set()
for tag in tag_over_20k_year_wend_counts["Tag"].unique():
    for year in range(2008, 2016 + 1):
        tag_year_df = tag_over_20k_year_wend_counts[
            (tag_over_20k_year_wend_counts["Year"] == year) &
            (tag_over_20k_year_wend_counts["Tag"] == tag)
        ]
        if tag_year_df.shape[0] == 1:
            if tag_year_df.iloc[0]["WeekendTotal"] + tag_year_df.iloc[0]["WeekdayTotal"] <= 20:
                tags_to_remove.add(tag)
                break
print(tags_to_remove)
{'jenkins', 'python-2.7', 'swift', 'maven', 'express', 'symfony2', 'angularjs', 'pandas', 'hadoop', 'css3', 'razor', 'matrix', 'html5', 'numpy', 'unity3d', 'asp.net-web-api', 'asp.net-mvc-4', 'github', 'cordova', 'nginx', 'android-layout', 'web', 'opencv', 'c#-4.0', 'r', 'ruby-on-rails-4', 'core-data', 'azure', 'c++11', 'spring-mvc', 'if-statement', 'node.js'}
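The nested filtering above is quite slow because it re-scans the dataframe for every tag/year pair. A vectorized sketch that should produce the same set, assuming the same 2008-2016 range as the loop above:

# Per-tag minimum of the yearly (weekend + weekday) totals, restricted to 2008-2016;
# tags whose minimum is <= 20 are the ones to drop.
yearly_totals = (
    tag_over_20k_year_wend_counts[tag_over_20k_year_wend_counts["Year"] <= 2016]
    .assign(YearTotal=lambda d: d["WeekendTotal"] + d["WeekdayTotal"])
    .groupby("Tag")["YearTotal"]
    .min()
)
tags_to_remove_alt = set(yearly_totals[yearly_totals <= 20].index)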

Now we remove those tags whose total number of posts in any year is <= 20:

In [47]:
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts[
    ~tag_over_20k_year_wend_counts["Tag"].isin(tags_to_remove)
]

We'll also delete data for the year 2017.

In [48]:
tag_over_20k_year_wend_counts = tag_over_20k_year_wend_counts[
    tag_over_20k_year_wend_counts["Year"] != 2017
]

For the next part, we'll extract the Year, WeekendTotal and WeekdayTotal data for each tag, then perform the "modeling" (which is totally non-obvious; we'll go through it in some detail below) to find the tags whose weekend proportion has changed the most over the years. To do this, we make use of the statsmodels package.

In [49]:
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.sandbox.stats.multicomp as sm_multicomp
In [50]:
unique_tags = tag_over_20k_year_wend_counts["Tag"].unique()
trend_data_for_each_tag = {}
for tag in unique_tags:
    trend_data_for_each_tag[tag] = tag_over_20k_year_wend_counts[
        tag_over_20k_year_wend_counts["Tag"] == tag
    ]

So this is part of the "modeling" Julia Silge mentioned in one sentence of her blog post. It is almost the equivalent of the glm(cbind(nn, YearTagTotal) ~ Year, ., family="binomial") in her R kernel, except that I believe there is a bug: the 2nd element in the cbind should be the number of questions with that tag asked on weekdays that year, not the total number of questions with that tag asked in the year.

According to the R documentation for glm:

 A typical predictor has the form ‘response ~ terms’ where
 ‘response’ is the (numeric) response vector and ‘terms’ is a
 series of terms which specifies a linear predictor for ‘response’.
 For ‘binomial’ and ‘quasibinomial’ families the response can also
 be specified as a ‘factor’ (when the first level denotes failure
 and all others success) or as a two-column matrix with the columns
 giving the numbers of successes and failures.  A terms
 specification of the form ‘first + second’ indicates all the terms
 in ‘first’ together with all the terms in ‘second’ with any
 duplicates removed.

We got a PerfectSeparationError during the model fitting.

In [51]:
trend_models = {}
for tag, df in trend_data_for_each_tag.items():
    try:
        trend_models[tag] = smf.glm(
            formula="WeekendTotal + WeekdayTotal ~ Year",
            data=df,
            family=sm.families.Binomial(),
        ).fit()
    except statsmodels.tools.sm_exceptions.PerfectSeparationError:
        print(tag)
angular2
In [52]:
trend_data_for_each_tag["angular2"]
Out[52]:
Year Tag WeekendTotal WeekdayTotal
1715 2015 angular2 218 864
1973 2016 angular2 4575 24302

I do not know the exact reason for this. In fact, I am not entirely sure what glm is doing in this case. It seems to be fitting a logistic regression, but to proportions rather than for classification. I couldn't find similar functionality in scikit-learn, so I had to use the statsmodels library.
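For what it's worth, my understanding (an assumption on my part, not something spelled out in the original post) is that the binomial GLM models the log-odds of a question being asked on a weekend as a linear function of Year, with each row contributing (WeekendTotal, WeekdayTotal) as (successes, failures). The formula call above should be equivalent to the array interface, sketched here for one tag:

# Equivalent fit via the array interface (a sketch; not what was run above).
df = trend_data_for_each_tag["scala"]
endog = df[["WeekendTotal", "WeekdayTotal"]]   # (successes, failures) counts
exog = sm.add_constant(df["Year"])             # intercept + Year
result = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
# params["Year"] is the yearly change in the log-odds of a weekend question.
print(result.params)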

The next part extracts the p-value of the Year coefficient from each model, adjusts the p-values for multiple comparisons, and then retains only those tags whose adjusted p-value is less than 0.1.

In [53]:
tags_with_year_pvalues = []
for tag, model in trend_models.items():
    tags_with_year_pvalues.append((tag, model.pvalues["Year"],))

adjusted_pvalues = sm_multicomp.multipletests(
    list(map(lambda t: t[1], tags_with_year_pvalues))
)[1]

non_fluke_tags = list(
    map(
        lambda t: t[0],
        filter(
            lambda t: t[1] < 0.1,
            list(
                zip(
                    map(lambda t: t[0], tags_with_year_pvalues),
                    adjusted_pvalues
                )
            )
        )
    )
)
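The map/filter/zip chain above is a bit dense; an equivalent, arguably more readable version would be (a sketch):

# Pair each tag with its adjusted p-value (multipletests' default correction)
# and keep the tags whose adjusted p-value is below 0.1.
tags = [t for t, _ in tags_with_year_pvalues]
pvalues = [p for _, p in tags_with_year_pvalues]
adjusted = sm_multicomp.multipletests(pvalues)[1]
non_fluke_tags_alt = [t for t, p in zip(tags, adjusted) if p < 0.1]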

Now we extract the 8 tags whose weekend activity relative to weekday activity has decreased the most, and the 8 tags whose weekend activity has increased the most.

In [54]:
tag_activity_trend = []
for tag, model in trend_models.items():
    if tag in non_fluke_tags:
        tag_activity_trend.append((tag, model.params["Year"],))
tag_activity_trend.sort(key=lambda t: t[1])
tags_most_decreased_wend_to_wday_activity = tag_activity_trend[:8]
tags_most_increased_wend_to_wday_activity = tag_activity_trend[-8:]
In [55]:
tags_most_decreased_wend_to_wday_activity
Out[55]:
[('asp.net-mvc-3', -0.098378606676268743),
 ('visual-studio-2012', -0.087244426169319045),
 ('ruby-on-rails-3', -0.07524473192117373),
 ('internet-explorer', -0.064543551021871454),
 ('scala', -0.062395774887527344),
 ('go', -0.054768205613629328),
 ('extjs', -0.051498290400011942),
 ('svn', -0.049350481499238874)]
In [56]:
tags_most_increased_wend_to_wday_activity
Out[56]:
[('jquery-mobile', 0.048516221470173815),
 ('swing', 0.050346843127050483),
 ('actionscript-3', 0.050591142495317057),
 ('listview', 0.054707989552278014),
 ('android-fragments', 0.056826010359758039),
 ('button', 0.057474225497405329),
 ('gridview', 0.063615078029569699),
 ('selenium', 0.078265141941906313)]

For the top 8 decreases, Julia Silge's analysis contains the tags azure and ruby-on-rails-4, whereas ours doesn't. Instead, we have go and extjs. Still pretty similar.

For the top 8 increases, Julia Silge's analysis contains the tags android-layout and unity3d, whereas ours doesn't. Instead, we have jquery-mobile and swing. Pretty similar too.

Let's plot the actual data for these tags.

In [57]:
# Turns out the `Year` column on `year_wday_counts` is not int.
# Let's convert it
year_wday_counts["Year"] = year_wday_counts["Year"].astype(int)

year_wday_qn_counts = year_wday_counts[year_wday_counts["Weekday"]].copy()
year_wday_qn_counts.pop("Weekday")

year_wend_qn_counts = year_wday_counts[year_wday_counts["Weekday"] != True].copy()
year_wend_qn_counts.pop("Weekday");
In [58]:
fig, ax = plt.subplots(figsize=(12,12,))
trend_plot_colors = [
    sns.xkcd_rgb["salmon"],
    sns.xkcd_rgb["tan"],
    sns.xkcd_rgb["dark lime"],
    sns.xkcd_rgb["greenish teal"],
    sns.xkcd_rgb["brown red"],
    sns.xkcd_rgb["azure"],
    sns.xkcd_rgb["light purple"],
    sns.xkcd_rgb["pink"],
]
idx = 0
for tag, _ in tags_most_decreased_wend_to_wday_activity:
    df = trend_data_for_each_tag[tag].copy()
    df = df.merge(
        year_wend_qn_counts,
        on="Year",
        how="left",
    )
    df["WeekendTotal"] /= df["Total"]
    df.pop("Total")
    df = df.merge(
        year_wday_qn_counts,
        on="Year",
        how="left",
    )
    df["WeekdayTotal"] /= df["Total"]
    df.pop("Total")
    df["Ratio"] = df["WeekendTotal"] / df["WeekdayTotal"]
    ax.plot(
        df["Year"],
        df["Ratio"],
        label=tag,
        linewidth=5,
        color=trend_plot_colors[idx],
    )
    idx += 1
leg = ax.legend(fontsize=15,)
leg.set_title("Tag", prop={"size": 18,})
ax.set_ylabel("Relative weekend/weekday use", fontsize=15, labelpad=20,)
ax.set_title(
    "Which tags' weekend activity has decreased the most?",
    fontsize=18,
    fontweight="bold",
)
ttl = ax.title
ttl.set_position([0.5, 1.02])

Of note in this plot:

  • Scala's relative weekend-over-weekday usage declined sharply from 2008 to 2009. From 2011 onwards there is an almost steady downward trend towards 1.
  • Golang has also seen a steady decrease in relative weekend-over-weekday activity from 2011 to 2016.
  • There are 3 Microsoft technologies on this list: asp.net-mvc-3, visual-studio-2012 and internet-explorer. Very curious.

Time to do the same plot but for tags whose weekend activity have increased the most.

In [59]:
fig, ax = plt.subplots(figsize=(12,12,))
idx = 0
for tag, _ in tags_most_increased_wend_to_wday_activity:
    df = trend_data_for_each_tag[tag].copy()
    df = df.merge(
        year_wend_qn_counts,
        on="Year",
        how="left",
    )
    df["WeekendTotal"] /= df["Total"]
    df.pop("Total")
    df = df.merge(
        year_wday_qn_counts,
        on="Year",
        how="left",
    )
    df["WeekdayTotal"] /= df["Total"]
    df.pop("Total")
    df["Ratio"] = df["WeekendTotal"] / df["WeekdayTotal"]
    ax.plot(
        df["Year"],
        df["Ratio"],
        label=tag,
        linewidth=5,
        color=trend_plot_colors[idx],
    )
    idx += 1
leg = ax.legend(fontsize=15,)
leg.set_title("Tag", prop={"size": 18,})
ax.set_ylabel("Relative weekend/weekday use", fontsize=15, labelpad=20,)
ax.set_title(
    "Which tags' weekend activity has increased the most?",
    fontsize=18,
    fontweight="bold",
)
ttl = ax.title
ttl.set_position([0.5, 1.02])

Of note in this plot:

  • A good number of the tags seem to be mobile-oriented: jquery-mobile, listview, android-fragments, button, gridview.
  • swing is seeing increasing weekend usage. Could it also be mobile-related?
  • selenium experienced a decline from 2008 to 2012, then saw some resurgence from 2012 to 2016.
  • gridview and selenium, despite being on this plot, are still weekday-dominant technologies.

Time for a final plot that shows all the tags with over 20,000 questions and their relative weekend / weekday activity, overall.

In [60]:
tag_wend_wday_rel_activity = tag_wday_counts.copy()
tag_wend_wday_rel_activity["Total"] = \
    tag_wend_wday_rel_activity["Weekday"] + tag_wend_wday_rel_activity["Weekend"]
tag_wend_wday_rel_activity["Weekend"] /= nr_wend_qns
tag_wend_wday_rel_activity["Weekday"] /= nr_wday_qns
tag_wend_wday_rel_activity["WeekendOverWeekday"] = \
    tag_wend_wday_rel_activity["Weekend"] / tag_wend_wday_rel_activity["Weekday"]
In [67]:
fig, ax = plt.subplots(figsize=(15, 15,))
ax.scatter(
    tag_wend_wday_rel_activity["Total"],
    tag_wend_wday_rel_activity["WeekendOverWeekday"],
)

for tag, coord in zip(
        tag_wend_wday_rel_activity["Tag"],
        zip(
            tag_wend_wday_rel_activity["Total"],
            tag_wend_wday_rel_activity["WeekendOverWeekday"]
        )
    ):
    ax.annotate(tag, coord)

plt.axhline(y=1, linewidth=4, color="red", linestyle="dashed",)
ax.set_xscale("log")
ax.set_xlabel("Total # of questions", fontsize=15, labelpad=20,)
ax.set_ylabel("Relative use on weekends vs. weekdays", fontsize=15,)
ax.set_xticks([1e4, 1e5, 1e6])
ax.set_xticklabels(["1e+04", "1e+05", "1e+06",])
ax.set_yticks([0.5, 1, 2])
ax.set_yticklabels(["1/2 on Weekends", "Same", "2x on Weekends"])
fig.suptitle(
    "Which tags have the biggest weekend/weekday differences?",
    fontsize=20,
    fontweight="bold",
    x=0.44,
)
ttl = ax.set_title(
    "For tags with more than 20,000 questions",
    loc="left",
    fontsize=15,
)
ttl.set_y(1.01)
plt.subplots_adjust(top=0.93)

Imho not a very useful plot because of the clustering between 10,000 and 100,000 questions. It is probably best for identifying "outliers": we can easily spot the very popular technologies on the right side, as well as technologies that are strongly weekend- or weekday-dominant.

Conclusion

This was pretty instructive. Initially we didn't take into account that we should exclude deleted questions. Moreover, we tried hard not to load everything into memory but ironically ended up using even more memory. Lol. I am very thankful for AWS. The relatively small size of this data set also means that an m4.xlarge instance with 16GB of memory is sufficient for our purposes, and the bill is pretty small.

What we've picked up from this

  1. Pandas. I believe I am much better at using it now. But it is quite a beast and I don't know if it has a full equivalent of spread in R's tidyr package.
  2. Reading R code. Imho R can be quite a horrible language with confusing documentation meant for pros. But I have to admit that it has some really neat dataframe manipulation functionality, both built-in and from libraries.
  3. Simplifying some of the excessive data computation and storage done in Julia Silge's original blog post. This was done by going through her code, seeing what is actually needed, then making the necessary adjustments.
  4. matplotlib. More experience making plots now.

The downside is that there is almost no machine learning involved, which is the area I want to get the most hands-on experience in.

Follow up questions / items

  1. What algorithm is the glm(cbind(nn, YearTagTotal) ~ Year, data, family="binomial") call using? And what do the p-values represent?
  2. During the process, we created a lot of pandas dataframes but they're like temporary database tables that are very hard to keep track of and for which we have to do a .head() call or print them out to get the schema. Is there a better way to manage these dataframes?

Disclaimer: Opinions expressed on this blog are solely my own and do not express the views or opinions of my employer(s), past or present.
