Interning as a Data Scientist at ParentPaperwork

Juan Daza

University of Melbourne, Masters of Information Technology student Juan Daza shares his experience interning as a Data Scientist at ParentPaperwork across the Australian summer.

The University of Melbourne currently runs the Tin Alley internship program, where students have the opportunity to apply to great companies in Victoria. Tin Alley has been named Australia’s best internship program and I was honoured to be part of this summer’s intake. One of the companies that gave students the opportunity to get hands-on experience was ParentPaperwork, where I have been working with the team in the role of Data Scientist.

In this blog post, I want to take you through part of my journey and ParentPaperwork’s goal to empower hundreds of schools with tools to make better-informed decisions and, more importantly, reach a 100% response rate from parents to slips.

I believe that the first step towards making informed decisions is to have the facts and information at hand, which is why one of my main tasks was to assist in the development of the real-time analytical dashboard, with key metrics that help schools understand their patterns of use and behaviours.

Data visualisation is a powerful tool in any Data Scientist’s toolbox, so I set out to explore the data using Python and SQL and, through multiple iterations, develop simple yet powerful and self-explanatory visualisations.

I found key metrics that would allow schools to understand their current situation and trigger their curiosity. In conjunction with the team at ParentPaperwork, I defined the following metrics:

    • Average response rate (AVR): The share of parents who respond to the slips sent to them.
    • Average response time (AVT): The average time it takes parents to respond to the slips they receive.
    • Slips returned by due date: The percentage of slips that are returned before the due date.
    • Timely parents: The number of parents who respond before the slip’s due date.
    • Total number of forms and broadcasts sent.

I was mindful of the importance of having the right data displayed at the right time, which is why I made sure that my data cleansing process was thorough.
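
As a rough illustration of how some of the metrics listed above could be computed, here is a minimal Python sketch using only the standard library. The record layout (`sent_at`, `due_at`, `responded_at`), the sample values, and the choice to express "returned by due date" as a percentage of all slips sent are assumptions made for this example, not ParentPaperwork's actual schema or definitions.

```python
from datetime import datetime
from statistics import mean


def hours_between(start, end):
    """Elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


def compute_metrics(slips):
    """Compute a few dashboard-style metrics from a list of slip records.

    Assumes at least one slip was sent and at least one was answered.
    """
    answered = [s for s in slips if s["responded_at"] is not None]
    on_time = [s for s in answered if s["responded_at"] <= s["due_at"]]
    return {
        # Share of sent slips that received a response.
        "response_rate": len(answered) / len(slips),
        # Arithmetic mean of the time to respond, answered slips only.
        "avg_response_hours": mean(
            hours_between(s["sent_at"], s["responded_at"]) for s in answered
        ),
        # Assumption: measured against all slips sent, not only answered ones.
        "returned_by_due_pct": 100 * len(on_time) / len(slips),
    }


# Invented sample data: four slips sent at the same time with the same due date.
sent = datetime(2024, 2, 1, 9)
due = datetime(2024, 2, 3, 9)
slips = [
    {"sent_at": sent, "due_at": due, "responded_at": datetime(2024, 2, 1, 11)},
    {"sent_at": sent, "due_at": due, "responded_at": datetime(2024, 2, 2, 9)},
    {"sent_at": sent, "due_at": due, "responded_at": datetime(2024, 2, 4, 9)},
    {"sent_at": sent, "due_at": due, "responded_at": None},
]

metrics = compute_metrics(slips)
```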

In many cases, slips are sent to both parents and only one parent responds. If I calculated the response rate directly from the raw data, the number would be biased, because each such slip would show up as one answered slip and one unanswered slip. I had to develop some cleansing techniques to deal with this and get a clean, deduplicated dataset that I could query freely. I tested some powerful Python functions to wrangle ParentPaperwork’s large dataset and arrive at the right dataset for what I wanted to do.
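
To make the bias concrete, here is one way such a cleansing step might look. This is a sketch under stated assumptions rather than the actual technique used: the row layout (slip, student, parent, responded) and the rule "a slip counts as answered for a student if any of that student's parents responded" are hypothetical.

```python
from collections import defaultdict

# Invented rows: one per (slip, student, parent) delivery.
deliveries = [
    ("slip-1", "student-A", "parent-1", True),
    ("slip-1", "student-A", "parent-2", False),
    ("slip-1", "student-B", "parent-3", False),
    ("slip-2", "student-A", "parent-1", True),
    ("slip-2", "student-A", "parent-2", True),
]


def collapse_by_student(rows):
    """Collapse deliveries to one entry per (slip, student).

    The entry is True if at least one parent of that student responded.
    """
    answered = defaultdict(bool)
    for slip_id, student_id, _parent_id, responded in rows:
        answered[(slip_id, student_id)] |= responded
    return dict(answered)


# Naive rate: every delivery counted separately (3 of 5 answered).
naive_rate = sum(row[3] for row in deliveries) / len(deliveries)

# Deduplicated rate: one entry per slip and student (2 of 3 answered).
collapsed = collapse_by_student(deliveries)
clean_rate = sum(collapsed.values()) / len(collapsed)
```

With this sample, the naive rate understates the response rate because the second parent's unanswered copy of `slip-1` is counted as a miss even though the slip was answered.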

I had to take additional considerations into account while analysing other metrics. For example, while analysing the average response time from our parents I found an interesting fact. My calculation of the arithmetic mean response time came to roughly 80 hours. This meant that, on average, a parent took around 80 hours, or a little over 3 days, to respond to a slip. This is an amazing turnaround, since paper-based forms have a turnaround of around two weeks or more.

However, once I drilled down into the data I found some interesting insights. To look further into this, I set out to understand parents’ behaviour and analysed how the response rate progresses over the first 30 hours. I thought that the best way to do this was to visualise how this metric behaves in the first hours after a slip is sent.

Figure 1. Completion Percentage per Time to Complete (reverse burndown chart)

We can clearly see how responses behave. At first, there are very timely parents who respond within a couple of hours, and as time passes, slips get more and more responses, pushing the response rate higher. Eventually it reaches the Global Average Response Rate, which is around 75%.
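
The data behind a curve like this can be sketched in a few lines of Python. The function below is an illustration only: the input format (a list of response times in hours plus the total number of slips sent) and the sample numbers are invented for the example.

```python
def cumulative_response_curve(response_hours, total_sent, max_hour=30):
    """Return [(hour, pct_responded_by_that_hour)] for hour = 0..max_hour."""
    curve = []
    for hour in range(max_hour + 1):
        # Count every response that had arrived by this hour.
        responded = sum(1 for h in response_hours if h <= hour)
        curve.append((hour, 100.0 * responded / total_sent))
    return curve


# Invented sample: 10 slips sent, 7 answered, 3 never answered.
response_hours = [0.5, 1.5, 2.0, 6.0, 20.0, 45.0, 200.0]
total_sent = 10

curve = cumulative_response_curve(response_hours, total_sent)
```

Each point is the percentage of all slips sent that had been answered by that hour, so the curve can only stay flat or rise, and it levels off at the overall response rate once the long tail has come in.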

After this, I wanted to understand how responses are distributed over time once the slips are sent. I developed a histogram to answer this question.

Figure 2. Response Frequency per Time to Respond (histogram)

As can be seen from the graph, parent responses have a very long tail; however, most responses are received in the first 30 hours after the slips are sent. At the time of writing this article, around 60% of the responses were received during the initial 30 hours.
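
For readers who want to reproduce this kind of view, the following sketch bins response times into fixed-width time slots and computes the share of responses that arrive within a cutoff. The 10-hour slot width and the sample response times are assumptions chosen for the example; the slot width used for the figure is not stated.

```python
from collections import Counter


def bin_response_times(response_hours, bin_width=10):
    """Count responses per time slot; each key is the slot's starting hour."""
    return Counter(int(h // bin_width) * bin_width for h in response_hours)


def share_within(response_hours, cutoff_hours=30):
    """Fraction of responses received within the cutoff."""
    within = sum(1 for h in response_hours if h <= cutoff_hours)
    return within / len(response_hours)


# Invented response times in hours, with a deliberately long tail.
response_hours = [1, 3, 4, 8, 12, 25, 40, 90, 150, 300]

histogram = bin_response_times(response_hours)
early_share = share_within(response_hours)
```

In this invented sample, six of the ten responses fall within the first 30 hours, while the remaining four are spread thinly across slots far to the right, which is the long-tail shape described above.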

Taking this into account, I decided to calculate the weighted average based on the number of responses received in each time slot. Using this method, I found an average of approximately 9 hours, which is a very impressive number. Additional considerations, such as night hours, could be taken into account to adjust the average response time, but I will not go further into that here.
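
The mechanics of a count-weighted average over time slots can be sketched as follows. This is only one possible reading of the approach: how the time slots were defined, and which responses were included, is not stated here, so the slots, representative hours and counts below are invented purely to show the arithmetic.

```python
def weighted_average(slots):
    """Weighted mean of slot times, weighted by responses in each slot.

    slots: iterable of (representative_hours, response_count) pairs.
    """
    slots = list(slots)
    total = sum(count for _, count in slots)
    if total == 0:
        raise ValueError("no responses to average")
    return sum(hours * count for hours, count in slots) / total


# Invented slots: (representative hour of the slot, responses in that slot).
slots = [(1, 50), (5, 30), (15, 15), (25, 5)]

avg_hours = weighted_average(slots)  # (50 + 150 + 225 + 125) / 100
```

Because each slot contributes in proportion to how many responses it holds, the result depends heavily on which slots are included and how wide they are.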

Answers to many more questions came from following a thorough process in which I drilled down into the data to find useful hints. I grouped parents to identify those who consistently respond late (or don’t respond at all), and looked at how response times and response rates vary with the time of day. All of these features will be rolling out gradually to help schools make better decisions. This goes to show how ParentPaperwork deploys various techniques as it works towards becoming a data-driven organisation.

Stay tuned because over the coming months more news about the upcoming Data Analysis module will be released. The new Analytics Dashboard is now available to all ParentPaperwork schools.