Priori Data wants to be the most transparent company in the world of mobile app intelligence.
Our sector is complicated and still relatively young. We know it can be difficult to choose which provider you actually want. The more transparent we are, the easier it is to decide whether we’re a good fit for you.
Rather than throw a bunch of stats at you without any context, we wanted to open our whole data modelling process up. That way, our big headline stats will make more sense.
This post takes you through five dimensions of mobile app intelligence data. If you know these, you can make an informed decision on which mobile app intelligence service best fits your needs.
Let’s begin with the first question people always ask us: how do you estimate the stats in the platform?
The quick version
When we plot revenues and downloads against rankings, we see a curve:
Using this, we can estimate where other apps will fall within this curve.
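Here's a toy sketch of that idea in Python. The numbers are invented and our production model is more sophisticated, but the shape of the approach is the same: fit a power-law curve to partner downloads by rank, then read estimates for unobserved ranks off the fitted curve.

```python
import numpy as np

# Hypothetical partner data: (rank, actual daily downloads) - not real figures.
ranks = np.array([1, 3, 5, 10, 20, 50, 100])
downloads = np.array([90000, 41000, 27000, 15000, 8200, 3700, 2000])

# Fit a power law downloads ~ a * rank**b via linear regression in log-log space.
b, log_a = np.polyfit(np.log(ranks), np.log(downloads), 1)
a = np.exp(log_a)

def estimate_downloads(rank):
    """Estimate downloads at any rank from the fitted curve."""
    return a * rank ** b

# Estimate for an app we have no partner data for.
print(round(float(estimate_downloads(30))))
```

Because the fit happens in log-log space, a simple straight-line regression is enough to capture the curve's power-law shape.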
The longer version
It’s the best version, and we’ll take you through it, step by step.
There are two main sources of data:
- Data partners. App publishers provide us with their data, and in return we give them discounted access to download and revenue data. Win-win.
- Our data partners currently cover 23,000+ apps. (Click here to join them)
- Public data. This comes from the app stores themselves. We collect the data using lots of servers and a fairly complicated coding effort.
- While this data is publicly available, it is expensive and labour-intensive to collect and organise continuously. That’s why people hire us to do it for them.
- Also, this data is public - but only for one day. We’ve been collecting and aggregating the data for more than three years.
We only collect data on the Google Play Store and the Apple App Store.
For now, these two stores make up the vast majority of all the published apps and downloads out there so we aren’t worried about the alternative app stores².
For the main countries, we have high ranking apps in most categories, which makes our estimates significantly more accurate. In emerging markets, our confirmed data is patchier.
Most people care about the top 50 apps! The more partners with apps in the top 50 that we have data for, the better our estimates become for the apps that most people care about.
Apps ranked below the top 50 apps help us estimate data for apps on the long-tail of that chart I showed you earlier.
Therefore, for us, the most important thing isn’t the volume of data partners, but the ranking of the apps we get direct data for.
A secondary point is the span of rank positions we collect, which determines how many rank positions are "in-sample", so to speak. For example: if we observe rank position 10 and rank position 30, we can say with reasonable confidence what apps ranked between 10 and 30 would be earning as well.
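To illustrate with toy numbers (not real partner data): interpolating between two observed ranks in log-log space gives a well-bracketed in-sample estimate for everything in between.

```python
import numpy as np

# Hypothetical: we observe partner downloads only at ranks 10 and 30.
observed_ranks = np.array([10, 30])
observed_downloads = np.array([15000, 5500])

def in_sample_estimate(rank):
    """Interpolate in log-log space, following the power-law shape of the rank curve."""
    log_est = np.interp(np.log(rank), np.log(observed_ranks), np.log(observed_downloads))
    return float(np.exp(log_est))

# Rank 20 sits between the two observed points, so the estimate is
# guaranteed to fall between 5,500 and 15,000 downloads.
print(round(in_sample_estimate(20)))
```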
We measure the accuracy of our data using the Median Absolute Percentage Error (MAPE). For each partner app, we calculate the percentage error between the actual downloads our partners share and the estimates our models produce, then take the median of those errors for a given country.
In short, the lower the MAPE, the better the accuracy (as long as our sample is big enough³).
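Here's how that calculation looks with invented numbers (note that MAPE here is the Median, not Mean, Absolute Percentage Error):

```python
import statistics

# Hypothetical actual vs. estimated downloads for five partner apps in one country.
actual    = [12000, 8500, 4300, 2100, 950]
estimated = [13100, 8000, 5000, 1900, 1200]

# Per-app absolute percentage error, then the median across the sample.
errors = [abs(a - e) / a * 100 for a, e in zip(actual, estimated)]
mape = statistics.median(errors)
print(f"MAPE: {mape:.1f}%")  # prints "MAPE: 9.5%"
```

Half the apps in this toy sample have a percentage error below 9.5%, which is exactly how the country-level figures in our transparency report should be read.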
The ranking of partner apps matters most for estimate accuracy; for scope, the country of origin and the diversity of categories are vital.
Most services out there have decent coverage of big Western markets. We pride ourselves on sourcing data partners from all over the world: from major markets in other regions - like Russia - to great test markets like Canada and Singapore.
Even if we had great partners in every country for, say, Finance apps, our scope would be pretty limited. Because our data partners come from many different categories, our scope is extensive.
You can find our coverage broken down on our public transparency report. That gives you our accuracy by country and by category.
There are two things that prompt us to make major changes to our models:
1. When we get a significant amount of new data from partners.
In this case, we simply refit the model - using an unchanged methodology - for all historical time periods.
The more data we pull in from data partners, the better our estimates get.
So far we’ve had more than 1,400 new partners accounting for 23,000 apps and growing - we definitely want to include that volume of data!
That’s why, when we get significant quantities of data from new partners, we refit our models.
It isn’t just about adding X amount of new apps to our data set. Getting 100,000 apps ranked below the top 1000 in their category and country wouldn’t significantly improve our estimates.
2. When we change our methodology.
That could be the actual model equation, type of fitting algorithm we're using, or any sort of post-processing (e.g. how we smooth the raw predictions). This also impacts all historical data.
The release notes typically specify what kind of model update it is.
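As a toy illustration of one kind of post-processing - a rolling-median smoother that damps one-day spikes in raw predictions (our actual smoothing step may differ):

```python
import statistics

def smooth(predictions, window=3):
    """Rolling-median smoothing of raw daily estimates."""
    half = window // 2
    return [
        statistics.median(predictions[max(0, i - half): i + half + 1])
        for i in range(len(predictions))
    ]

raw = [100, 104, 400, 101, 98, 103]  # one-day spike at index 2
print(smooth(raw))  # the 400 spike is replaced by the median of its neighbourhood
```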
How to use our transparency report
Let’s take a concrete example using our transparency report to see how we can assess the data for a given country: Germany.
Sample data from August 1st to 31st
(Remember: the lower the MAPE, the more accurate the data).
One of the most important and concrete proxies for validating our data accuracy is the best rank observed in the Overall category for a given country.
In Germany, we observed a best rank of #16 in the Google Play Store and #3 in the Apple App Store over the whole month of August, among around 1.8k partner apps in the Google Play Store and 2.7k in the Apple App Store.
We cover 98% of the categories in both stores, and 51% (Google Play) and 55% (Apple) of our partner apps ranked at least once in the top 10 for their primary category.
As a result, the MAPE is around 13.5% on Google Play and 19.3% on iOS.
In other words, among our sample of partner apps in Germany, half have a percentage error below 13.5% on Google Play (19.3% on iOS) when comparing actual and estimated downloads for August 2017.
There is a lot of ambiguity in the mobile app data industry.
Companies do not disclose their algorithms or exact methodologies because that would reveal their competitive advantage (or disadvantage). Plus, it is difficult for most people to put two algorithms side by side and compare them.
Simply publishing algorithms without any context isn’t actually useful for users.
We have broken down the data science underpinning our services into five dimensions. These dimensions are things that can be compared across different competitors in our space.
If you have any further questions, please get in touch. If you want help explaining all this to somebody else in your organisation, please click the link below.
Click the image to download