Analyzing Android Apps at Scale: A Study of Top 50,000 Apps on Google Play Store (Part 1)

The growth of smartphone usage worldwide is nothing less than remarkable, with Android being the most popular mobile Operating System globally. According to the latest statistics, Android’s market share of smartphones worldwide was 71.4% in 2023 Q1. The numbers do not stop here; Android phone users lead the charts in most countries, including the US. The rapid increase in Android phone usage is mainly attributed to the open-source nature and affordability, making it easily accessible to people of all classes.

With the ever-growing number of Android users and applications, the concern for user security and privacy grows too. “How secure are all these applications?” is the question we try to answer in this blog post, through a mass scan of the top 50,000 most downloaded Android applications on the Google Play Store, ranked by all-time installs.

In our effort to analyze these applications, we faced many challenges. A primary motive of this publication is to enable researchers from the community to reproduce the findings or take them to the next level. Therefore, this blog not only discusses what was done and what was found, but also how we did it. Let’s dig into it!

Objectives

In order to analyze security issues and misconfigurations in the top 50,000 applications, we had two primary prerequisites –

  1. List of package names for the top 50,000 most downloaded applications from Google Play Store (e.g., com.whatsapp is the package name for WhatsApp). However, there is no publicly available list of Android applications ranked by number of downloads/installs. Moreover, scraping the Google Play Store has its limits, since Google lists only around 500-600 applications in each Play Store category (Android and Web). That listing is also not ordered by reviews, installs, etc. Instead, it considers factors like trending, recent popularity, etc.
  2. Source code of the top 50,000 most downloaded applications from Google Play Store – The only way to access the source code is to download, de-compile and reverse the APK files.

Both of the above-listed problem statements were tricky.

For the first problem, we searched for free lists of ranked applications in terms of the number of installs/downloads, but the minimum number of applications we wanted to take into consideration (50,000) was way more than what was available on the Internet. An example of this was the Android Rank list with just 500 applications. The inevitable solution to this problem was to create our own list by scraping the metadata of all the applications we knew using the Google Play Store. But, before that, we would need an extensive list of all the package names we can gather that possibly exist on the Google Play Store.

For the second, we decided to extract AndroidManifest.xml for most of our analysis. We also considered reversing the Java classes and running them through semgrep rules, which can be covered in a different post.

To proceed, from a high-level perspective, we formulated the following steps:

  1. Fetch as many package names as possible.
  2. Validate these packages for presence on Google Play Store.
  3. Fetch the metadata of the validated packages (primarily the maximum number of downloads/installs).
  4. Rank the applications using the metadata and create a list of the top 50,000 most downloaded Android applications.
  5. Download the APK file for the above applications.
  6. Extract AndroidManifest.xml and import it into a database.
  7. Analyze the data and deduce results.

Approach

Step 1 – Fetching Package Names

This was the step where we experimented the most. We extensively tested the Google Play Store and third-party Android app stores like APKPure, APKMirror, apk-dl, Aptoide, etc. for ways to enumerate packages and, fortunately, found that APKPure’s sitemap.xml exposes precisely what we were looking for.

The sitemap points to thousands of gzip-compressed XML files, each containing hundreds of package names.

Using a simple bash script, we generated a list of 12,102,707 package names. This list represents packages available on APKPure, not necessarily on the Google Play Store. But since we now have the package names of more than 12 million Android applications, we can query each of them on the Google Play Store. If the application is available there, it is considered valid; otherwise, we remove it from the list.
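The extraction itself was done with a bash script that is not shown here. As a rough Python equivalent, the sketch below decompresses one sitemap file and collects package names from its <loc> entries. It assumes the package name is the last path segment of each listed URL, which is a guess about the sitemap layout rather than something stated above; the function name is ours.

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard namespace used by sitemap XML files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_package_names(gz_bytes: bytes) -> set:
    """Return the package names found in one gzip-compressed sitemap file."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    packages = set()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        # Assumption: the package name is the last path segment of the URL.
        path = urlparse((loc.text or "").strip()).path
        last_segment = path.rstrip("/").rsplit("/", 1)[-1]
        # Keep only segments that look like a dotted package name.
        if "." in last_segment:
            packages.add(last_segment)
    return packages
```

Running this over every downloaded sitemap file and taking the union of the returned sets yields a de-duplicated candidate list.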

Note that we also fetched the files listed in sitemap.xml on Google Play Store and apk-dl but we didn’t find any valid package name in the case of Google Play Store, and a lot of popular apps were missing from the package names retrieved from apk-dl.

Step 2 – Validating Package Names

“It’s easier said than done” is what we’ve observed throughout this journey.

Theoretically, if an HTTP request to https://play.google.com/store/apps/details?id=com.whatsapp (where com.whatsapp is the package name) returns 200 as the status code in the HTTP response, that means the package exists on Google Play Store, else, it does not. Simple, right? No, that’s where Google’s rate limiting kicks in. This is when status codes like 204 and 429 were observed.
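A minimal sketch of that check, in Python rather than the Go we actually used: it builds the details URL for a package and maps a response status code to an outcome. Treating 404 as "missing" and everything else as "unknown" is our assumption for illustration; the text above only distinguishes 200 from the rate-limit codes 204 and 429.

```python
from urllib.parse import urlencode

PLAY_DETAILS_URL = "https://play.google.com/store/apps/details"


def details_url(package: str) -> str:
    """Build the Play Store details URL for a package name."""
    return f"{PLAY_DETAILS_URL}?{urlencode({'id': package})}"


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a validation outcome."""
    if status_code == 200:
        return "exists"
    if status_code in (204, 429):
        # Codes observed once rate limiting kicked in; retry later.
        return "rate_limited"
    if status_code == 404:
        # Assumption: a missing package answers with 404.
        return "missing"
    return "unknown"
```

Only packages classified as "exists" would move on to the next step; "rate_limited" ones go back into the queue.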

After diving deep into the matter and several trials and errors, it seemed like a basic IP address-based rate limiting. As a workaround, we used FireProx, which uses AWS API Gateways to proxy requests to any proxied URL endpoint. With each request, a new IP from the AWS pool of public IPv4 addresses is used, which was exactly what we needed to overcome Google Play Store’s rate limiting.

This is what FireProx in action looks like:

Using FireProx and a Go program, we generated a list of validated package names that were available on the Google Play Store. The result was a total of 3,560,749 applications. But we still did not know which of these were the top 50,000 most downloaded applications. To find out, we needed to fetch the metadata for more than 3 million applications from the Google Play Store.

Note: AWS API Gateway is a paid service with additional charges for the VPC outbound traffic. Read more about it here.

Step 3 – Getting Metadata

This step required working around Google Play Store’s rate limiting again, which we had already sorted out in the previous step. Fortunately, we found an open-source project called google-play-scraper, written in Golang, that we used to fetch the metadata for our validated package list. The project required some modification, though, as the upstream code base scrapes the metadata from https://play.google.com/store/apps/details, which is rate limited. We modified this part of the code to use the FireProx-generated URL(s) from the previous step. Our modified version of google-play-scraper can be found here.
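The change amounts to swapping the base URL. The Python sketch below shows the idea only; it is not the Go patch itself. The gateway URL in the usage note is a made-up placeholder, and we assume the gateway was created to proxy https://play.google.com/store/apps/.

```python
def through_gateway(url: str, target_base: str, gateway_base: str) -> str:
    """Rewrite a URL under target_base so it is requested via gateway_base."""
    if not url.startswith(target_base):
        raise ValueError(f"{url!r} is not under {target_base!r}")
    # Keep everything after the proxied base (path remainder and query).
    remainder = url[len(target_base):].lstrip("/")
    return gateway_base.rstrip("/") + "/" + remainder
```

For example, with a hypothetical gateway at https://abc123.execute-api.us-east-1.amazonaws.com/fireprox/, the details URL for com.whatsapp becomes https://abc123.execute-api.us-east-1.amazonaws.com/fireprox/details?id=com.whatsapp.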

Using FireProx and a Go program, we fetched metadata for the list of validated packages from the previous step. The output data looks like the following:

The output JSON file was imported into a MongoDB instance using the following command:

mongoimport -h <host> -d <db_name> -c <collection-name> -u <user> -p <password> --authenticationDatabase <auth_db_name> --file <output.json>

Step 4 – Ranking

To get the top 50,000 most downloaded Google Play Store applications, we ranked the data that was populated into MongoDB using a Jupyter Notebook available here. In the given metadata, there’s a JSON key, “InstallsMax”, reflecting the total maximum number of downloads of an application since its launch. We simply sorted all the applications by this value in descending order and exported the top 50,000 results into a CSV file, which was later converted to a TXT file with one package name per line, in ranked order.
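The ranking logic is small enough to sketch without the notebook. The version below uses only the Python standard library instead of the notebook's tooling; "InstallsMax" is the key named above, while the "package" field name and both function names are placeholders of ours.

```python
import heapq


def top_packages(records, limit=50_000):
    """Return package names of the `limit` records with the highest InstallsMax."""
    ranked = heapq.nlargest(
        limit,
        records,
        # Records without an install count sort last.
        key=lambda record: record.get("InstallsMax") or 0,
    )
    # Hypothetical field name for the package identifier.
    return [record["package"] for record in ranked]


def write_ranked_list(path, packages):
    """Write one package name per line, preserving rank order."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(packages) + "\n")
```

heapq.nlargest avoids sorting all 3.5 million records when only the top 50,000 are needed, though a full sort would work just as well at this scale.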

Step 5 – Downloading APK(s)

In simpler terms, to download an APK using the package name, one needs to request https://d.apkpure.com/b/APK/com.whatsapp?version=latest (behind Cloudflare) which redirects the client to another URL on another domain, winudf.com (no Cloudflare protection) which looks like https://d-11.winudf.com/b/APK/Y29tLmhhcHB5bW9kLmFwa18xNjFfZTBlZDI3YjE?_fn=SGFwcHlNb2RfMi45LjJfQXBrcHVyZS5hcGs&_p=Y29tLmhhcHB5bW9kLmFwaw%3D%3D&download_id=1823403701104292&is_hot=true&k=43ef6b204bee3d411b71a337b4141824649a3f0d. Our approach was to query the first link with our list of package names and get the final download link (*.winudf.com) for each application. The final links would be queued to a CLI download utility like aria2c to download the APK files.
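The two-stage lookup can be sketched as follows. This Python snippet deliberately leaves out the HTTP call: it builds the first URL, picks the final link out of a redirect response's status and headers, and formats links for aria2c's input-file mode (`aria2c -i`). Function names are ours, and accepting only hosts under winudf.com is an assumption based on the example URL above.

```python
from typing import Optional
from urllib.parse import quote, urlparse

REDIRECT_CODES = {301, 302, 303, 307, 308}


def apkpure_url(package: str) -> str:
    """First-stage URL that redirects to the actual download host."""
    return f"https://d.apkpure.com/b/APK/{quote(package)}?version=latest"


def final_link(status: int, headers: dict) -> Optional[str]:
    """Return the redirect target if it points at a winudf.com host."""
    if status not in REDIRECT_CODES:
        return None
    location = headers.get("Location", "")
    host = urlparse(location).hostname or ""
    return location if host.endswith(".winudf.com") else None


def aria2_input(links: dict) -> str:
    """Format {package: url} as an aria2c input file.

    Each URI line is followed by an indented option line naming the output.
    """
    return "".join(f"{url}\n  out={pkg}.apk\n" for pkg, url in links.items())
```

The first request has to be made with redirects disabled so that the Location header can be read rather than followed.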

As simple as it might sound, this was the most resource-intensive step, both in terms of storage and Internet bandwidth. But that was not the primary problem for us. Rather, it was the fact that APKPure (*.apkpure.com) uses Cloudflare to:

  1. Detect popular bot traffic
  2. Rate limit the number of requests

First, we figured out that using a proper browser User-Agent like Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0 allows us to bypass the bot detection and CAPTCHA challenges.

For the second, AWS API Gateway should’ve done the trick, right? Unfortunately not. According to our test cases and understanding, Cloudflare does not necessarily treat AWS traffic well. FireProx broke because of Cloudflare’s bot detection and CAPTCHA challenge page. We tried injecting the User-Agent into the AWS API Gateway proxy, but that didn’t work either.

As a solution, we tried other cloud services with potential IP rotation, like Cloudflare Workers, AWS Lambda, and Digital Ocean Functions. Our findings were –

  1. Cloudflare Workers do not rotate IP on each outbound request. The multiple Workers we created all used the same outbound IP.
  2. AWS Lambda rotates IP on function update, which can be invoked with each function trigger, but we could not bypass APKPure’s Cloudflare protection with this.
  3. Digital Ocean Functions do not rotate IP with each outbound request, but each function routes its outbound traffic through a different IP.

That was it. We just needed a lot of Digital Ocean Python functions under the same namespace to fetch us the *.winudf.com final download link against each package name. The function code can be found here.

Using a Python script, we downloaded the APK files for the top 50,000 applications.

Step 6 – Extracting AndroidManifest.xml

To extract AndroidManifest.xml, we used axmldec. The resulting XML outputs were converted to JSON using a Go program.
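The Go converter is not reproduced here. The Python sketch below shows one way to get the same shape of output: attributes become keys with a "-" prefix and the android: namespace is dropped, which matches key names such as application.-debuggable used in the results. Placing the children of <manifest> at the top level of the document is our assumption, inferred from those key names.

```python
import json
import xml.etree.ElementTree as ET


def _local(name: str) -> str:
    """Drop a '{namespace}' prefix such as the android: namespace URI."""
    return name.rsplit("}", 1)[-1]


def element_to_dict(elem: ET.Element) -> dict:
    """Convert an element to a dict; attributes get a '-' prefix."""
    node = {f"-{_local(key)}": value for key, value in elem.attrib.items()}
    for child in elem:
        tag = _local(child.tag)
        value = element_to_dict(child)
        if tag in node:
            # Repeated tags (e.g. several <activity>) collect into a list.
            if not isinstance(node[tag], list):
                node[tag] = [node[tag]]
            node[tag].append(value)
        else:
            node[tag] = value
    return node


def manifest_to_json(xml_text: str) -> str:
    """Serialize a decoded AndroidManifest.xml as a JSON document."""
    return json.dumps(element_to_dict(ET.fromstring(xml_text)))
```

A tag that appears once becomes a dict and a repeated tag becomes a list, so any later analysis has to handle both shapes.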

The output JSON file (manifest.json) was imported into a MongoDB instance using the following command:

mongoimport -h <host> -d <db_name> -c <collection-name> -u <user> -p <password> --authenticationDatabase <auth_db_name> --file <output.json>

Step 7 – Analysis

We used this Jupyter Notebook to deduce the statistics discussed below.

Results

1. How many applications had android:debuggable flag set to true?

Out of the top 50,000 applications on Google Play Store, we found that 5 (0.01%) of them are debuggable. Interestingly, although Google Play Store does not let a developer upload an APK with the android:debuggable flag set to true, the 5 applications we found were significantly old, the most recent of them having been last updated in 2018.

This number was derived by importing the MongoDB collection into a pandas DataFrame and selecting the rows where application.-debuggable was found to be true.
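The same count can be sketched without pandas. The snippet below assumes each manifest has been loaded as a nested dict whose attribute keys carry a "-" prefix and whose values are strings, as the key name above suggests; the helper names are ours.

```python
def get_path(doc, dotted):
    """Follow a dotted key path such as 'application.-debuggable'."""
    node = doc
    for key in dotted.split("."):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node


def count_flag(manifests, dotted, expected="true"):
    """Return (hits, percentage) of manifests where the path equals expected."""
    hits = sum(1 for manifest in manifests if get_path(manifest, dotted) == expected)
    share = 100 * hits / len(manifests) if manifests else 0.0
    return hits, share
```

The same helper applies to any single-valued flag on the application element, for instance count_flag(manifests, "application.-allowBackup").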

The android:debuggable attribute defines whether the application can be debugged or not. If an application is marked as debuggable, its data can be accessed by assuming the privileges of that application, and arbitrary code can be executed under that application’s permissions. In the case of a non-debuggable application, the device needs to be rooted first to extract any data.

Android applications that are not yet in production are expected to have this attribute set to true to assist developers; however, before the actual release of the application, it should be set to false.

2. How many applications had android:allowBackup flag set to true?

Out of the top 50,000 applications on Google Play Store, we found that 21,285 (42.57%) of them allow backups.

This number was derived by importing the MongoDB collection into a pandas DataFrame and selecting the rows where application.-allowBackup was found to be true.

The android:allowBackup attribute defines whether application data can be backed up and restored by a user who has enabled USB debugging. If the backup flag is set to true, it allows an attacker to take the backup of the application data via adb even if the device is not rooted. Therefore applications that handle and store sensitive information such as card details, passwords, etc. should have this setting explicitly set to false to prevent such risks.

3. How many applications had android:exported flag for activity, receiver, service and/or provider set to true?

Out of the top 50,000 applications on Google Play Store, we found that 44,799 (89.598%) of them have an activity, service, content provider, and/or broadcast receiver exported.

This number was derived by importing the MongoDB collection into a pandas DataFrame and selecting the rows where application.activity.-exported, application.receiver.-exported, application.service.-exported or application.provider.-exported was found to be true.
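Because an application can declare one or many components of each type, this check needs to handle both a single dict and a list under each tag. A plain-Python sketch, assuming manifests are nested dicts with "-"-prefixed attribute keys as in the key names above; it counts only components that set android:exported explicitly.

```python
COMPONENT_TAGS = ("activity", "receiver", "service", "provider")


def as_list(value):
    """Normalize a missing, single, or repeated entry to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_exported_component(manifest):
    """True if any component explicitly sets android:exported to true."""
    app = manifest.get("application")
    if not isinstance(app, dict):
        return False
    return any(
        component.get("-exported") == "true"
        for tag in COMPONENT_TAGS
        for component in as_list(app.get(tag))
        if isinstance(component, dict)
    )
```

Summing has_exported_component over all manifests gives the number of applications with at least one explicitly exported component.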

Depending on the functionality, an application can launch a service, perform an activity, receive content from another source, or receive intents by phone or by other applications. These can all be exported. Therefore, all of them should be reviewed to ensure they don’t perform any sensitive action and that they are protected by appropriate permissions as otherwise information could be exposed to malicious third parties.

4. How many applications had android:usesCleartextTraffic flag set to true?

Out of the top 50,000 applications on Google Play Store, we found that 13,684 (27.368%) of them use clear text traffic.

This number was derived by importing the MongoDB collection into a pandas DataFrame and selecting the rows where application.-usesCleartextTraffic was found to be true.

An unsecured communications channel between an app and any back-end services can expose the data transmitted between them. Android 6.0 and later makes it easier to prevent an app from using cleartext network traffic (e.g., HTTP and FTP without TLS) by setting the android:usesCleartextTraffic attribute to false.

5. How many applications had android:protectionLevel flag set to normal or dangerous but not signature?

Out of the top 50,000 applications on Google Play Store, we found that 48,513 (97.026%) of them have permission protection levels set to either normal or dangerous but not signature.

This number was derived by importing the MongoDB collection into a pandas DataFrame and selecting the rows where permission.-protectionLevel was found not to be signature.

The android:protectionLevel attribute defines the procedure that the system should follow before it grants permission to the application that has requested it. Of the values that can be used with this attribute, three matter here:

  • normal – this permission is “not visible” to the user. A user doesn’t know that the app is requesting access to this resource because the rights are automatically granted during installation.
  • dangerous – this level requires the user to confirm if an app can access a particular resource.
  • signature – the app that uses this permission level must be signed with the same certificate as the app that declared it. This prevents an attacker from simply adding a uses-permission declaration for this level, because Android will not grant the permission to an app signed with a different certificate.

All the permissions that the application requests should be reviewed to ensure that they don’t introduce a security risk.

6. How many applications had android:minSdkVersion flag set to an integer less than or equal to 15?

Out of the top 50,000 applications on Google Play Store, we found that 3,232 (6.464%) of them require the minimum SDK version to be less than or equal to 15.

This number was derived by importing the MongoDB collection into a pandas DataFrame and selecting the rows where uses-sdk.-minSdkVersion was less than or equal to 15.
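Since attribute values come out of the manifest as strings, the comparison needs an integer conversion first. A small sketch, again assuming a nested dict with "-"-prefixed attribute keys; manifests with a missing or non-numeric value are treated as not matching, which is our choice and not necessarily what the notebook does.

```python
def min_sdk_at_most(manifest, threshold=15):
    """True if uses-sdk.-minSdkVersion parses as an integer <= threshold."""
    sdk = manifest.get("uses-sdk")
    if not isinstance(sdk, dict):
        return False
    try:
        return int(sdk.get("-minSdkVersion")) <= threshold
    except (TypeError, ValueError):
        # Missing or non-numeric value: do not count it.
        return False
```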

The android:minSdkVersion attribute defines the minimum SDK version that is compatible with the application using an API-level integer. API level 15 (Android 4.0.3) and anything below it is considered obsolete.

7. What are the top 10 packages requiring most number of permissions?

Package Name | Number of Permissions
com.google.android.gms | 286
com.excean.parallelspace | 236
com.trendmicro.tmas | 236
com.excelliance.multiaccounts | 235
com.excelliance.multiaccount | 235
com.excean.parallelspace.b32 | 231
com.excelliance.multiaccounts.b32 | 231
com.excelliance.multiaccount.assist | 231
dual.multi.clone.apps.accounts.lnstagram.whatsapp | 231
com.samsung.android.waterplugin | 224

Google Play services tops the chart for the most permissions required. This is justified because it is one of the core components of Android smartphones. It is followed by a couple of applications that let an Android user run multiple isolated instances of the same application in order to log in to multiple accounts. These applications perform userland isolation and management tasks that require a vast number of permissions. The appearance of the Galaxy Watch4 Plugin app is interesting to note as well. The tight ecosystem between the smartphone and the smartwatch may be the reason why this application requires more than 200 permissions.

The numbers are derived from the android:name attribute of the <uses-permission> elements.
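Both permission rankings in this post (permissions per package, and packages per permission) can be derived from the same field. The sketch below assumes manifests are nested dicts with "-"-prefixed attribute keys and the package name under "-package". De-duplicating permissions within a package is our choice; the notebook may count differently.

```python
from collections import Counter


def requested_permissions(manifest):
    """Return the set of permission names a manifest requests."""
    entries = manifest.get("uses-permission") or []
    if isinstance(entries, dict):
        # A single <uses-permission> element shows up as a dict, not a list.
        entries = [entries]
    return {e["-name"] for e in entries if isinstance(e, dict) and "-name" in e}


def top_packages_by_permissions(manifests, n=10):
    """Rank packages by how many distinct permissions they request."""
    counts = Counter(
        {m.get("-package", "?"): len(requested_permissions(m)) for m in manifests}
    )
    return counts.most_common(n)


def top_permissions(manifests, n=10):
    """Rank permissions by how many packages request them."""
    counts = Counter()
    for manifest in manifests:
        counts.update(requested_permissions(manifest))
    return counts.most_common(n)
```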

8. What are the top 10 dangerous permissions?

Permission Name | Number of Packages
android.permission.WRITE_EXTERNAL_STORAGE | 27304
android.permission.READ_EXTERNAL_STORAGE | 22401
android.permission.CAMERA | 11675
android.permission.READ_PHONE_STATE | 10949
android.permission.ACCESS_COARSE_LOCATION | 10282
android.permission.ACCESS_FINE_LOCATION | 9938
android.permission.RECORD_AUDIO | 6706
android.permission.GET_ACCOUNTS | 4605
android.permission.CALL_PHONE | 2206
android.permission.WRITE_CONTACTS | 1365

These are the top 10 permissions with the dangerous protection level. More than 20,000 of the top 50,000 applications on Google Play Store require the WRITE_EXTERNAL_STORAGE and READ_EXTERNAL_STORAGE runtime permissions.

The numbers are derived from the android:name attribute of the <uses-permission> elements.

9. What are the top 10 packages with most number of imported libraries?

Package Name | Number of Libraries Imported
com.sonyericsson.music | 15
com.tcl.MultiScreenInteraction_TV | 15
com.tencent.mm | 12
com.tcl.tvweishi | 11
com.sec.android.app.music | 10
com.google.android.apps.camera.services | 9
com.tcl.usercenter | 9
com.tcl.waterfall.overseas | 9
com.vzw.hss.myverizon | 9
com.delta.mobile.android | 8

Music by Sony Corporation and MagiConnect T-cast TV Services top the chart for the most libraries imported. Both applications use 15 third-party libraries.

The numbers are derived from the android:name attribute of the <uses-library> elements.

9. What are the top 10 most popular libraries imported?

Library Name | Number of Packages
androidx.window.extensions | 6186
androidx.window.sidecar | 6186
org.apache.http.legacy | 4418
android.test.runner | 552
androidx.camera.extensions.impl | 338
android.test.base | 270
com.google.android.maps | 270
android.test.mock | 253
com.sec.android.app.multiwindow | 246
xperiaplaycertified | 73

These are the top 10 shared libraries most used by the top 50,000 applications on Google Play Store. The Jetpack and Apache HTTP Client libraries are each used by over 4,000 applications.

The numbers are derived from the android:name attribute of the <uses-library> elements.

10. What are the top 10 permissions used in the given data set?

Permission Name | Number of Applications
android.permission.INTERNET | 47798
android.permission.ACCESS_NETWORK_STATE | 47522
android.permission.WAKE_LOCK | 43209
android.permission.FOREGROUND_SERVICE | 32649
android.permission.RECEIVE_BOOT_COMPLETED | 31344
android.permission.ACCESS_WIFI_STATE | 30106
android.permission.WRITE | 28085
android.permission.READ | 27801
android.permission.WRITE_EXTERNAL_STORAGE | 27304
android.permission.VIBRATE | 23903

These are the top 10 most used permissions overall. An interesting thing to note is that only 1 of these belongs to the dangerous protection level and 6 belong to the normal protection level.

The numbers are derived from the android:name attribute of the <uses-permission> elements.

11. What are the top 10 API(s) attributes?

API Attribute | Number of Instances
com.google.android.backup.api_key | 658
com.amap.api.v2.apikey | 105
com.vivo.push.api_key | 75
com.baidu.lbsapi.API_KEY | 68
api_key | 44
spay_debug_api_key | 30
appsflyer_api_key | 19
com.snap.camerakit.api.token | 16
com.calldorado.apiToken | 15
net.singular.api_key | 15

AndroidManifest.xml contains meta-data elements, each of which is essentially a name-value pair for an item of additional, arbitrary data that can be supplied to the parent component. We looked for various keywords in the meta-data keys, API being one of them. We then eliminated null values to get a list of the top 10 API attributes found in the meta-data.

As a result, we found more than 600 com.google.android.backup.api_key API keys and over 100 com.amap.api.v2.apikey API keys. More are listed in the table above.
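A keyword search of this kind can be sketched as a recursive walk over each manifest. The snippet assumes manifests are nested dicts with "-"-prefixed attribute keys, so a meta-data entry looks like {"-name": ..., "-value": ...}. Matching case-insensitively and skipping empty values are our assumptions about what "eliminated null values" means in practice.

```python
from collections import Counter


def iter_meta_data(node):
    """Yield every meta-data entry found anywhere in a manifest dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "meta-data":
                entries = value if isinstance(value, list) else [value]
                for entry in entries:
                    if isinstance(entry, dict):
                        yield entry
            else:
                yield from iter_meta_data(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_meta_data(item)


def count_meta_keys(manifests, keyword, n=10):
    """Count meta-data names containing keyword that have a non-empty value."""
    counts = Counter()
    needle = keyword.lower()
    for manifest in manifests:
        for entry in iter_meta_data(manifest):
            name = entry.get("-name") or ""
            if needle in name.lower() and entry.get("-value"):
                counts[name] += 1
    return counts.most_common(n)
```

Calling count_meta_keys(manifests, "api") corresponds to the API search; the same function covers other keywords by changing the second argument.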

12. What are the top 10 other Secret(s) attributes?

Secret Attribute | Number of Instances
google.client.secret | 24
com.alibaba.app.appsecret | 17
com.garena.sdk.twitter.secret | 17
com.twitter.sdk.secret | 15
avatye_appsecret | 13
com.here.sdk.access_key_secret | 9
com.movile.faster.sdk.application_secret | 7
aws.rekognition.secretKey | 5
gateway_secret_test | 4
gateway_secret_online | 4

Similar to the above, we queried for the keyword secret in the meta-data keys and got the result above.

13. What are the top 10 other Key(s) attributes?

Key Attribute | Number of Instances
applovin.sdk.key | 5984
and_aiolos_google_appkey | 157
and_aiolos_appkey | 156
igaworks_hash_key | 98
igaworks_app_key | 95
AppsFlyerLib_key | 49
trackingio_app_key | 38
adpopcorn_ssp_app_key | 38
adpopcorn_app_key | 36
presage_key | 36

Finally, we queried for the keyword key in the meta-data keys and got the result above.

Conclusion

Through this research study, we came across a lot of interesting insights about the top 50,000 Android applications on Google Play Store. It is interesting to analyze real instances of some of the bad practices in Android application development. In Part 1, we focused only on insights related to AndroidManifest.xml. Code-based vulnerabilities, hard-coded secrets, cryptographic weaknesses, etc. will be analyzed in Part 2, which continues this blog post. So, stay tuned!
