Leaky Forms: A Study of Email and Password Exfiltration Before Form Submission
Presented at USENIX Security'22
Email addresses, or identifiers derived from them, are known to be used by data brokers and advertisers for cross-site, cross-platform, and persistent identification of potentially unsuspecting individuals. In order to find out whether access to online forms is misused by online trackers, we present a measurement of email and password collection that occurs before form submission on the top 100K websites.
Paper » Source code » Browser add-on »
📈 Highlights
- Users' email addresses are exfiltrated to tracking, marketing and analytics domains before form submission and before giving consent on 1,844 websites when visited from the EU and 2,950 when visited from the US.
- We found incidental password collection on 52 websites by third-party session replay scripts. (These issues were fixed thanks to our disclosures).
- In a follow-up investigation, we found that Meta (formerly, Facebook) and TikTok collect hashed personal information from web forms even when the user does not submit the form and does not give consent.
📽️ Screen Captures
Full list of screen captures:
🏁 Findings
Top ten websites where email addresses are leaked to tracker domains
EU | |||
---|---|---|---|
Rank | Website | Third-party | Hash/encoding/compression |
154 | *usatoday.com | taboola.com | Hash (SHA-256) |
242 | *trello.com | bizible.com | Encoded (URL) |
243 | *independent.co.uk | taboola.com | Hash (SHA-256) |
300 | shopify.com | bizible.com | Encoded (URL) |
328 | marriott.com | glassboxdigital.io | Encoded (BASE-64) |
567 | *newsweek.com | rlcdn.com | Hash (MD5, SHA-1, SHA-256) |
705 | *prezi.com | taboola.com | Hash (SHA-256) |
754 | *branch.io | bizible.com | Encoded (URL) |
1,153 | prothomalo.com | facebook.com | Hash (SHA-256) |
1,311 | codecademy.com | fullstory.com | Unencoded |
1,543 | *azcentral.com | taboola.com | Hash (SHA-256) |
US | |||
---|---|---|---|
Rank | Website | Third-party | Hash/encoding/compression |
95 | issuu.com | taboola.com | Hash (SHA-256) |
128 | businessinsider.com | taboola.com | Hash (SHA-256) |
154 | usatoday.com | taboola.com | Hash (SHA-256) |
191 | time.com | bouncex.net | Compression (LZW) |
196 | udemy.com | awin1.com | Hash (SHA-256 with salt) |
196 | udemy.com | zenaps.com | Hash (SHA-256 with salt) |
217 | healthline.com | rlcdn.com | Hash (MD5, SHA-1, SHA-256) |
234 | foxnews.com | rlcdn.com | Hash (MD5, SHA-1, SHA-256) |
242 | trello.com | bizible.com | Encoded (URL) |
278 | theverge.com | rlcdn.com | Hash (MD5, SHA-1, SHA-256) |
288 | webmd.com | rlcdn.com | Hash (MD5, SHA-1, SHA-256) |
*: Not reproducible anymore as of February 2022.
1. Site rank: Tranco rank
2. Encoding: Encoding or hash algorithm used when sending the email
3. Website: Hostname of the initially visited website (before a potential redirection)
4. Request Domain: eTLD+1 of the leaky request URL
5. Third Party Entity: Owner of the tracker domain
6. Tracker Category: Category of the tracker domain; this information comes from DuckDuckGo's Tracker Radar dataset
7. Blocklist: Blocklist that detected the tracker. WTM: whotracks.me, uBO: uBlock Origin, DDG: DuckDuckGo, DS: Disconnect
8. Page URL: (last_page) URL of the page where our crawler filled the email field
9. XPath: XPath of the filled email field
10. Id: ID of the email element that our crawler filled

(8, 9 & 10 identify the leaking page and input elements. They can be used for debugging, reproduction, etc.)
Top tracker domains - Email leaks
EU | ||||
---|---|---|---|---|
Entity Name | Tracker Domain | Num. sites | Prom. | Min. Rank |
Taboola | taboola.com | 327 | 302.9 | 154 |
Adobe | bizible.com | 160 | 173.0 | 242 |
FullStory | fullstory.com | 182 | 75.6 | 1311 |
Awin Inc. | zenaps.com | 113 | 48.7 | 2043 |
Awin Inc. | awin1.com | 112 | 48.5 | 2043 |
Yandex | yandex.com | 121 | 41.9 | 1688 |
AdRoll | adroll.com | 117 | 39.6 | 3753 |
Glassbox | glassboxdigital.io | 6 | 31.9 | 328 |
Listrak | listrakbi.com | 91 | 24.9 | 2219 |
Oracle | bronto.com | 90 | 24.6 | 2332 |
LiveRamp | rlcdn.com | 11 | 20.0 | 567 |
SaleCycle | salecycle.com | 35 | 17.5 | 2577 |
Automattic | gravatar.com | 38 | 16.7 | 2048 |
Facebook | facebook.com | 21 | 14.8 | 1153 |
Salesforce | pardot.com | 36 | 30.8 | 2675 |
Oktopost | okt.to | 31 | 11.4 | 6589 |
US | ||||
---|---|---|---|---|
Entity Name | Tracker Domain | Num. sites | Prom. | Min. Rank |
LiveRamp | rlcdn.com | 524 | 553.8 | 217 |
Taboola | taboola.com | 383 | 499.0 | 95 |
Bounce Exchange | bouncex.net | 189 | 224.7 | 191 |
Adobe | bizible.com | 191 | 212.0 | 242 |
Awin | zenaps.com | 119 | 212.0 | 196 |
Awin | awin1.com | 118 | 111.2 | 196 |
FullStory | fullstory.com | 230 | 105.6 | 1311 |
Listrak | listrakbi.com | 226 | 66.0 | 1403 |
LiveRamp | pippio.com | 138 | 65.1 | 567 |
SmarterHQ | smarterhq. | 32 | 63.8 | 556 |
Verizon Media | yahoo. | 255 | 62.3 | 4281 |
AdRoll | adroll.com | 122 | 48.6 | 2343 |
Yandex | yandex.ru | 141 | 48.1 | 1648 |
Criteo SA | criteo.com | 134 | 46.0 | 1403 |
Neustar | agkn.com | 133 | 45.9 | 1403 |
Oracle | addthis.com | 133 | 45.9 | 1403 |
Crawls conducted in May’21. We use prominence to sort third parties in this table because it better represents the scale of a given third party’s reach.
Top tracker domains - Password leaks
EU | ||||
---|---|---|---|---|
Entity Name | Tracker Domain | Num. sites | Prom. | Min. Rank |
Yandex | yandex.com | 37 | 12.12 | 4699 |
Yandex | yandex.ru | 7 | 2.41 | 12989 |
Mixpanel | mixpanel.com | 1 | 0.12 | 84547 |
LogRocket | lr-ingest.io | 1 | 0.12 | 82766 |
US | ||||
---|---|---|---|---|
Entity Name | Tracker Domain | Num. sites | Prom. | Min. Rank |
Yandex | yandex.ru | 45 | 17.23 | 1688 |
Mixpanel | mixpanel.com | 1 | 0.12 | 84547 |
LogRocket | lr-ingest.io | 1 | 0.12 | 82766 |
Crawls conducted in May’21. We use prominence to sort third parties in this table because it better represents the scale of a given third party’s reach.
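This page does not define prominence. The sketch below assumes the definition common in prior web-measurement work: the sum of 1/rank over the sites on which a tracker appears. The scaling factor of 10,000 is an inference made here from the single-site rows above (for example, 10,000 / 84547 ≈ 0.12), not something the page states. The function name is hypothetical.

```python
def prominence(site_ranks, scale=10_000):
    """Sum of 1/rank over the sites embedding a tracker.

    `scale` is an assumed factor, chosen so that the single-site rows
    in the password-leak tables are reproduced.
    """
    return scale * sum(1.0 / rank for rank in site_ranks)


# Single-site rows from the password-leak tables
print(round(prominence([84547]), 2))  # mixpanel.com
print(round(prominence([82766]), 2))  # lr-ingest.io
```

Because each site contributes 1/rank, a tracker on a few highly ranked sites can outrank one on many low-ranked sites, which may explain why Glassbox (6 sites) appears above Listrak (91 sites) in the EU email table.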
Website categories: Per-category number of websites that we crawled, on which we filled an email field, and on which we observed an email leak to a tracker domain
Categories | EU/US Sites | EU Filled sites | EU Leaky sites | EU % (Leaky / Filled) | US Filled sites | US Leaky sites | US % (Leaky / Filled) |
---|---|---|---|---|---|---|---|
Fashion/Beauty | 1669 | 1176 | 131 | 11.1 | 1179 | 224 | 19.0 |
Online Shopping | 5395 | 3658 | 345 | 9.4 | 3744 | 567 | 15.1 |
General News | 7390 | 3579 | 235 | 6.6 | 3848 | 392 | 10.2 |
Software/Hardware | 4933 | 2834 | 138 | 4.9 | 2855 | 162 | 5.7 |
Business | 13462 | 7805 | 377 | 4.8 | 7924 | 484 | 6.1 |
Marketing/Merchandising | 4964 | 3167 | 119 | 3.8 | 3218 | 192 | 6.0 |
Internet Services | 7974 | 4627 | 171 | 3.7 | 4671 | 199 | 4.3 |
Travel | 2519 | 1355 | 46 | 3.4 | 1379 | 82 | 5.9 |
Health | 2516 | 1389 | 44 | 3.2 | 1439 | 69 | 4.8 |
Finance/Banking | 3699 | 1505 | 41 | 2.7 | 1518 | 49 | 3.2 |
Sports | 1910 | 1044 | 28 | 2.7 | 1002 | 56 | 5.6 |
Portal Sites | 1544 | 682 | 17 | 2.5 | 694 | 19 | 2.7 |
Education/Reference | 10190 | 4185 | 88 | 2.1 | 4432 | 134 | 3.0 |
Entertainment | 5297 | 2610 | 47 | 1.8 | 2619 | 98 | 3.7 |
Recreation/Hobbies | 1098 | 754 | 13 | 1.7 | 760 | 95 | 12.5 |
Blogs/Wiki | 5415 | 3095 | 42 | 1.4 | 3055 | 237 | 7.8 |
Technical/Business Forums | 1297 | 717 | 9 | 1.3 | 734 | 17 | 2.3 |
Non-Profit/Advocacy/NGO | 2713 | 1842 | 22 | 1.2 | 1866 | 24 | 1.3 |
Games | 2173 | 925 | 9 | 1.0 | 896 | 11 | 1.2 |
Public Information | 2346 | 1049 | 8 | 0.8 | 1084 | 27 | 2.5 |
Government/Military | 3754 | 939 | 5 | 0.5 | 974 | 7 | 0.7 |
Uncategorized | 1616 | 636 | 3 | 0.5 | 646 | 2 | 0.3 |
Pornography | 1388 | 528 | 0 | 0.0 | 534 | 0 | 0.0 |
Based on desktop crawls using the no-action mode
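As a worked example of the "% (Leaky / Filled)" columns, the sketch below recomputes the percentages for two rows of the table. The helper name is hypothetical, and one-decimal rounding is an assumption based on the displayed values.

```python
def leak_rate(leaky_sites, filled_sites):
    """Percentage of filled sites on which an email leak was observed."""
    return round(100.0 * leaky_sites / filled_sites, 1)


# (EU filled, EU leaky, US filled, US leaky), copied from the table above
rows = {
    "Fashion/Beauty": (1176, 131, 1179, 224),
    "Online Shopping": (3658, 345, 3744, 567),
}

for category, (eu_filled, eu_leaky, us_filled, us_leaky) in rows.items():
    print(category, leak_rate(eu_leaky, eu_filled), leak_rate(us_leaky, us_filled))
```

The denominator is the number of sites where a field was filled, not the number of sites crawled, so the rates describe sites on which a leak could have been observed at all.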
Password leaks: Websites on which we identified incidental password collection by tracker domains. These issues have already been fixed by the involved third parties, thanks to our disclosures.
EU | ||
---|---|---|
Rank | Website | Tracker domain |
84544 | bolshayaperemena.online | yandex.com |
95341 | jolly.me | yandex.com |
71216 | unitedtraders.com | yandex.com |
88449 | strelka.com | yandex.com |
82147 | livedune.ru | yandex.com |
63801 | megabonus.com | yandex.ru |
45753 | galaksion.com | yandex.com |
82766 | publicize.co | lr-ingest.io |
4699 | olymptrade.com | yandex.com |
41729 | www.smartfinstories.biz | yandex.com |
12989 | app.travelpayouts.com | yandex.ru |
73186 | vitaexpress.ru | yandex.com |
73456 | www.rookee.ru | yandex.com |
61411 | www.bajajauto.com | yandex.com |
77435 | www.kupibilet.ru | yandex.com |
55176 | prom.md | yandex.com |
26742 | bitmedia.io | yandex.ru |
85477 | kismia.com | yandex.com |
84547 | www.sellingpower.com | mixpanel.com |
12804 | ctc.ru | yandex.com |
87547 | apphud.com | yandex.com |
4922 | www.exness.com | yandex.com |
15339 | www.giraff.io | yandex.com |
92125 | www.darkorbit.com | bigpoint.net |
54164 | app.zeydoo.com | yandex.com |
37783 | autopiter.ru | yandex.com |
58315 | www.technodom.kz | yandex.com |
30800 | fboom.me | yandex.ru |
33030 | smmplanner.com | yandex.com |
23105 | expertoption.com | yandex.com |
89963 | www.wifimap.io | yandex.com |
98251 | www.youhodler.com | yandex.com |
61734 | spartak.com | yandex.com |
31769 | mybook.ru | yandex.com |
54886 | www.amediateka.ru | yandex.com |
49600 | youdo.com | yandex.com |
32302 | www.forumhouse.ru | yandex.ru |
18107 | www.sibnet.ru | yandex.com |
68865 | www.toyota.ru | yandex.com |
54997 | www.b-kontur.ru | yandex.com |
81703 | www.ps.kz | yandex.com |
58760 | propush.me | yandex.com |
63202 | ofd.astralnalog.ru | yandex.com |
31678 | lingualeo.com | yandex.ru |
77561 | www.ligastavok.ru | yandex.com |
16434 | www.open.ru | yandex.com |
US | ||
---|---|---|
Rank | Website | Tracker domain |
95341 | jolly.me | yandex.ru |
84544 | bolshayaperemena.online | yandex.ru |
88449 | strelka.com | yandex.ru |
71216 | unitedtraders.com | yandex.ru |
82147 | livedune.ru | yandex.ru |
63801 | megabonus.com | yandex.ru |
1688 | www.championat.com | yandex.ru |
45753 | galaksion.com | yandex.ru |
44662 | perm.zarplata.ru | yandex.ru |
16639 | satu.kz | yandex.ru |
82766 | publicize.co | lr-ingest.io |
12989 | app.travelpayouts.com | yandex.ru |
73186 | vitaexpress.ru | yandex.ru |
55176 | prom.md | yandex.ru |
61411 | www.bajajauto.com | yandex.ru |
77435 | www.kupibilet.ru | yandex.ru |
85477 | kismia.com | yandex.ru |
73456 | www.rookee.ru | yandex.ru |
26742 | bitmedia.io | yandex.ru |
87547 | apphud.com | yandex.ru |
15339 | www.giraff.io | yandex.ru |
12804 | ctc.ru | yandex.ru |
92125 | www.darkorbit.com | bigpoint.net |
84547 | www.sellingpower.com | mixpanel.com |
54164 | app.zeydoo.com | yandex.ru |
28758 | secretmag.ru | yandex.ru |
30800 | fboom.me | yandex.ru |
89963 | www.wifimap.io | yandex.ru |
47558 | lentaru.media.eagleplatform.com | yandex.ru |
33030 | smmplanner.com | yandex.ru |
98251 | www.youhodler.com | yandex.ru |
58315 | www.technodom.kz | yandex.ru |
23105 | expertoption.com | yandex.ru |
31769 | mybook.ru | yandex.ru |
49600 | youdo.com | yandex.ru |
54886 | www.amediateka.ru | yandex.ru |
68865 | www.toyota.ru | yandex.ru |
61734 | spartak.com | yandex.ru |
32302 | www.forumhouse.ru | yandex.ru |
18107 | www.sibnet.ru | yandex.ru |
81703 | www.ps.kz | yandex.ru |
54821 | www.viennahouse.com | yandex.ru |
54997 | www.b-kontur.ru | yandex.ru |
63202 | ofd.astralnalog.ru | yandex.ru |
58760 | propush.me | yandex.ru |
31678 | lingualeo.com | yandex.ru |
16434 | www.open.ru | yandex.ru |
EU Mobile leaks: Email address leaks to tracker domains in the EU-mobile crawl, no-action mode
1. Site rank: Tranco rank
2. Encoding: Encoding or hash algorithm used when sending the email
3. Website: Hostname of the initially visited website (before a potential redirection)
4. Request Domain: eTLD+1 of the leaky request URL
5. Third Party Entity: Owner of the tracker domain
6. Tracker Category: Category of the tracker domain; this information comes from DuckDuckGo's Tracker Radar dataset
7. Blocklist: Blocklist that detected the tracker. WTM: whotracks.me, uBO: uBlock Origin, DDG: DuckDuckGo, DS: Disconnect
8. Page URL: (last_page) URL of the page where our crawler filled the email field
9. XPath: XPath of the filled email field
10. Id: ID of the email element that our crawler filled

(8, 9 & 10 identify the leaking page and input elements. They can be used for debugging, reproduction, etc.)
US Mobile leaks: Email address leaks to tracker domains in the US-mobile crawl, no-action mode
1. Site rank: Tranco rank
2. Encoding: Encoding or hash algorithm used when sending the email
3. Website: Hostname of the initially visited website (before a potential redirection)
4. Request Domain: eTLD+1 of the leaky request URL
5. Third Party Entity: Owner of the tracker domain
6. Tracker Category: Category of the tracker domain; this information comes from DuckDuckGo's Tracker Radar dataset
7. Blocklist: Blocklist that detected the tracker. WTM: whotracks.me, uBO: uBlock Origin, DDG: DuckDuckGo, DS: Disconnect
8. Page URL: (last_page) URL of the page where our crawler filled the email field
9. XPath: XPath of the filled email field
10. Id: ID of the email element that our crawler filled

(8, 9 & 10 identify the leaking page and input elements. They can be used for debugging, reproduction, etc.)
Previously Unlisted Tracker Domains: Previously unknown tracker domains that collect email addresses.
Received email samples: In the six-week period following the crawls, we received 290 emails from 88 distinct sites at the email addresses used in the desktop crawls, despite not submitting any form.
Some emails invite us back to their site
Some emails offer a discount to their site
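The sample GDPR requests further down this page refer to addresses of the form `email-prefix+first-party.com@gmail.com`, which suggests that each site was given its own plus-tagged address. Under that assumption, here is a minimal sketch of how such addresses could be generated and how a received email could be mapped back to the site where the address was filled. The function names are hypothetical.

```python
def address_for_site(prefix, hostname, mail_domain="gmail.com"):
    """Build a plus-tagged address that encodes the visited hostname."""
    return f"{prefix}+{hostname}@{mail_domain}"


def site_from_recipient(recipient):
    """Recover the hostname tag from a plus-tagged recipient address.

    Returns None when the local part carries no '+' tag.
    """
    local_part, _, _ = recipient.partition("@")
    _, plus, tag = local_part.partition("+")
    return tag if plus else None


addr = address_for_site("email-prefix", "first-party.com")
print(addr)
print(site_from_recipient(addr))
```

Since the forms were never submitted, any message arriving at a tagged address points to collection on the tagged site before submission.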
Novel compression, encoding and hashing methods: We extended leak detection from prior work to detect new compression, encoding and hashing methods used to exfiltrate email addresses.
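Here is a minimal sketch of the general idea, not the detector used in the study: precompute encodings and hashes of the filled email address, then search each outgoing request for them. The set of transforms is taken from the tables above (URL encoding, Base64, MD5, SHA-1, SHA-256). Compression such as LZW, salted hashes, and layered encodings are omitted. The function names and the example URL are hypothetical.

```python
import base64
import hashlib
import urllib.parse

HEX_DIGESTS = {"md5", "sha1", "sha256"}


def candidate_tokens(email):
    """Map a transform name to the string a leak would contain."""
    raw = email.encode("utf-8")
    return {
        "unencoded": email,
        "url": urllib.parse.quote(email, safe=""),
        "base64": base64.b64encode(raw).decode("ascii"),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def find_leaks(email, request_text):
    """Return the names of the transforms whose output appears in the request."""
    lowered = request_text.lower()
    hits = []
    for name, token in candidate_tokens(email).items():
        # Hex digests may be sent in upper case; other tokens are case-sensitive.
        haystack = lowered if name in HEX_DIGESTS else request_text
        if token in haystack:
            hits.append(name)
    return hits


print(find_leaks("user@example.com",
                 "https://tracker.example/collect?e=user%40example.com"))
# prints ['url']
```

A fixed token set only catches known transforms, so covering additional methods means adding further transforms and combinations to the search.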
⚠️ Leaks to Meta (Facebook) & TikTok
- Both Meta Pixel and TikTok Pixel have a feature called Automatic Advanced Matching [1, 2] that collects hashed personal identifiers from web forms in an automated manner. The hashed personal identifiers are then used to target ads on the respective platforms, measure conversions, or create new custom audiences. (You can read about privacy issues caused by Meta Pixel in The Markup's Pixel Hunt series.)
- According to Meta's and TikTok's documentation, Automatic Advanced Matching should trigger data collection when a user submits a form. Contrary to this claim, we found that both Meta Pixel and TikTok Pixel collect hashed personal data when the user clicks links or buttons that in no way resemble a submit button. In fact, the Meta and TikTok scripts don't even try to recognize submit buttons or listen to (form) submit events. You can view their overly broad and suspiciously similar list of selectors, which designates what page elements will trigger data collection. That means Meta and TikTok Pixel collect hashed personal information even when a user decides to abandon a form and clicks a button/link to navigate away from the page.
- In March 2022, we ran additional crawls of the top 100K websites to detect leaks triggered by unrelated button or link clicks. Our crawler filled the email and password fields, and then clicked on a no-op (non-functional) button that it injected into the page. Injecting and clicking on the no-op button enabled us to detect leaks that would be triggered by Meta and TikTok's Automatic Advanced Matching. We found that 8,438 (US) / 7,379 (EU) sites may leak to Meta when the user clicks on virtually any button or link after filling out a form. In addition, we found 154 (US) / 147 (EU) sites that may leak to TikTok in a similar manner.
- The full scale of leaks to Meta, and all leaks to TikTok, were discovered after finalizing our paper, through the crawls described in the previous bullet point. There are two main reasons for that: 1) TikTok and Meta leaks are slightly different from the rest of the leaks we studied in our paper--they require further interaction with the page after filling out a form. 2) TikTok started beta testing the Automatic Advanced Matching feature in February 2022, long after we finished the crawls used in the paper (May'21). That also means that, unlike the other results in our paper, some of the findings presented above (on Meta & TikTok pixel) are not peer-reviewed.
- We filed a bug report with Meta (25 March 2022), and reached out to TikTok using their Contact the Data Protection Officer form and Request privacy information form (21 April 2022). Meta swiftly responded to our bug report and said they assigned the issue to their engineering team (25 March 2022). TikTok hasn't yet responded to our disclosure. (The report to TikTok was filed more recently, since leaks to TikTok were discovered more recently--in fact while investigating the leaks to Meta.)
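Here is a minimal sketch of how requests could be matched to the injected no-op click described above. It assumes the crawler records each outgoing request with a timestamp, and that the pixel sends the SHA-256 hash of the trimmed, lower-cased email address. Both are assumptions, and the `Request` record, function names, and example URLs are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the page started loading
    url: str
    body: str = ""


def sha256_hex(email):
    """SHA-256 of the trimmed, lower-cased address (assumed normalization)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def leaks_after_click(email, requests, click_time):
    """Requests sent at or after the click that carry the hashed email."""
    token = sha256_hex(email)
    return [
        r for r in requests
        if r.timestamp >= click_time and token in (r.url + " " + r.body).lower()
    ]


email = "user@example.com"
log = [
    Request(1.0, "https://pixel.example/load"),
    Request(5.2, "https://pixel.example/tr?ev=SubscribedButtonClick&em=" + sha256_hex(email)),
]
print(len(leaks_after_click(email, log, click_time=5.0)))  # prints 1
```

Timing alone does not prove that the click caused the request. Comparing against a crawl without the click would be needed to rule out leaks that fire on their own.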
Disclosure to Meta
SubscribedButtonClick event fires on virtually every click, causing PII collection against user intent [Link]
When Automatic Advanced Matching is enabled, SubscribedButtonClick event is fired after clicking virtually any button or link on a page. That means Meta Pixel collects hashed personal information, even when a user decides to abandon a form, and clicks a button/link to navigate away from the page.
According to its official page [1], Automatic Advanced Matching should trigger data collection when a user submits a form: "After the visitor clicks Submit, the pixel's JavaScript code automatically detects and passes the relevant form fields to Facebook."
Unlike what is claimed, Meta Pixel collects hashed personal data when the user clicks links or buttons that in no way resemble a submit button (attached screenshot). In fact, Meta's JavaScript code in question doesn't even try to recognize submit buttons, or listen to (form) submit events. See the attached screen captures:
abcmouse.com (a website for children): Meta Pixel collects the hashed email address when the user closes the newsletter dialog. In that case, sharing the email address is the exact opposite of the user's intent (attached screen capture).
prothomalo.com: clicking Back, Terms of Service or Privacy Policy links triggers the collection of the hashed email address, and (hashed) first and last names. (attached screen capture)
We hope you will recognize the disconnect between the described and actual behavior of Automatic Advanced Matching, and take necessary actions to address this issue.
[1] https://web.archive.org/web/20220325001706/https://www.facebook.com/business/m/signalshealth/optimize/automatic-advanced-matching
PS: This bug report is based on an academic study. If you need more technical details or have difficulties reproducing, contact [email address].
Disclosure to TikTok
A slightly modified version of the following text was submitted to conform to the word limit of the contact form.
Automatic Advanced Matching causes PII collection against user intent
When Automatic Advanced Matching is enabled, users' hashed email address is collected after clicking virtually any button on the page. That means TikTok Pixel collects hashed personal information, even when a user decides to abandon a form, and clicks a button or link to navigate away from the page.
According to TikTok’s help pages [1], Automatic Advanced Matching should trigger data collection when a user submits a form:
"Automatic Advanced Matching is programmed to recognize form fields, specifically capturing email and phone data. The capability is only triggered when the visitor submits information in an email or phone input field."
Unlike what is claimed in the linked help page, TikTok Pixel also sends hashed personal data when the user clicks links or buttons that in no way resemble a submit button. In fact, TikTok's JavaScript code in question doesn't even try to recognize submit buttons or listen to form "submit" events.
That means clicking almost any button or a link will trigger the data collection, regardless of the user’s intention. The issue is demonstrated in the following screen captures.
Dismissing the consent dialog on kiwico.com/login causes TikTok to collect the hashed email address from the login form: screen capture
Clicking a radio button triggers collection on redcross.ca: screen capture
On benjerry.com, hashed email collection is triggered by clicking on the email input field--not even a button or a link: screen capture
We hope you will recognize the disconnect between the described [1] and actual behavior of Automatic Advanced Matching, and take necessary actions to address this issue.
[1] https://archive.ph/BQ7nt#selection-3501.113-3501.217
PS: This bug report is a result of an academic study. If you need more technical details or have difficulties reproducing, please reach us at: [contact email]
1. Site rank: Tranco rank
2. Encoding: Encoding or hash algorithm used when sending the email
3. Website: Hostname of the initially visited website (before a potential redirection)
4. Request Domain: eTLD+1 of the leaky request URL
5. Page URL: (last_page) URL of the page where our crawler filled the email field
6. XPath: XPath of the filled email field
7. Id: ID of the email element that our crawler filled

(5, 6 & 7 identify the leaking page and input elements. They can be used for debugging, reproduction, etc.)
📣 Security Disclosures, GDPR Requests, and Leak Notifications
Our methods allow us to detect email and password leaks from clients to trackers, but what happens after the leaks reach third parties' servers is unknown to us. In order to better understand the server-side processing of collected emails, and to disclose cases of password collection, we reached out to more than a hundred first and third parties. In all cases, we sent the emails using one of the authors' university email address and real name, while also disclosing the nature of the research.
Password collection disclosure
Once again we note that we believe all password leaks to third parties mentioned below are incidental.
- Yandex, the most prominent tracker that collects users' passwords, has quickly responded to our disclosure and rolled out a fix to prevent password collection. We have also notified more than 50 websites where passwords were collected. Since the majority of the websites embedding Yandex were in Russian, we have enclosed a Russian translation of our message in the notification email, along with our message in English.
- Mixpanel released an update only two days after we disclosed the issue. With this change, even the users with outdated SDKs--which was the root cause of the problem--were protected from collecting passwords involuntarily.
- LogRocket, who collected passwords on publicize.co's login page, never replied to our repeated contact attempts, and the password leak remained on Publicize's website for more than ten weeks before it was finally fixed. We also enlisted the help of a contact at the Electronic Frontier Foundation, who tried calling LogRocket's phone number and emailing their privacy contact address and their cofounder, all to no avail. Our attempts to disclose the issue via LogRocket's chatbot also failed. We also contacted Publicize and have not heard back.
GDPR requests on email exfiltration to first & third parties
We reached out to 58 first and 28 third parties with GDPR requests. We avoided sending blanket data access requests to minimize the overhead for the entities who were obliged to respond to our GDPR requests. Instead, we asked specific questions about how the collected emails are processed, retained and shared.
Sample GDPR requests:
1. Sample GDPR request sent to third parties
To Whom It May Concern:
I and my colleagues from multiple European research institutions are investigating personal data collection on popular websites. During our experiments, we found that third-party.com collects email addresses from input fields before the user submits the form. We detected this behavior on several websites including first-party.com.
As an EU resident myself, I am requesting access to the following information pursuant to Article 15(1) GDPR:
- What are the processing purposes for collecting email addresses before form submission?
- What is the legal basis for collecting email addresses before form submission (see article 6(1) GDPR)?
- What is the retention period for the collected email addresses?
- Do you share email addresses with other parties, including your business partners?
The email address I'm sending this request for is "email-prefix+first-party.com@gmail.com" [in CC]. Please feel free to send a verification code to that address to make sure it belongs to me.
Please note that this is not a blanket data access request. Rather, our questions aim to bring transparency to the collection of email addresses that occurs before form submission. We purposefully limited our questions to minimize the overhead for you.
We would appreciate it if you could update us if you change the way email addresses are collected before form submission. We plan to describe responses to our disclosure in our academic paper, which is a collaboration between researchers from law and computer science disciplines.
We would like to stress that we have not captured data of your visitors, or any other user in our study. Please feel free to reach out if you need any clarification or additional information regarding our request.
Kind regards,
Name Surname
2. Sample GDPR request sent to first parties
To Whom It May Concern:
I and my colleagues from multiple European research institutes are investigating personal data collection on popular websites. During our experiments, we found that a third-party (third-party.com) on your website (first-party.com) collects visitors' email addresses from a form field even if the visitor doesn't submit the form, and doesn't give their consent. Technical details pertaining to the data collection can be found at the end of this email.
As an EU resident myself, I would like to request access to the following information pursuant to Article 15(1) GDPR:
- Were you aware that (XXX) collects email addresses from input fields on your website before website visitors clicked 'submit'?
- What are the processing purposes for collecting email addresses before form submission?
- What is the legal basis for collecting email addresses before form submission (see article 6(1) GDPR)?
- What is the retention period for email addresses collected before form submission?
Please note that this is not a blanket data access request. Rather, our questions aim to bring transparency to the collection of email addresses that occur before form submission. We purposefully limited our questions to minimize the overhead for you.
We would appreciate it if you could update us if you take any action to change the way that email addresses are collected on your website. We plan to describe websites' responses to our disclosure in our academic paper, which is a collaboration between researchers from law and computer science disciplines.
We would like to stress that we have not captured data of your visitors, or any other user in our study. Please feel free to reach out if you need any clarification or additional information regarding our request.
Kind regards,
Name Surname
Technical details of the email collection:
- The email address ("email-prefix+first-party.com@gmail.com") was collected on (first-party.com/inner-page) from the input field with the xpath="XXX".
- If you would like to verify that the collected email address belongs to me, feel free to send a verification code to ("email-prefix+first-party.com@gmail.com").
- The Unix timestamp of the visit during which my email was collected was (XXX).
- The email address was sent to (third-party.com) in Base64-encoded/SHA-256-hashed/... form in the following request: third-party.com/leak-endpoint-full-url, post-Data: ...
A sample of responses from first parties: 30/58 first parties replied
- Fivethirtyeight.com (via Walt Disney's DPO), trello.com (Atlassian), lever.co, branch.io and cision.com said they had not been aware of the email collection prior to form submission on their websites and have since addressed the issue.
- Marriott said that the information collected by Glassbox is used for purposes including customer care, technical support, and fraud prevention.
- Tapad, a cross-device tracking company on whose website we found an email leak, said that they have not offered their services to UK & EEA users since August 2021, and that they have deleted all data that they held from these regions.
- stellamccartney.com explained that the emails on their websites were collected before the submission due to a technical issue, which was fixed upon our disclosure. According to their response, the SaleCycle script that collected email addresses had not been visible to their cookie management tool from OneTrust.
A sample of responses from third parties: 15/28 third parties replied
- Taboola said in certain cases they collect users' email hashes before form submission for ad and content personalization; they keep email hashes for at most 13 months; and they do not share them with other third parties. Taboola also said they only collect email hashes after getting user consent. However, upon sharing our findings showing otherwise, they acknowledged and fixed the issues on the reported websites. Reportedly, the data collection was triggered before consent due to 1) websites outside the EU that do not recognize the GDPR, or 2) misconfiguration of consent management platforms.
- ZoomInfo said their “FormComplete” product appends contact details of users to forms when the user exists in ZoomInfo's sales and marketing database. They said the ability to capture form data prior to submission can be enabled or disabled by their clients.
- ActiveProspect said their TrustedForm product is used to certify consumers' consent to be contacted, for compliance with regulations such as the Telephone Consumer Protection Act in the US. They said data captured from abandoned forms is marked for deletion within 72 hours and is not shared with anyone, including the site owner.
Notification to websites with email leaks in the US crawl
We sent a friendly notification to these websites about the email exfiltration, rather than a formal GDPR request. We did not get any response from these 33 websites.
Sample notification sent to websites with email leaks in the US crawl:
To Whom It May Concern:
I and my colleagues from multiple European research institutes are investigating how and why email addresses are collected from online forms. During our investigations, we found that when (first-party.com) is visited from the US, a third-party (third-party.com) collects visitors' email addresses from a form field even if the form is abandoned (never submitted). Technical details to reproduce this issue can be found at the end of this email.
We wanted to inform you since we found that websites may not always be aware that their visitors' email addresses (or their hashes) are collected by third-party scripts, before submitting any forms.
We would appreciate it if you answer the following questions, but we note that you don't have an obligation to do so:
- Were you aware that third-party.com collects email addresses from input fields on your website before website visitors clicked 'submit'?
- What are the processing purposes for collecting email addresses before form submission?
- What is the retention period for email addresses collected before form submission?
Please note that this is not a data access request. We purposefully limited our questions to minimize the overhead for you. Rather, our questions aim to bring transparency to the collection of email addresses
that occur before form submission.
We would appreciate it if you could let us know if you take any action to change the way that email addresses are collected on your website. We plan to
describe websites' responses to our disclosure in our academic paper, which is a collaboration between researchers from law and computer science disciplines.
We would like to stress that we have not captured data of your visitors, or any other user in our study.
Please feel free to reach out if you need any additional information about our disclosure.
Technical details:
- The email address was collected on (first-party.com/inner-page) from the input field with the id="XXX" and the xpath="XXX", and was sent to (third-party.com). We will be happy to provide additional
technical details and a screen capture of the email collection if that will make it easy for you to verify the issue.
Kind regards,
Name Surname
⏩ Follow-up Crawls
We ran additional crawls between 25 and 31 January 2022 to collect fresh data about the behavior we studied. Here are some highlights:
- Many websites where we detected leaks to Taboola started to use modal consent banners, which prevent interaction with the pages before giving consent. We didn't detect any email leaks on these websites without interacting with the consent banners, which substantially reduced the number of leaks to Taboola.
- Adroll started showing a consent dialog, which reduced the number of leaks to Adroll to zero in the crawl. However, in a manual follow-up analysis we found several websites where Adroll collected hashed emails when the user simply clicked on the page. We shared these examples with Adroll, who then addressed the issues and said that their "work to troubleshoot and deploy a comprehensive solution is underway with testing and incremental roll out".
- We identified incidental password collection by FullStory, Hotjar, Decibel and Yandex. Upon our disclosures, FullStory and Hotjar swiftly fixed the issues, which were due to mistakes on the part of the first parties (e.g. copying the password values into other DOM elements' attributes). Yandex said they needed some time to solve the issue, which they eventually did.
- In a manual follow-up investigation, we found additional password leaks to LogRocket on the login form of the zoning.sandiego.gov website. We disclosed the issue to LogRocket, zoning.sandiego.gov, and opencounter.com (the provider of the web application running on zoning.sandiego.gov). We didn't get any response from those parties, but we verified that password leaks to LogRocket were eventually addressed.
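The first-party mistake mentioned above (password values copied into other DOM elements' attributes) can be illustrated with a minimal sketch. It assumes a serialized HTML snapshot of the page and a known test password; the class name `PasswordAttributeScanner` and the sample markup are hypothetical and are not taken from our crawler or from any session replay script.

```python
from html.parser import HTMLParser


class PasswordAttributeScanner(HTMLParser):
    """Hypothetical helper: flags attributes that contain a known test password."""

    def __init__(self, password: str):
        super().__init__()
        self.password = password
        self.findings = []  # list of (tag, attribute name) pairs

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Skip the password field itself; we only look at *other* elements.
        if tag == "input" and (attr_map.get("type") or "").lower() == "password":
            return
        for name, value in attrs:
            if value and self.password in value:
                self.findings.append((tag, name))


# Illustrative snapshot: the page copied the typed password into a data attribute.
snapshot = (
    '<form><input type="password" name="pw">'
    '<div data-last="hunter2-test"></div></form>'
)
scanner = PasswordAttributeScanner("hunter2-test")
scanner.feed(snapshot)
print(scanner.findings)  # [('div', 'data-last')]
```

A script that records DOM snapshots may mask password fields and still capture the copied value, because the copy sits in an element that is not marked as a password input.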
EU email leaks (Follow-up crawl)
1. Site rank: Tranco rank
2. Encoding: Encoding or hash algorithm used when sending the email
3. Website: Hostname of the initially visited website (before a potential redirection)
4. Request Domain: eTLD+1 of the leaky request URL
5. Third Party Entity: Owner of the tracker domain
6. Tracker Category: Category of the tracker domain; this information comes from Tracker Radar Collector's dataset
7. Blocklist: Blocklist that detected the tracker. WTM: whotracks.me, uBO: uBlock Origin, DDG: DuckDuckGo, DS: Disconnect
8. Page URL: (last_page) URL of the page where our crawler filled the email field
9. XPath: XPath of the filled email field
10. Id: ID of the email element that our crawler filled

(8, 9 & 10 identify the leaking page and input elements. They can be used for debugging, reproduction, etc.)
US email leaks (Follow-up crawl)
1. Site rank: Tranco rank
2. Encoding: Encoding or hash algorithm used when sending the email
3. Website: Hostname of the initially visited website (before a potential redirection)
4. Request Domain: eTLD+1 of the leaky request URL
5. Third Party Entity: Owner of the tracker domain
6. Tracker Category: Category of the tracker domain; this information comes from Tracker Radar Collector's dataset
7. Blocklist: Blocklist that detected the tracker. WTM: whotracks.me, uBO: uBlock Origin, DDG: DuckDuckGo, DS: Disconnect
8. Page URL: (last_page) URL of the page where our crawler filled the email field
9. XPath: XPath of the filled email field
10. Id: ID of the email element that our crawler filled

(8, 9 & 10 identify the leaking page and input elements. They can be used for debugging, reproduction, etc.)
Password leaks (Follow-up crawl)
EU | ||
---|---|---|
Rank | Website | Tracker domain |
20674 | clearbanc.com | fullstory.com |
28226 | nexmo.com | decibelinsight.net |
98254 | www.agrofy.com.ar | hotjar.com |
US | ||
---|---|---|
Rank | Website | Tracker domain |
20043 | www.nav.com | fullstory.com |
20674 | clearbanc.com | fullstory.com |
56040 | www.medikforum.ru | yandex.ru |
89002 | www.appjobs.com | hotjar.com |
98254 | www.agrofy.com.ar | hotjar.com |
🎇 LeakInspector: an add-on that warns and protects against personal data exfiltration
- We developed LeakInspector to help publishers and end-users to audit third parties that harvest personal information from online forms without their knowledge or consent. It has the following features:
- Blocks requests containing personal data extracted from web forms and highlights the related form fields by showing the add-on's icon.
- Logs technical details of the detected sniff and leak attempts to the console to enable technical audits. The logged information includes the value and XPath of the sniffed input element, the origin of the sniffer script, and details of the leaky request such as the URL and the POST data.
- LeakInspector also features a user interface where recent sniff and leak attempts are listed, along with the tracker domain, company and tracker category. The user interface module is based on DuckDuckGo’s Privacy Essentials add-on.
- ⚠️ The add-on is a proof-of-concept, and has not been tested at scale. Please use at your own discretion.
- Our attempts to publish the add-on on the Chrome Web Store failed, because new uploads of Manifest v2 add-ons are not accepted. For leak detection, our add-on requires access to network request details, which will be disallowed in Manifest v3.
Our attempts to publish the add-on for Firefox have failed.
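To illustrate the general idea of spotting a form value in an outgoing request, here is a minimal sketch. It is not the detection logic used by LeakInspector or our crawler; the function names (`candidate_encodings`, `find_leaks`) and the `tracker.example` URLs are hypothetical. It only checks the encodings and hashes listed in the findings tables (URL encoding, Base64, MD5, SHA-1, SHA-256), applied once to the exact value.

```python
import base64
import hashlib
from urllib.parse import quote


def candidate_encodings(email: str) -> dict[str, str]:
    """Return the email under a few single-step encodings and hashes."""
    raw = email.encode("utf-8")
    return {
        "unencoded": email,
        "url": quote(email, safe=""),
        "base64": base64.b64encode(raw).decode("ascii"),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def find_leaks(email: str, url: str, post_data: str = "") -> list[str]:
    """Name every encoding of `email` that appears in the URL or POST body."""
    haystack = f"{url}\n{post_data}"
    return [
        name
        for name, needle in candidate_encodings(email).items()
        if needle in haystack
    ]


email = "user@example.com"
digest = hashlib.sha256(email.encode("utf-8")).hexdigest()
print(find_leaks(email, f"https://tracker.example/collect?e={digest}"))  # ['sha256']
print(find_leaks(email, "https://tracker.example/collect?e=user%40example.com"))  # ['url']
```

Plain substring matching like this misses values that are nested in several layers of encoding, compressed, or embedded in a larger Base64 payload, so it should be read as a starting point for an audit rather than a complete detector.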
📁 Data and Code
Our crawl data (screenshots, request details, HTML sources), the source code for the crawler, analysis scripts, and the LeakInspector add-on can be found at the following links:
- The data from ten crawls performed between May 2021 and June 2021
- Main repository containing leak detection and analysis scripts
- LeakInspector add-on source code
- Crawler source code
☝️ Questions & Answers
Q: Why do you only focus on data collection prior to form submission?
A: We believe it is strongly against users’ expectations to collect personal data from web forms for tracking purposes prior to submitting a form. We wanted to measure this behavior to assess its prevalence.
Q: Do people really abandon forms that they start filling in?
A: According to a survey by The Manifest, 81% of the 502 respondents have abandoned forms at least once, and 59% abandoned a form in the last month.
Q: Who decides to collect form data before submission: websites or third parties?
A: Depending on the case, it may be the website that configures the third-party script to collect data before form submission, or it may be the third party’s default behavior.
Q: Are websites aware of third parties collecting form data before submission?
A: Some websites told us that they were not aware of this data collection and rectified the issue upon our disclosures.
Q: How are Meta and TikTok results different from the ones you present in the paper?
A: TikTok and Meta leaks require further interaction with the page after filling out a form. However, our screen captures show how easy it is to trigger this data collection.
Q: Did you share your findings with anybody prior to public release?
A: We have shared an earlier version of our paper with certain privacy authorities and browser vendors (Google, Mozilla, Brave, Apple and DuckDuckGo). Brave asked for our dataset, and encouraged us to reach out to the blocklist maintainers to add missing tracker domains. DuckDuckGo invited us to give a talk to present our findings. We had a call with a Mozilla engineer to discuss potential solutions to the issue. A European DPA asked for the list of websites/third parties from their country engaging in email exfiltration, which we have shared.
Q: In the screen captures you use a form (“SHA256 Online”) to convert the email address to a random-looking string. What is that for?
A: Some third parties collect email addresses after hashing them. Please see this blog post on what hashing is, and why hashing email addresses does not protect your privacy.
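As a rough illustration of why a hashed email still works as an identifier, the sketch below hashes the same address twice and then matches the result against a list of known addresses. The trimming and lowercasing step and the function name `email_hash` are assumptions made for this example; individual trackers may normalize differently or not at all.

```python
import hashlib


def email_hash(email: str) -> str:
    """SHA-256 hex digest of a trimmed, lowercased email (assumed normalization)."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The same address hashed on two unrelated sites yields the same identifier.
site_a = email_hash("Alice@example.com")
site_b = email_hash("alice@example.com ")
print(site_a == site_b)  # True

# Anyone holding a list of known addresses can map a hash back to its owner.
known = {email_hash(e): e for e in ["alice@example.com", "bob@example.com"]}
print(known.get(site_a))  # alice@example.com
```

Because the hash is deterministic and unsalted, it can be joined across sites and reversed by lookup for any address that already appears in a marketing or breach dataset.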
Q: How many distinct trackers were found to collect email addresses?
A: Emails (or their hashes) were sent to 174 distinct domains (eTLD+1) in the US crawl, and 157 distinct domains in the EU crawl.
Acknowledgments
We thank Alexei Miagkov, Arvind Narayanan, Bart Jacobs, Bart Preneel, Claudia Diaz, David Roefs, Dorine Gebbink, Galina Bulbul, Gwendal Le Grand, Hanna Schraffenberger, Konrad Dzwinel, Pete Snyder, Sergey Galich, Steve Englehardt, Vincent Toubiana, our shepherd Alexandros Kapravelos, SecWeb and USENIX Security reviewers for their valuable comments and contributions. The idea for measuring email exfiltration before form submission was initially developed with Steve Englehardt and Arvind Narayanan during an earlier study. Asuman Senol was funded by the Cyber-Defence (CYD) Campus of armasuisse Science and Technology. Gunes Acar was initially supported by a postdoctoral fellowship from the Research Foundation Flanders (FWO). The study was supported by CyberSecurity Research Flanders with reference number VR20192203.
Corrigendum
- 13 May 2022: The initial version of our website and paper incorrectly referred to TowerData as the owner of the rlcdn.com domain. The rlcdn.com domain belongs to LiveRamp. We've also reported this issue to Disconnect, which was one of the sources we used to identify domain ownership.