ava's blog

challenges around AI and the GDPR

Today, I once again met up with my mentor in data protection law and asked about his view on the compatibility of AI with several aspects of the General Data Protection Regulation (GDPR). The talk went really well and was super engaging, but I soon had to run home so I would not miss the GDPRhub meeting by noyb.eu!

Quick plug: Consider donating to noyb.eu or even becoming a member if you care about defending data protection and privacy from Meta, Google et al. :) I am volunteering as a Country Reporter for them!

The meeting included a really great presentation on exactly the topics I had discussed earlier, which motivated me to write a post about it, so here we are.

✦•······················•✦•······················•✦

data sets

The first issues start with the aggregation of the data that models are trained on.

There are, of course, a lot of small internal models being trained on a very limited dataset for a highly specific purpose. It is usually in the best interest of their developers to keep it that way, to not dilute the output and to keep the model small. What would scraping half the net do for a model that is supposed to help with a specific template document at work? But for the big ones (ChatGPT, Copilot, Grok and more) that are decidedly supposed to be all-rounders usable for anything, there is a clear incentive to vacuum up anything they can.

This directly contradicts the principle of data minimization. Article 5 lists several principles relating to the processing of personal data, and data minimization is the idea that processing should be adequate, relevant and limited to what is necessary in relation to the purposes. In practice, this means acquiring the least amount of personal data to get the job done.

For example: In recent rulings, it was declared unlawful for train companies in the EU[1] to demand your gender, email address and/or telephone number just to buy a train ticket, as this information is needed neither for a customer to use a train nor for the company to provide the service. Sharing it should be optional, at least. This principle usually also incorporates aspects of another principle: storage limitation. Personal data should only be stored as long as necessary, but how do you decide how long a name, address, telephone number and similar information is necessary for the purposes in a huge dataset? Depending on the methods, it might even be impossible to remove.
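Sticking with the ticket example, here is a minimal sketch of what data minimization could look like in code. The field names and validation logic are entirely made up, not any real operator's API; the point is simply that you collect the few fields the sale needs and refuse to store anything else.

```python
# Hypothetical sketch of data minimization in a ticket-purchase form.
# Field names and logic are illustrative, not a real railway operator's API.

REQUIRED = {"departure", "destination", "travel_date"}   # needed to sell the ticket
OPTIONAL = {"email"}                                      # only if the customer wants an e-ticket

def minimal_order(form: dict) -> dict:
    """Validate the order and keep only the data needed to provide the service."""
    missing = REQUIRED - form.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    # Anything we did not ask for (gender, phone number, ...) is dropped, not stored.
    return {k: v for k, v in form.items() if k in REQUIRED | OPTIONAL}

print(minimal_order({
    "departure": "Köln",
    "destination": "Berlin",
    "travel_date": "2025-10-01",
    "gender": "f",   # discarded: not needed for the customer to use a train
}))
```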

Furthermore, there's the principle of purpose limitation. Processing needs a stated purpose that is specific, explicit and legitimate. With limited models, this may be easier, as their use is very targeted; but it is a point of contention in legal discourse whether something as vague as "AI model training" (or similar purposes) is specific enough, and whether companies can rely on an exemption for their models based on scientific research purposes.

Also: It is likely that the data they are scraping was put out there or acquired for a different purpose. If I consent to having my picture taken and put on the company website to promote our product and drive customer engagement, I am not consenting to my image being used for AI training five years down the line, for example. A change of purpose requires informing the data subject, maybe even renewed consent, but that is a mixed bag.

It is relatively easy to inform users and get their consent if they have an account on your platform, as you can show them a note about it and let them make a decision - but what about data scraping outside of platforms? How can people outside of the services that Meta, Microsoft, X or Google own get a chance to be informed about, or consent to, their personally identifiable data being used to train AI?

The GDPR handles information requirements in two main ways:

Article 13 applies when companies (usually referred to as "controllers") obtain data and consent directly from you, and Article 14 applies when the controller obtains your data indirectly (for example, by using someone else's datasets, or by getting your data transmitted to them by the company that actually obtained it directly from you). So in the case of the broad scraping taking place, companies are still technically required to inform you under Article 14.

The problem is that this is obviously not feasible in practice. If they scrape your name and have nothing else, how will they contact you? How are their employees supposed to comb through terabytes of data for personally identifiable information?[2] How would they detect sensitive data that comes with extra requirements, like data related to health, sexual identity, ethnic origin, religion and more (Article 9), or the data of children (Article 8), and fulfill them? How are they supposed to contact millions of people, and track who has opted out and who has opted in, who hasn't replied, and so on?
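To illustrate why that search is so hopeless, here is a deliberately naive sketch of my own (not anything a real provider is known to run): it only catches machine-readable identifiers and misses everything else that makes text about a person.

```python
# Naive sketch: pattern-match the obvious identifiers in scraped text.
# Real names, health data, religious affiliation etc. have no reliable regex at all.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s/().-]{7,}\d"),
    "iban":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def flag_identifiers(text: str) -> dict:
    """Return the obvious machine-readable identifiers found in a chunk of text."""
    hits = {label: pattern.findall(text) for label, pattern in PATTERNS.items()}
    return {label: found for label, found in hits.items() if found}

sample = "Contact Jane via jane.doe@example.org or +49 221 1234567."
print(flag_identifiers(sample))
# Catches the email and the phone number, but has no idea that "Jane" is a person,
# let alone anything about the Article 9 categories - and this is one sentence,
# not terabytes.
```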

Article 14 acknowledges this issue in section 5, which says that they don't have to inform you if doing so would prove impossible or would require disproportionate effort. This effectively means that most big companies training AI on extremely large datasets scraped from the entire net are off the hook when it comes to telling data subjects (you, as the affected person) about it.

Consent is similarly tricky. Article 7 of the GDPR sets conditions for consent: the controller (company) needs to be able to demonstrate that you gave consent[3], and the request for consent must not be misleading, must be clearly distinguishable from other matters, accessible, easy to understand and not coerced, and is not binding if it fails these standards. You have the right to withdraw your consent at any time. Again, this might be fulfilled in your user account settings, but how will you think of withdrawing consent if you have never even been informed that you are affected, and never had a chance to consent to begin with?

By law, and by the way the tech works, withdrawing consent only covers future processing and will not affect past processing - and not all processing needs to have its legal basis in consent.[4]

legal basis

Article 6 covers the legal bases for any processing of personal data in the EU or of people in the EU. In practice, that means that usually one or more of Article 6 a) - f) will apply and be used as the legal basis. Only one of them needs to be fulfilled for the processing to be lawful, and consent is only one of them.

a) handles you giving consent for one or more specific purposes (blanket consent doesn't count!).

b) is for when the processing is necessary to fulfill a contract - think of ordering from an online shop, which needs to give your address to the shipping company.

c) covers legal obligations; for example, back when restaurants were required to ask for your name and contact info to comply with Covid-19 measures the government had put in place.

d) is for the niche cases where processing your data is necessary to protect your vital interests or someone else's. A case I can think of is security camera footage in which you appear being used to help find a thief who stole from you or a neighbor.

e) is a bit vague, covering what's "necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller".

f) is the catch-all, saying processing is necessary for a legitimate interest pursued by the company or a third party, unless that interest is overridden by the interests and fundamental rights and freedoms of the data subject (you as the affected person), especially if the data subject is a child.

As covered above, consent can be acquired where there is a user account, but not otherwise, so for the chunk of data scraped from the web outside of social media platforms, that falls away. For example, if they scrape my blog, I did not give consent, and I also do not have a contract with them. They have no legal obligation to scrape it, the scraping doesn't protect me or anyone else, it is likely not in the public interest, and they aren't an official authority either. That leaves the catch-all: legitimate interest.

legitimate interest

'Legitimate interest' can only be claimed as a legal basis if the processing violates no other part of the GDPR, so that it cannot be used as a loophole; but arguably, as seen above, there are at least a lot of open questions, if not downright violations of the law as it stands right now. Additionally, the sensitive data I already mentioned (from Article 9) cannot be processed under legitimate interest. That would be a chunk of data without a legal basis, then.

It's also a delicate matter to discuss in general - if you believe AI is a superpower that will significantly advance humanity, you would surely argue that this is a legitimate interest to pursue, and that the rights and freedoms of data subjects don't trump it. But how do you prove this speculation about the future?

What if we focus solely on more realistic interests, like the economic incentive of companies (the legitimate interest in keeping up with competition or making a lot of money) or more vague reasons like "providing a high quality tool serving our users and helping in research and development"? Would that be enough? How would the company argue that, for example, my blog data is necessary for this legitimate interest?

But aside from that, there are several things that could impact the rights and freedoms of you, the data subject.

impact on you

There is the problem that the more data someone acquires, the more likely it is that data about you is in it. It doesn't have to be personal data right away, but it is recognized in the discourse that when enough data is available, individual pieces can link up and become personally identifiable data that is no longer anonymous or unrelated.

As an easy-to-understand example: An address by itself is not personal data, as it cannot be linked to anyone, but as you slowly fill in occupation, hair color, car brand and online orders, it becomes more and more likely that you can deduce which of the residents it is. A username like "RickRoller2000" with no other information is not personally identifiable data, but if you suddenly also have the actual name of the person behind the account and link it to their username, the username becomes personally identifiable data.
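Here is a toy sketch of that narrowing-down effect; the residents and their attributes are entirely made up, and the point is only that attributes which look anonymous on their own identify a person once combined.

```python
# Toy illustration: no single attribute identifies a resident,
# but their combination quickly narrows the candidates down to one person.

residents = [
    {"name": "A.", "occupation": "teacher", "hair": "brown", "car": "VW"},
    {"name": "B.", "occupation": "nurse",   "hair": "brown", "car": "VW"},
    {"name": "C.", "occupation": "teacher", "hair": "blond", "car": "Opel"},
    {"name": "D.", "occupation": "teacher", "hair": "brown", "car": "Opel"},
]

def narrow_down(people, **known):
    """Keep only the residents matching every attribute learned so far."""
    return [p for p in people if all(p[key] == value for key, value in known.items())]

print(len(narrow_down(residents, occupation="teacher")))                # 3 candidates left
print(len(narrow_down(residents, occupation="teacher", hair="brown")))  # 2 candidates left
print(narrow_down(residents, occupation="teacher", hair="brown", car="VW"))
# -> exactly one person: the "anonymous" attributes have become identifying
```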

That poses the issue that data which previously did not fall under the GDPR can suddenly switch to being covered by it. This adds complexity to everything I have mentioned so far, and that also affects your rights.

Additionally, there are inherent risks and difficulties in enforcing your rights (specific to the product) that need to be taken into account when weighing legitimate interest against rights and freedoms. Things like:

regulatory difficulties

The GDPR is from 2016, came into effect in 2018, and was deliberately worded in a tech-neutral way to still welcome innovation and fit around a variety of products and services. However, it could hardly have predicted this. What's easy to enforce with phones, computers or social media is a lot harder with training large language models or image generators. That raises the question - should the GDPR be changed? I personally hate that approach, but it is being discussed.

We had hopes for the AI Act, and we could see some positive effects from the Digital Markets Act. However, the AI Act concentrates its regulations and requirements on high-risk AI, with basically none for low-risk systems. It seems ChatGPT and Co. would very likely fall under low risk, meaning it doesn't even meaningfully regulate the most popular and most powerful huge models.

On the other hand, there are Data Protection Authorities with a very supportive "make it happen" approach to it all, which has led to the Superior Court Cologne ruling very favorably on Meta AI opting everyone in by default and offering only a short time to opt out. The court said that Article 5 section 2 b) DMA (which says the company may not combine personal data from the relevant core platform service with personal data from further core platform services, from other services provided by them, or from third-party services) is not applicable. They also based this on the data being public, on Meta having legitimate interests that override the rights of users, on the purpose being specified enough, on the data and its use being low risk, and on there being no less intrusive way for Meta to do this. They consider six weeks a long enough time to opt out.

That decision is definitely regarded as horseshit by everyone I talked to about this stuff, but it nonetheless happened (even if it will be challenged). It is unfortunately a reality that there is intense lobbying, a strongly political arms race and fear that Europe will fall further behind in tech innovation, and that definitely colors the approaches and decisions of regulatory bodies. The Hamburg DPA also apparently said that even if the data was unlawfully collected, that doesn't mean the hosting and use of the model is unlawful, which further muddies the waters and is hard to justify.

As it is now, something has to give, and it is a little scary to see which side will likely give in.

Now excuse me while I collapse on the sofa; I wrote this for four hours straight right after work and the meeting, and I am so tired.

Reply via email
Published 23 Sep, 2025

  1. I know of specific cases in France and Germany (x) (x)

  2. And that is not all, as I will get to the issue of anonymous data turning into personally identifiable data later on.

  3. We're getting to some exemptions to that later.

  4. You also have a right to object based on Article 21, which means the company needs to stop processing that data unless they demonstrate compelling legitimate reasons for the processing which override the interests, rights and freedoms of you as the affected person. But realistically, how do they stop processing your data if it is in a large dataset that increases daily, is used for training again and again, and in some way arguably is part of what decides the quality of the output?

  5. No idea how specifically it is handled internally there, but I've seen successful corrections or the model suddenly not giving an answer to a prompt that previously worked and showed lies.

#2025 #data protection