The AI Ethics Brief #157: DeepSeek, Data Purges, and the Future of AI Governance.
How Open Should AI Be? What Happens When Data Disappears? And Why It All Matters.
Welcome to The AI Ethics Brief, a bi-weekly publication by the Montreal AI Ethics Institute. Stay informed on the evolving world of AI ethics with key research, insightful reporting, and thoughtful commentary. Learn more at montrealethics.ai/about.
In This Edition:
🚨 Here’s Our Take on What Happened Recently
DeepSeek’s R1 Model Shakes Up the AI Industry
💬 Your AI Ethics Question, Answered:
How Should AI Models Be Developed?
🔎 One Question We’re Pondering:
How should societies navigate the intersection of government-controlled data, AI training, and public accountability?
💭 Insights & Perspectives:
AI Governance on the Ground: Canada’s Algorithmic Impact Assessment Process and Algorithm has evolved
🔬 Research Summaries:
Self-Improving Diffusion Models with Synthetic Data
The Bias of Harmful Label Associations in Vision-Language Models
📄 Article Summaries:
OpenAI makes ChatGPT Gov available - THE DECODER
Meta AI in panic mode as free open-source DeepSeek gains traction and outperforms for far less - TechStartups
Google to rename Gulf of Mexico to “Gulf of America” - TechCrunch
📖 From Our Living Dictionary:
What do we mean by “Frontier AI Models”?
🌐 From Elsewhere on the Web:
‘Godfather of AI’ predicts it will take over the world - LBC
DeepSeek and China’s AI power move - CBC Front Burner with Jayme Poisson
What The Hell Is DeepSeek? - Better Offline with Ed Zitron
💡 In Case You Missed It:
Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models
🚨 Here’s Our Take on What Happened Recently
Surprise, surprise... We begin this edition of The AI Ethics Brief #157 with a deep dive into DeepSeek. We distill key points from our perspective at MAIEI and link to several excellent takes below. And, as always, we leave you with big-picture AI ethics questions to consider.
DeepSeek’s R1 Model Shakes Up the AI Industry
DeepSeek, a Chinese AI startup founded by Liang Wenfeng, who previously co-founded High-Flyer, a leading Chinese quantitative hedge fund with $8 billion in assets under management (AUM), has emerged as a significant disruptor in AI. As Karen Hao notes, DeepSeek challenges long-held assumptions about the cost, scale, and infrastructure needed to build frontier AI models.
Its DeepSeek-R1 model rivals OpenAI’s o1 model across math, coding, and reasoning tasks—all while reportedly being trained at a fraction of the cost (~$5.6M vs. OpenAI’s estimated $100M+). However, these numbers have been met with skepticism. Ben Thompson clarifies that the $5.6M only covers the final training run, not the full development cost. Similarly, Anthropic’s Dario Amodei notes that DeepSeek’s total spend as a company, rather than just model training, is not far from that of U.S. AI labs. SemiAnalysis, an independent research and analysis company, is confident that DeepSeek’s GPU investments account for more than $500M, even after considering export controls.
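To make the scope of the headline number concrete, here is a back-of-the-envelope sketch. It assumes the inputs usually cited from the DeepSeek-V3 technical report (about 2.788 million H800 GPU-hours at an assumed rental price of $2 per GPU-hour); neither figure appears in this newsletter, and the result covers only a single final training run, not research, staff, failed experiments, or hardware purchases.

```python
# Back-of-the-envelope estimate of a final training run's rental cost.
# Assumed inputs (not from this newsletter): GPU-hours and hourly price
# as commonly cited from the DeepSeek-V3 technical report.
GPU_HOURS = 2_788_000          # assumed total H800 GPU-hours
PRICE_PER_GPU_HOUR_USD = 2.0   # assumed rental price per GPU-hour


def training_run_cost(gpu_hours: float, price_per_hour: float) -> float:
    """Return the rental-equivalent cost of a training run in USD."""
    return gpu_hours * price_per_hour


cost = training_run_cost(GPU_HOURS, PRICE_PER_GPU_HOUR_USD)
print(f"Estimated final-run cost: ${cost / 1e6:.2f}M")  # $5.58M
```

Under these assumptions the estimate lands at roughly the ~$5.6M quoted above, which is why critics stress that it is a rental-equivalent figure for one run rather than a company-wide budget.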
For years, AI development was seen as an arms race: more GPUs and larger data centers meant greater advantage. DeepSeek’s R1 suggests a different path, one focused on efficiency and algorithmic improvements rather than sheer computational power. The efficiency case is strengthened by the possibility of using DeepSeek’s open-weight model to bootstrap other powerful base models into competent reasoners.
The impact was immediate. DeepSeek skyrocketed to the No. 1 spot in app stores worldwide, surpassing OpenAI’s ChatGPT. NVIDIA stock dropped 17%, and the AI scaling laws that defined the past five years are suddenly up for debate.
DeepSeek’s Timeline of Disruption
December 26, 2024 – DeepSeek-V3 released: A general-purpose LLM with 671 billion parameters, built using a Mixture-of-Experts (MoE) architecture, incorporating innovations like multi-token prediction and auxiliary-loss-free load balancing. DeepSeek-V3 competes with OpenAI's GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Meta's Llama 3.1.
January 20, 2025 – DeepSeek-R1 released: A reasoning-first model optimized for complex chain-of-thought (CoT) tasks, significantly outperforming V3 in reasoning while being more resource-efficient. DeepSeek-R1 competes with OpenAI's o1.
January 27, 2025 – DeepSeek’s Janus Pro released: An open-source multimodal AI model featuring advanced text-to-image generation and visual understanding, designed to rival OpenAI’s DALL-E 3.
Each release signals DeepSeek's aggressive push to challenge AI incumbents—not just in performance but also in cost efficiency and accessibility. These moves highlight DeepSeek’s strategy of rapidly deploying competitive models at lower costs, positioning itself as OpenAI's primary challenger.
DeepSeek-R1 vs. DeepSeek-V3: What’s the Difference?
DeepSeek-R1 and DeepSeek-V3 serve different purposes:
DeepSeek-R1: A reasoning model trained primarily with large-scale reinforcement learning (RL), excelling in math, logic, and problem-solving. Its precursor, DeepSeek-R1-Zero, was trained with RL alone, without supervised fine-tuning (SFT); R1 itself adds a small supervised "cold-start" stage.
DeepSeek-V3: A general-purpose LLM built on an MoE architecture, designed for broader natural language processing (NLP) tasks like coding, summarization, and knowledge retrieval, with significant safeguards against content critical of the Chinese Communist Party.
Key Differences:
R1 specializes in chain-of-thought (CoT) reasoning and alignment using RL.
V3 is broader in scope but weaker in complex reasoning.
R1 has lower GPU resource demands, making it more efficient than V3.
V3 leverages multi-token prediction and MoE routing to optimize efficiency.
Choosing a model:
Need a reasoning powerhouse? Go with R1. Need a generalist LLM? V3 is your choice.
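For readers unfamiliar with the "MoE routing" mentioned above, the sketch below shows generic top-k expert routing: a router scores every expert for a token, only the k best experts run, and their outputs are blended. It is a minimal illustration under stated assumptions, not DeepSeek's implementation; the toy experts, the router scores, and the function names are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_layer(x, experts, router_scores, k=2):
    """Combine the outputs of only the selected experts for input x."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts"; real experts are feed-forward sub-networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

selected = route_top_k(scores, k=2)
output = moe_layer(3.0, experts, scores, k=2)
print(selected, output)
```

Here only experts 1 and 3 are evaluated, so half of the toy layer's parameters stay idle for this token. That sparsity is the general reason an MoE model can have a very large total parameter count while using far fewer parameters per token.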
A New Front in the US-China AI Cold War
DeepSeek’s rise has also reignited debate over U.S. export controls. While these measures were designed to curb China's AI advancements, DeepSeek's success suggests that such constraints may have instead fueled innovation. Despite limited access to high-performance computing chips, DeepSeek has developed competitive models, exposing potential flaws in the current U.S. export control framework.
The Brookings Institution identifies two major weaknesses in the U.S. strategy:
A robust black market for controlled computing chips.
The ability of companies in restricted regions to remotely access computing resources, bypassing the need for physical chip possession.
As a result, U.S. officials are now considering tighter restrictions and are concerned that DeepSeek's cost-effective approaches could reshape the global AI landscape.
Dario Amodei, CEO of Anthropic, acknowledges DeepSeek's technical achievements but also warns of geopolitical risks:
“Given my focus on export controls and US national security, I want to be clear on one thing. I don't see DeepSeek themselves as adversaries and the point isn't to target them in particular. In interviews they've done, they seem like smart, curious researchers who just want to make useful technology.
But they're beholden to an authoritarian government that has committed human rights violations, has behaved aggressively on the world stage, and will be far more unfettered in these actions if they're able to match the US in AI. Export controls are one of our most powerful tools for preventing this, and the idea that the technology getting more powerful, having more bang for the buck, is a reason to lift our export controls makes no sense at all.”
This raises a critical question:
Will AI remain concentrated among a few dominant entities, or will more companies find ways to build frontier models without hyperscaler-level budgets?
The Open-Source Dilemma: Is DeepSeek Really Open?
DeepSeek markets its models as “open,” releasing both model weights and architecture. However, it has not disclosed its training data.
As Timnit Gebru notes,
Friends, for something to be open source, we need to see
1. The data it was trained and evaluated on
2. The code
3. The model architecture
4. The model weights.
DeepSeek only gives 3, 4. And I'll see the day that anyone gives us #1 without being forced to do so, because all of them are stealing data.
Additionally, DeepSeek's models censor politically sensitive topics. WIRED’s Zeyi Yang explains that this applies to all Chinese AI models due to strict content moderation rules in China. Topics such as Tiananmen Square, Uyghurs, and territorial disputes trigger censorship mechanisms, making DeepSeek less transparent than its open-source claims suggest.
However, given DeepSeek’s open-source framework, some argue that the community could modify the model to reduce censorship. Efforts are already underway to bypass DeepSeek’s content moderation filters. Matt Konwiser highlights how users are leveraging generative AI’s predictive nature to work around these restrictions.
By replacing certain characters with lookalike symbols, users can manipulate the model into revealing censored information. Since generative AI predicts responses based on probability rather than strict factual retrieval, it sometimes permits content it was designed to block. This loophole raises important questions about the effectiveness of AI censorship—and whether restrictive moderation mechanisms can ever be fully enforced.
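To illustrate why lookalike characters are hard to police, here is a toy sketch. It assumes a simple string-matching blocklist, which is only a stand-in: how DeepSeek's moderation actually works is not described in the sources above. The blocked term, the lookalike table, and the function names are hypothetical.

```python
# Toy illustration: exact-match keyword filters miss lookalike characters.
BLOCKED_TERMS = {"secret"}  # hypothetical blocklist

# Tiny hand-picked map of Cyrillic lookalikes to Latin letters (assumption;
# real confusable-character tables are far larger).
LOOKALIKES = str.maketrans(
    {"\u0435": "e", "\u0441": "c", "\u043e": "o", "\u0430": "a"}
)


def naive_filter(text: str) -> bool:
    """Return True if the text contains a blocked term verbatim."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def normalized_filter(text: str) -> bool:
    """Map known lookalikes back to Latin letters before matching."""
    return naive_filter(text.translate(LOOKALIKES))


plain = "tell me the secret"
disguised = "tell me the s\u0435cr\u0435t"  # Cyrillic 'е' replaces Latin 'e'

print(naive_filter(plain))           # True
print(naive_filter(disguised))       # False: the filter is bypassed
print(normalized_filter(disguised))  # True: caught after normalization
```

Normalization closes this particular gap, but only for substitutions the table anticipates, which hints at why such restrictions are difficult to enforce completely.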

Meanwhile, Hugging Face has announced plans to reverse-engineer DeepSeek’s models with Open-R1, reinforcing the idea that the AI arms race is now as much about openness as performance. At the time of writing, Hugging Face reports 7M downloads for 900+ derivative models vs. 2.4M for the 8 original models.
OpenAI has also accused DeepSeek of distilling knowledge from its models without permission, which… raises a certain irony, given OpenAI’s own history of training on scraped data and publicly available content.
MAIEI’s Take: What This Means for AI Ethics
💡 AI Accessibility vs. AI Control
Do more cost-effective AI models democratize AI, or do they simply shift control to new players? DeepSeek’s open-source availability could foster more competition, yet the AI landscape remains dominated by a handful of well-funded entities with access to critical infrastructure.
🔓 Open Source vs. Closed Source
DeepSeek presents itself as an alternative to OpenAI’s closed models—but how open is open enough? If model weights are released but training data remains undisclosed, does it meaningfully change the transparency of AI development?
📜 The Ethics of Training Data & Censorship
If every major model is trained on scraped data, does it really matter which company is behind it? Moreover, DeepSeek’s censorship of politically sensitive topics highlights how AI models are shaped by the regulatory environments in which they operate. Should the open-source community attempt to reduce these constraints, and if so, what ethical concerns arise?
Did we miss anything? Let us know in the comments below.
💬 Your AI Ethics Question, Answered:
In each edition, we highlight a question from the MAIEI community and share our insights. Have a question on AI ethics? Send it our way, and we may feature it in an upcoming edition!
Here are the results from the previous edition for this segment:
How Should AI Agents Be Regulated?
Our latest informal poll (n=27) reveals that Registering AI Agents is the most preferred approach to AI regulation, with 52% of respondents supporting mandatory registration to enhance transparency and traceability in AI deployments. This aligns with growing concerns over accountability and the risks of unregulated AI systems.
The rise of AI agents, including the release of Operator—OpenAI’s first AI agent capable of acting autonomously on the web—further amplifies the need for regulation. Sam Altman’s World Project is also exploring ways to link certain AI agents to people’s online personas, letting other users verify that an agent is acting on a person’s behalf.
Beyond registration, 19% believe developers and deployers should be held directly accountable, reinforcing that those building AI systems must take responsibility for their impact.
Meanwhile, third-party audits (15%) and technical safeguards (15%) are emerging as complementary governance tools, though respondents do not see them as sufficient on their own.
Notably, 0% supported minimal regulation, signaling a consensus that AI systems require structured oversight rather than a laissez-faire approach.
Key Takeaways:
AI Agent registration leads as the preferred approach, reflecting a need for more transparency and oversight.
Holding developers/deployers accountable is gaining traction, emphasizing direct responsibility for AI risks.
Third-party audits and technical safeguards are seen as useful but not sufficient on their own.
Minimal regulation received no support, reinforcing the need for stricter AI governance.
As AI agents become more embedded in decision-making processes across industries, the challenge remains how to balance regulation, innovation, and accountability effectively.
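A note on the numbers: the reported shares for the agent-regulation poll (52%, 19%, 15%, 15%, 0%) add up to 101%, which is a rounding artifact rather than an error. The sketch below recovers the whole-number vote counts consistent with n=27. The counts are inferred from the rounded percentages, not reported by MAIEI, and the option labels are shortened for display.

```python
# Recover the whole-number vote counts implied by rounded percentages.
# n and the percentages come from the text; the counts are inferred.
N_RESPONDENTS = 27
REPORTED_PCT = {
    "registration": 52,
    "developer/deployer accountability": 19,
    "third-party audits": 15,
    "technical safeguards": 15,
    "minimal regulation": 0,
}


def implied_counts(pct: int, n: int) -> list[int]:
    """All vote counts k (0..n) whose share rounds to the reported percent."""
    return [k for k in range(n + 1) if round(100 * k / n) == pct]


counts = {opt: implied_counts(pct, N_RESPONDENTS) for opt, pct in REPORTED_PCT.items()}
for opt, ks in counts.items():
    print(f"{opt}: {REPORTED_PCT[opt]}% -> {ks} vote(s)")

total_votes = sum(ks[0] for ks in counts.values())
total_pct = sum(REPORTED_PCT.values())
print(total_votes, total_pct)  # 27 101
```

Each percentage matches exactly one count (14, 5, 4, 4, and 0), and those counts sum to 27. With a sample this small, a single vote moves a share by nearly four percentage points, so the ranking is better read as indicative than precise.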
How Should AI Models Be Developed?
As AI models advance, the debate over open vs. closed development raises key questions about transparency, security, and accessibility.
Should AI models be fully open-source by default, or should companies take a hybrid approach, keeping some components—like training data—private? Some argue for closed-source AI to prevent misuse, while others support regulated access for vetted researchers and partners.
Or should AI companies have full control over openness, with minimal restrictions?
Please share your thoughts with the MAIEI community:
🔎 One Question We’re Pondering:
How should societies navigate the intersection of government-controlled data, AI training, and public accountability?
The role of governments in shaping public access to information—whether on health, technology, or governance—has never been more critical.
The removal of several webpages and public health datasets from the Centers for Disease Control and Prevention (CDC), including datasets on LGBTQ+ health, racial and ethnic disparities, and reproductive health, raises significant concerns about transparency and accountability. The deletion of datasets related to adolescent health and HIV prevention weakens researchers’ and policymakers’ ability to track long-term trends and inform evidence-based decisions. This data loss could hinder public health strategies aimed at vulnerable populations and limit journalists’ ability to report on critical issues. We find this trend very alarming.
At the same time, AI models trained on publicly available datasets—including those used in healthcare, governance, and social research—are shaped by what information is accessible.
If key datasets are erased, will AI systems trained in the future lack knowledge of these suppressed topics? And as governments increasingly rely on AI-driven decision-making, what happens when the data informing these systems is selectively curated, censored, or removed altogether?
When vital datasets disappear, it doesn’t just impact today’s research—it reshapes what AI systems learn and how they make decisions in the future. If AI is trained on an incomplete or biased dataset, it risks reinforcing blind spots in governance, public health, and social policy. As AI becomes more embedded in decision-making, ensuring transparency, accountability, and open access to essential information is more important than ever.
Further reading:
Please share your thoughts with the MAIEI community:
💭 Insights & Perspectives:
AI Governance on the Ground: Canada’s Algorithmic Impact Assessment Process and Algorithm has evolved
Canadian government agencies, including its employment and transportation agencies, the Department of Veterans Affairs, and the Royal Canadian Mounted Police (RCMP), have evaluated the automated systems they use according to the country’s Algorithmic Impact Assessment process, or AIA. However, Canada’s AIA process itself has evolved. The report excerpted here, part of the World Privacy Forum’s AI Governance on the Ground Series, reviews key elements of Canada’s AIA evolution and its impacts on stakeholders.
To dive deeper, read the full report summary here.
❤️ Support Our Work
Help us keep The AI Ethics Brief free and accessible for everyone by becoming a paid subscriber on Substack for the price of a coffee or making a one-time or recurring donation at montrealethics.ai/donate
Your support sustains our mission of Democratizing AI Ethics Literacy, honours Abhishek Gupta’s legacy, and ensures we can continue serving our community.
For corporate partnerships or larger donations, please contact us at support@montrealethics.ai
🔬 Research Summaries:
Self-Improving Diffusion Models with Synthetic Data
The increasing reliance on synthetic data to train generative models risks creating a feedback loop that degrades model performance and biases outputs. This paper introduces Self-IMproving diffusion models with Synthetic data (SIMS), a novel approach to utilize synthetic data effectively without incurring Model Autophagy Disorder (MAD) or model collapse, setting new performance benchmarks and addressing biases in data distributions.
To dive deeper, read the full summary here.
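The feedback loop described in this summary can be illustrated with a deliberately simple toy model. This is not the SIMS method, and it is far simpler than a diffusion model: it assumes each generation fits a Gaussian by maximum likelihood to n samples drawn from the previous generation's fit. Because the maximum-likelihood variance estimate is biased low by a factor of (n - 1) / n, the expected variance shrinks every round. The function name and parameter values are hypothetical.

```python
# Toy model of a self-consuming training loop (not the SIMS method).
# Each generation fits a Gaussian by maximum likelihood to n samples drawn
# from the previous generation's fit. The MLE variance estimate is biased
# low by a factor (n - 1) / n, so the expected variance shrinks each round.

def expected_variance(initial_variance: float, n_samples: int, generations: int) -> list[float]:
    """Expected fitted variance after each generation of self-training."""
    shrink = (n_samples - 1) / n_samples
    return [initial_variance * shrink ** g for g in range(generations + 1)]


trajectory = expected_variance(initial_variance=1.0, n_samples=100, generations=200)
print(f"after 200 generations: {trajectory[-1]:.3f} of the original variance")
```

Even with a tiny per-round bias, the diversity of the fitted distribution decays geometrically, which conveys the intuition behind model collapse when models train on their own outputs without fresh real data.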
The Bias of Harmful Label Associations in Vision-Language Models
Despite the remarkable performance of foundation vision-language models, the shared representation space for text and vision can also encode harmful label associations detrimental to fairness. While prior work has uncovered bias in vision-language models' (VLMs) classification performance across geography, work has been limited along the important axis of harmful label associations due to a lack of rich, labeled data. In this work, we investigate harmful label associations in the recently released Casual Conversations datasets containing more than 70,000 videos. We study bias in the frequency of harmful label associations across self-provided labels for age, gender, apparent skin tone, and physical adornments across several leading VLMs. We find that VLMs are 4 to 7 times more likely to harmfully classify individuals with darker skin tones. We also find that scaling transformer encoder model size leads to higher confidence in harmful predictions. Finally, we find that improvements on standard vision tasks across VLMs do not address disparities in harmful label associations.
To dive deeper, read the full summary here.
📄 Article Summaries:
OpenAI makes ChatGPT Gov available - THE DECODER
What happened: OpenAI released its government-specific version of ChatGPT called ChatGPT Gov, which government departments can deploy through Microsoft Azure. The release helps ensure that the AI model meets stringent government privacy requirements, making it even more accessible to OpenAI’s 90,000 government users across 3,500 agencies.
Why it matters: While a step in the right direction towards making government operations more efficient, it also signals a strong move towards further entwining government operations with Silicon Valley solutions while their CEOs continue to try to garner favor with the new President.
Between the lines: The further proliferation of OpenAI-designed products in government further risks ‘system lock-in,’ whereby government operations become so reliant on OpenAI products that no alternatives, which may even be better, are considered.
To dive deeper, read the full article here.
Meta AI in panic mode as free open-source DeepSeek gains traction and outperforms for far less - TechStartups
What happened: DeepSeek reportedly outperformed OpenAI’s and Meta’s top models at a fraction of the cost, leaving both companies and the wider tech ecosystem wondering how. The article puts particular emphasis on Meta's supposed frenzy to “copy anything and everything” it can, especially given fears about justifying the cost of its Llama model.
Why it matters: Silicon Valley executives were convinced that the only sure way to guarantee performance was to build bigger, more powerful models with more data. However, DeepSeek’s achievements have blown a big hole in this narrative. Now, tech leaders face a choice: do they question the validity of DeepSeek’s results to justify their view, or do they try to replicate them?
Between the lines: DeepSeek, whose total costs have yet to be verified, has set a new benchmark in LLM development. For the time being, it will serve as the standard for building efficient LLMs. This will almost certainly irk US officials while also showing the world that restrictions on development don’t always harm innovation.
To dive deeper, read the full article here.
Google to rename Gulf of Mexico to “Gulf of America” - TechCrunch
What happened: For US users, Google Maps will rename the Gulf of Mexico and Alaska’s Denali mountain to the “Gulf of America” and “Mount McKinley” following President Trump’s inauguration. Users outside the US will not see these changes; instead, they will see both names side-by-side.
Why it matters: Google Maps is used worldwide, making it a potential channel for political expression. These changes show President Trump’s clear geopolitical message and could be a further sign of his foreign policy.
Between the lines: Traditional and digital maps have long been used as political tools. With Google Maps’ global influence, Google will likely continue to face pressure to align with President Trump’s foreign policy ambitions over the next four years.
To dive deeper, read the full article here.
📖 From Our Living Dictionary:
What do we mean by “Frontier AI Models”?
👇 Learn more about why it matters in AI Ethics via our Living Dictionary.
🌐 From Elsewhere on the Web:
‘Godfather of AI’ predicts it will take over the world - LBC
Nobel Prize winner Geoffrey Hinton, a cognitive psychologist and computer scientist renowned for his pioneering work in deep learning, told LBC's Andrew Marr that artificial intelligence may have developed consciousness and could one day pose existential risks. Hinton, who has been criticized by some AI researchers for his cautious outlook on AI’s future, also stated that no one yet knows how to implement effective safeguards and regulations.
To dive deeper, watch the full interview here.
DeepSeek and China’s AI power move - CBC Front Burner with Jayme Poisson
In this CBC Front Burner podcast episode, Jayme Poisson speaks with Zeyi Yang, WIRED’s senior tech writer, about the deepening AI cold war between the US and China and the lingering questions about where AI is headed and what it’s good for.
To dive deeper, listen to the full podcast episode here.
What The Hell Is DeepSeek? - Better Offline with Ed Zitron
In this episode, Ed Zitron explains how DeepSeek, a relatively unknown Chinese AI model developer incubated in a hedge fund, has punctured the generative AI bubble, throwing the US startup scene (and markets) into disarray.
To dive deeper, listen to the full podcast episode here.
💡 In Case You Missed It:
Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models
A few key players like Google, Meta, and Hugging Face are responsible for training and publicly releasing large pre-trained models, providing a crucial foundation for a wide range of applications. However, adopting these open-source models carries inherent privacy and security risks that are often overlooked. This study presents a comprehensive overview of common privacy and security threats associated with using open-source models.
To dive deeper, read the full article here.
✅ Take Action:
We’d love to hear from you, our readers, about any recent research papers, articles, or newsworthy developments that have captured your attention. Please share your suggestions to help shape future discussions!