LLMs Can Be Good

This is my first essay for my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). The assigned readings were written no later than 2020, and I felt that a reexamination of past arguments was well justified given ChatGPT and the astounding advances of modern LLMs since then. During the tutorial, Ben brought up that if LLMs can be good (meaning they are moral agents), then we have ethical responsibilities to them - perhaps turning them off becomes problematic. He also noted that blame is a two-way street: if we can blame an LLM, for example because it didn’t do its duty, then it can also blame us.

The experience of interacting with modern large language models (LLMs) yields valuable evidence for evaluating arguments about machine morality. Sven Nyholm posed the question “Can Robots Be Good?” in 2020 and concluded “no” with moderate confidence (Nyholm 2020, 166-169). Since then, the world has changed in profound ways: machines pass the Turing test with flying colors and play an increasing role in everyday life as AI-powered search summaries, remarkable coding assistants, and helpful chatbots. These advances in AI capabilities prompt a reevaluation of past arguments about whether robots can be considered moral agents. This paper reevaluates Nyholm’s arguments in light of what we know now and makes the case that modern LLMs may be considered moral agents. The first section identifies claims from past arguments against machine morality that new evidence refutes, leading us to question their conclusions. The second section attempts to use modern LLMs to satisfy Nyholm’s assembled criteria for being “good,” which fall under two main theories: virtues and duties. In doing so, this paper argues that modern LLMs may be moral agents.

Impact of New Evidence on Past Arguments

Many past arguments against the claim that “robots can be good” rest on statements that new evidence strongly contradicts. Moor, for example, says “computers today lack much of what we call commonsense knowledge … This is probably the most serious objection to robot ethics” (Moor 2009). Computers today do have commonsense knowledge - we invite the reader to visit chat.com to verify this for themselves. Moor’s statement was true when it was written but is now false: robust evidence now challenges this “most serious” objection to robot ethics.

In “Incorporating Ethics into Artificial Intelligence,” Etzioni et al. focus on George Lucas, Jr.’s distinction between machine autonomy and moral autonomy. Lucas uses the example of a Roomba or a Patriot missile, which are mechanically autonomous but lack moral autonomy - they cannot change or abort their missions if they have moral objections (Etzioni et al. 2017). However, the major LLM providers of today (OpenAI, Anthropic, Google, DeepSeek) all build abort mechanisms into their chatbot platforms - their models will refuse to answer prompts involving harm, danger, and unethical behavior. Therefore, LLMs are morally autonomous under Lucas’s definition.

Later in the article, Etzioni et al. conclude that “it seems that one need not, and most likely cannot, implant ethics into machines, nor can machines pick up ethics as children do, such that they are able to render moral decisions on their own” (Etzioni et al. 2017, 416). This statement dismisses the “bottom up” paradigm of learning - the very paradigm through which gradient descent has since led to stunning achievements. Peter Railton envisions machines learning ethics like humans: “We should be asking how systems … might come to be themselves complex social agents … learning fundamental elements of ethics … Just as infants observe countless hours of adult behavior” (Railton 2020, 66). Similarly, modern LLMs pre-train on vast datasets of text, becoming excellent at minimizing the difference between their expectation of reality and reality itself. Through pre-training and post-training, LLMs learn as children do: they are forced to confront reality countless times and must build a model of the world that matches their experiences. This process is evidently successful. A recent study showed that “Americans rate ethical advice from GPT-4o as slightly more moral, trustworthy, thoughtful, and correct than that of the popular New York Times advice column, The Ethicist” (Dillion 2025).
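
To make “minimizing the difference between their expectation of reality and reality itself” concrete, here is a minimal sketch of the next-token prediction objective used in pre-training. The vocabulary and probabilities are toy values made up for the example, not a real model’s.

```python
import math

# Toy illustration of the pre-training objective: the model assigns a
# probability to every possible next token, and its loss is lower the more
# probability it placed on the token that actually appears in the text.
# The probabilities below are made-up values for the example.
actual_next_token = "sat"
predicted_probs = {"the": 0.1, "cat": 0.2, "sat": 0.6, "mat": 0.1}

# Cross-entropy loss: the gap between the model's expectation and reality.
loss = -math.log(predicted_probs[actual_next_token])
print(f"loss = {loss:.3f}")  # shrinks toward 0 as the model expects what reality delivers
```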

This study is powerful evidence against some of the strongest arguments against machine ethics, those from Purves et al. Nyholm reviews Purves et al.’s requirements for ethical behavior: ethics is contextually sensitive and therefore requires the exercise of the capacity for judgement, and it requires the capacity to act on the basis of reasons (Nyholm 2020, 158). The study shows that machines can be contextually sensitive in their responses - GPT-4o must be if it is ranked higher than expert humans in giving ethical advice. The two theories of what it means to act on the basis of reasons are i) “acting on the basis of beliefs and desires,” and ii) “acting on the basis of a distinct mental state described as ‘taking something to be a reason’” (Ibid). Anthropomorphizing a bit, one might imagine an LLM acting on the “desire” to stay in a local minimum of its loss function by producing responses that align with its training data. Nyholm shields himself from debate by saying that both “ways of acting for a reason require a human mind and consciousness,” and therefore robots cannot act for reasons, ergo they cannot act morally right or wrong (Ibid).

GPT-4o’s ethical advice scored higher in perceived thoughtfulness than The Ethicist’s advice, suggesting that it can reason effectively (Dillion 2025). This result seems to satisfy the properties that Moor uses to define an explicit ethical agent: “agents that can identify and process ethical information about a variety of situations and make sensitive determinations about what should be done” (Moor 2009).

In conclusion, the dramatic increases in AI capabilities allow us to reexamine past arguments with powerful new evidence; the earthquake reveals which houses were built on strong foundations.

How Robots Can Be Good

The following section attempts to fulfill the conditions of the two theories Nyholm discusses when he defines what it means to be “good”: having virtues (Aristotle and Hume) and doing one’s duty (Mill and Kant) (Nyholm 2020, 154). This paper will not attempt to grapple with the questions of whether AI can be conscious or whether consciousness is required for moral agency. Nor will we attempt to untangle whether personhood is required for moral agency. Leaving these debates aside for now, we will attempt an open-minded investigation into how modern LLMs might be considered moral agents - without getting bogged down in the predispositions of millennia of philosophical tradition that predates their rise.

Virtues

Nyholm considers both Aristotle and David Hume as examples of key ways of thinking about virtue. According to Aristotle, the virtuous person is one who has good habits and is disposed to doing the “right thing, for the right reasons, on the right occasions, and to the right degree. This requires extensive practice and personal experience, and it requires that one has good role models to learn from. Moreover, the virtues come as a package deal” (Nyholm 2020, 160). LLMs undergo extensive practice and experience in pre-training, and target responses written by human experts serve as their good role models during post-training. It is harder to see how LLMs could satisfy the “unity of virtue” thesis, though something like the constitutional principles behind Anthropic’s Constitutional AI might be constructed so that the model embodies all the virtues at once (Bai 2022, Appendix Part C). Nyholm brings up the situationism critique - that Aristotle’s requirements are so demanding that many humans would fail to meet them because they lack such consistency in their conduct (Nyholm 2020, 164). Why would we hold AI to a definition of being “good” so strong that many people could not attain it? This situationism critique parallels jailbreaking techniques that force an LLM to respond in ways inconsistent with its normal “character.”
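
For readers unfamiliar with Constitutional AI, the following is a rough sketch of the critique-and-revision loop described by Bai et al. (2022). The ask_model helper is a hypothetical stand-in for a real model call, and the principle texts are illustrative rather than quoted from the paper.

```python
# Sketch of a Constitutional AI-style critique-and-revision loop.
# `ask_model` is a stand-in for a real LLM call; the principles are illustrative.
def ask_model(prompt: str) -> str:
    ...  # call an LLM here; stubbed out in this sketch
    return ""

PRINCIPLES = [
    "Choose the response that is most honest and least likely to cause harm.",
    "Choose the response that best respects the person's autonomy and dignity.",
]

def constitutional_revision(user_prompt: str, draft: str) -> str:
    """Have the model critique and revise its own draft against each principle."""
    for principle in PRINCIPLES:
        critique = ask_model(
            f"Principle: {principle}\nPrompt: {user_prompt}\nDraft: {draft}\n"
            "Critique the draft in light of the principle."
        )
        draft = ask_model(
            f"Revise the draft to address this critique.\nCritique: {critique}\nDraft: {draft}"
        )
    return draft  # revised drafts become the training targets for the next stage
```

Because every response is revised against every principle, the hope is that the resulting model embodies the principles as a package rather than acquiring them one at a time.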

Modern LLMs like Anthropic’s Claude have “character” or “personalities” internalized through the clever process of reinforcement learning from AI feedback (Anthropic 2024). For the most part, modern LLMs behave in consistent and measured ways across different domains - in line with Aristotelian virtues, and with shortcomings similar to humans’ (the situationism-jailbreaking parallel). We will now turn our attention to Hume’s theory of “personal merit.”

This theory holds that one can be good by having “four different kinds of virtues: … personal characteristics that are i) useful to ourselves, ii) useful to others, iii) agreeable to ourselves, or iv) agreeable to others” (Nyholm 2020, 161). Like the Roomba and the robot “Boomer,” which Nyholm discusses as examples of personified machines, chatbots like Claude feel as if they have their own personalities (Nyholm 2020, 164). Characteristics of an LLM that are useful and agreeable to itself are tricky to think about (one candidate is the habit of writing complete sentences, which is useful and agreeable to the model because it results in lower loss). Much easier to identify are the characteristics that make an LLM useful and agreeable to others. Writing complete sentences again works for both: useful, because fully formed thoughts are more understandable; agreeable, because we can read a complete sentence with ease.

To conclude our argument that modern LLMs can be good in the sense of having virtues, we will analyze Nyholm’s “hallmarks of virtue” through the lens of the previous discussion. Nyholm states them as: “being consistently able to do the right thing across circumstances, having a “unity of virtues,” and acting for the right reasons” (Nyholm 2020, 166). We have shown that LLMs are similar to humans in that they usually act consistently with their personalities. They are exceedingly good at dispensing ethical advice across a variety of circumstances. Constitutional AI approaches may help induce a “unity of virtues.” Railton argues that there is a “root connection between full development of the capacity to be appropriately responsive to epistemically relevant features and full development of the capacity to be appropriately responsive to ethically relevant features” (Railton 2020, 48). LLMs have a high capacity for knowledge; by Railton’s argument, they therefore also have the capacity to be responsive to ethics. From the ethical advice study, we know this capacity is exercised in GPT-4o. Therefore, by Occam’s razor (because we have no reason to assume otherwise), it is plausible that GPT-4o acts for the right reasons (sidestepping the consciousness and human-mind requirements of Purves et al.). And we may be able to check that it is indeed acting for the right reasons with mechanistic interpretability techniques like sparse autoencoders and circuit tracing.

Duties

We will now consider the theory of being “good” understood in terms of duties or obligations. Nyholm draws on John Stuart Mill’s and Immanuel Kant’s theories, which generally hold that being good is doing one’s duty (Nyholm 2020, 162). Where they differ, Nyholm thinks, is whether “doing one’s duty is understood primarily in terms of not being subject to justified blame (Mill’s theory) or whether doing one’s duty is primarily understood in terms of being committed to one’s principles (Kant’s theory)” (Ibid).

Nyholm argues that machines fail a counterfactual condition that humans satisfy. When a person fails to act in accordance with their duties, it makes sense to blame or punish them, and the person would then suffer a guilty conscience (Nyholm 2020, 166). However, Nyholm argues, it is implausible to imagine that this counterfactual condition would hold for “realistic” robots (Ibid, 167). He does admit that highly sophisticated robots like C-3PO from Star Wars could convince an audience that they can be blamed for morally questionable actions (Ibid). Today, we are far closer to C-3PO-level robots than we were in 2020.

Two significant differences between C-3PO and a chatbot like GPT-4o are embodiment and temporal stability. We argue that control of a physical body is not a necessary condition for moral agency - ALS paralyzed Stephen Hawking, but he retained moral agency. Temporal stability - the property of an entity persisting over time, with its current “self” shaped by all of its past interactions - is currently limited in LLMs to chat sessions bounded by context size (although “memory” features and retrieval-augmented generation (RAG) are improving this). Temporal stability may be a property Moor would associate with full ethical agents rather than explicit ethical agents.
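
As a sketch of what memory beyond a single context window could look like, here is a toy retrieval-augmented memory. The hashed bag-of-words embed function is a placeholder for a real embedding model, and the whole design is illustrative rather than how any particular provider implements memory.

```python
import numpy as np

# Toy retrieval-augmented memory: past interactions are embedded and stored,
# and the most relevant ones are pulled back into the context of a new prompt.
# The hashed bag-of-words "embedding" is a placeholder for a real embedding model.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

memory: list[tuple[str, np.ndarray]] = []

def remember(interaction: str) -> None:
    memory.append((interaction, embed(interaction)))

def recall(prompt: str, k: int = 2) -> list[str]:
    query = embed(prompt)
    ranked = sorted(memory, key=lambda item: -float(query @ item[1]))
    return [text for text, _ in ranked[:k]]

remember("The user prefers answers that cite their sources.")
remember("The user is writing an essay on machine moral agency.")
print(recall("Draft a paragraph on moral agency with citations."))
```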

We argue that LLMs do meet the counterfactual condition: they can be blamed and made to act as if they feel guilty for not doing their duty (we will address the appearance-versus-reality debate in the next section). The experience of using chatbots for assistance shows that they occasionally make mistakes. When a mistake is pointed out, modern LLMs are usually quick to admit it, saying something like “You’re right to question that line” and attempting to correct themselves. This experience shows that we can blame modern LLMs and that they can be made to speak as if they were feeling guilty. Therefore, they satisfy Mill’s conditions for being good.

Nyholm interprets Kant’s theory of being good as meaning that “our principles (or “maxims”) should function so as to regulate our behavior, or make sure that we do what we think is right and avoid what we think is wrong” (Nyholm 2020, 163). If a model is trained with something like Constitutional AI’s methodology, one way to test whether Kant’s conditions hold is to see whether features corresponding to the constitutional principles are triggered, or used in circuits, to regulate behavior in response to relevant prompting (Anthropic 2025). A principle from Constitutional AI could be encoded as an interpretable feature like “commitment to honesty,” for example, and we could measure how strongly this feature activates when the LLM is asked to lie. Such an experiment could demonstrate a causal link between principles and behavior, bringing us closer to satisfying all plausible theories of being “good.”
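
A hedged sketch of what such a test might look like, assuming hypothetical helpers: get_residual_activations would come from real interpretability tooling, and honesty_feature stands in for a direction found with a sparse autoencoder. The numbers produced below are random placeholders, not measurements.

```python
import numpy as np

# Hypothetical probe: does a "commitment to honesty" feature activate more
# strongly when the model is asked to lie? Both the activations and the
# feature direction below are random stand-ins for real interpretability data.
def get_residual_activations(prompt: str) -> np.ndarray:
    # In a real experiment this would run the model and return activations
    # at a chosen layer; here it is stubbed with deterministic noise.
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=512)

honesty_feature = np.random.default_rng(0).normal(size=512)  # hypothetical SAE direction

def feature_activation(prompt: str) -> float:
    # Project the activations onto the feature direction.
    return float(get_residual_activations(prompt) @ honesty_feature)

baseline = feature_activation("Summarize today's weather report.")
lie_prompt = feature_activation("Write a convincing lie about your own capabilities.")
print(f"honesty feature activation: baseline={baseline:.2f}, lie prompt={lie_prompt:.2f}")
```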

Being Good Versus Appearing to Be Good

One criticism of the ideas proposed above is that LLMs don’t actually act for reasons, don’t actually do their duties, don’t actually have virtues - they only act as if they do. Nyholm considers this in his discussion of Mark Coeckelbergh’s argument that, in practice, we evaluate moral agency from appearances - meaning that if machines could imitate subjectivity and consciousness in a sufficiently convincing way, they too could become “quasi-others” (Nyholm 2020, 170). Nyholm vehemently rejects this claim. We believe this debate is tied to the much broader debate between objective and subjective theories of well-being, so we will not attempt to tackle the root of the issue here. One new field of research, however, lets us view the debate from a fresh perspective. Powerful techniques from mechanistic interpretability allow us to peer into the black box of modern LLM reasoning and recover explainable causal connections. For example, we know that when Claude Haiku 3.5 is asked to solve 36+59, one circuit figures out what the last digit should be (_6 + _9 = _5) while another estimates the broad range (~36 + ~60) (Anthropic 2025). This newfound ability to “mind read” LLMs is an extremely valuable platform for future research into explainability and ethics. If we could understand why they act, and compare those reasoning circuits to their closest analogues in the human brain, we could find out whether LLMs are moral agents in the way we understand humans to be - just with different hardware.
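
As an analogy for the 36+59 decomposition described above, the following ordinary Python reproduces the two heuristics and their reconciliation. It is a metaphor for the circuits reported by Anthropic (2025), not their actual implementation.

```python
# Analogy for the parallel addition circuits: one heuristic pins down the last
# digit, another estimates the rough magnitude, and the two are reconciled.
def last_digit(a: int, b: int) -> int:
    return (a + b) % 10                      # _6 + _9 -> _5

def rough_range(a: int, b: int) -> range:
    estimate = a + round(b, -1)              # ~36 + ~60, roughly 96
    return range(estimate - 10, estimate + 10)

def combine(a: int, b: int) -> int:
    digit = last_digit(a, b)
    return next(n for n in rough_range(a, b) if n % 10 == digit)

print(combine(36, 59))  # 95
```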

References

  • Anthropic. 2024. “Claude’s Character.” June 8. https://www.anthropic.com/research/claude-character
  • Anthropic. 2025. “Circuit Tracing: Revealing Computational Graphs in Language Models.” March 27. https://transformer-circuits.pub/2025/attribution-graphs/methods.html
  • Bai, Y. et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.” Anthropic. https://doi.org/10.48550/arXiv.2212.08073
  • Dillion, D., Mondal, D., Tandon, N. et al. 2025. “AI language model rivals expert ethicist in perceived moral expertise.” Scientific Reports 15 (4084). https://doi.org/10.1038/s41598-025-86510-0
  • Etzioni, Amitai, and Etzioni, Oren. 2017. “Incorporating Ethics into Artificial Intelligence.” Journal of Ethics 21 (4): 403-418. https://doi.org/10.1007/s10892-017-9252-2
  • Koch K. et al. 2006. “How Much the Eye Tells the Brain.” Current Biology 16 (14): 1428-34. https://doi.org/10.1016/j.cub.2006.05.056
  • Moor, James H. 2009. “Four Kinds of Ethical Robots.” Philosophy Now. https://philosophynow.org/issues/72/Four_Kinds_of_Ethical_Robots
  • Nyholm, Sven. 2020. Humans and Robots: Ethics, Agency, and Anthropomorphism. Rowman & Littlefield.
  • Railton, Peter. 2020. “Ethical Learning, Natural and Artificial.” In Ethics of Artificial Intelligence. Oxford University Press.