<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://simonsocolow.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://simonsocolow.com/" rel="alternate" type="text/html" /><updated>2026-03-04T03:11:17+00:00</updated><id>https://simonsocolow.com/feed.xml</id><title type="html">Simon Socolow</title><subtitle>Hey there! I&apos;m Simon, welcome to my site!</subtitle><author><name>Simon Socolow</name></author><entry><title type="html">Academics and Athletics</title><link href="https://simonsocolow.com/blog/academics-and-athletics/" rel="alternate" type="text/html" title="Academics and Athletics" /><published>2025-09-07T00:00:00+00:00</published><updated>2025-09-07T00:00:00+00:00</updated><id>https://simonsocolow.com/blog/academics-and-athletics</id><content type="html" xml:base="https://simonsocolow.com/blog/academics-and-athletics/"><![CDATA[<p>Time and time again throughout my college career, I’ve thought to myself</p>
<blockquote>
  <p>“Why am I out here rowing?? I’m falling behind where I could be if I was studying and learning and making cool stuff instead!”</p>
</blockquote>

<p>Implicitly, I believed in a false dichotomy, a spectrum fallacy, an inaccurate mental map representing academics and athletics as engaged in a zero-sum game - to get better at one would require becoming worse at the other. I was wrong. And today, I hope to show you why.</p>

<h2 id="biology">Biology</h2>
<p>To investigate how athletics could complement and assist with academics, I did two Deep Research queries, one with <a href="https://chatgpt.com/share/68be02ee-1d80-8000-bbe5-63e3c3f28b9f">ChatGPT</a> and one with <a href="https://g.co/gemini/share/3d557bd753e5">Gemini</a>.</p>

<p>In both reports, a protein named <a href="https://en.wikipedia.org/wiki/Brain-derived_neurotrophic_factor">Brain-Derived Neurotrophic Factor (BDNF)</a> is identified as a key factor responsible for the physiological benefits of exercise on brain functioning. Exercise, in particular cardio, has been shown to significantly increase BDNF synthesis in the brain. Interestingly, this effect may be <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4314337/">more pronounced in men than women</a>. BDNF supports the survival of existing neurons and encourages growth and differentiation of new neurons. Intuitively, I know that I feel better after working out. I also know that if I don’t exercise for a few days I become lethargic and generally unproductive.</p>

<h2 id="parkinsons-law">Parkinson’s law</h2>
<p><a href="https://en.wikipedia.org/wiki/Parkinson%27s_law">Parkinson’s law</a> states that “work expands so as to fill the time available for its completion”. Regularly scheduled practices (and to some extent study methods like <a href="https://en.wikipedia.org/wiki/Pomodoro_Technique">Pomodoro</a>) act as unavoidable deadlines and chunks of time you know are off-limits in advance. This theory is in accordance with the observed phenomenon that out-of-season athletes feel like they get less done! “If you want something done, ask a busy person” - Ben Franklin (?).</p>

<h2 id="right-tool-for-the-job">Right tool for the job</h2>
<p>Gemini’s report included an interesting table explaining how one might utilize different types of exercise for different goals. For example, as we saw, cardio increases the production of BDNF. Therefore, you may want to go for a run before a study session during which your goals are to learn and commit information to memory. Apparently (but this should be more thoroughly checked), strength training is an antidote to procrastination as it heightens executive function. And mind-body exercises like yoga or tai chi strengthen reasoning, attention, and problem-solving.</p>

<p>Not too long ago, I chatted over dinner with a friend about performance enhancing substances like caffeine. His quote struck me: “everything is a tradeoff”. You get to use the future now, but you’ll pay for it later. For an hour-long test with nothing else going on later that day, having the ability to use stimulants is awesome. It’s an option. And usually, the more options the better.</p>

<p>Just like being judicious with stimulants, knowing how certain foods will affect your body and energy levels is also a superpower. We talked about <a href="https://en.wikipedia.org/wiki/Glycemic_index">glycemic index (GI)</a> and how he avoids carbs at dinnertime because they have a high GI.</p>

<p>All these ideas center around one meta-idea: use the right tool for the job. And the winds of evidence point us towards the realization that exercise is a tool, a multifaceted tool, that is useful in doing the job of being a college student.</p>

<h2 id="diminishing-returns-of-exercise">Diminishing returns of exercise</h2>
<p>While there is strong evidence that <em>some</em> exercise is important and beneficial, there must be a point beyond which exercise is harmful. Imagine someone working out eight hours a day - not much time to do other things. Unfortunately, I agree with my uncle who said collegiate athletics’ “competitiveness leads to a constant escalation of the amount of time spent on workouts. We’re not very good at controlling that escalation and need to be more conscious of that.”</p>

<p>In game theory terms, I think this represents the Prisoner’s Dilemma. At Williams, a D3 NESCAC school, we have an emphasis on being a ‘<strong>student</strong>-athlete’ where the <em>student comes first</em>. Meaning that academics should always have priority over athletics. However, I don’t think this is internalized enough by some of my teammates and possibly coaches. This is my attempt at representing the situation (where Athlete-Student means athletics is being prioritized over academics):</p>

<p><img src="/assets/images/athleticspd.png" alt="athletics pd" /></p>
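<p>For readers who like to poke at the structure, here is a minimal sketch of a Prisoner’s Dilemma check. The payoff numbers are hypothetical, chosen only to have the right shape; they are not taken from the figure above. “S” stands for the student-athlete time commitment and “A” for athlete-student.</p>

```python
# Hypothetical payoffs (higher is better) for two rival teams.
# Key: (my choice, rival's choice) -> (my payoff, rival's payoff).
# "S" = student-athlete commitment, "A" = athlete-student commitment.
PAYOFF = {
    ("S", "S"): (3, 3),
    ("S", "A"): (0, 4),
    ("A", "S"): (4, 0),
    ("A", "A"): (1, 1),
}


def best_response(rival_choice):
    """Return the choice that maximizes my payoff against a fixed rival choice."""
    return max("SA", key=lambda mine: PAYOFF[(mine, rival_choice)][0])


# Escalating is the best response whatever the rival does...
responses = {rival: best_response(rival) for rival in "SA"}
# ...even though both teams would prefer (S, S) to (A, A).
both_prefer_restraint = PAYOFF[("S", "S")][0] > PAYOFF[("A", "A")][0]
```

<p>With these assumed numbers, escalating is the best response in both cases, which is exactly why mutual restraint is hard to sustain without an outside rule or incentive.</p>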

<p>So if everyone had the same <em>student</em>-athlete time commitment for athletics, the student-athlete vision would be possible. And there are NESCAC rules for this like no mandatory practices out of season. However, in practice (haha), there are de-facto mandatory practices during the offseason. Everyone who wants to win has an incentive to push for more time practicing - quality time spent doing sport seems to be well-correlated with performance in sport. I think to solve this, coaches need an incentive for academic performance to balance the tug-of-war that currently appears one-sided.</p>]]></content><author><name>Simon Socolow</name></author><category term="blog" /><category term="athletics" /><summary type="html"><![CDATA[Time and time again throughout my college career, I’ve thought to myself “Why am I out here rowing?? I’m falling behind where I could be if I was studying and learning and making cool stuff instead!”]]></summary></entry><entry><title type="html">Did Exeter College Boat Club Invent the Tie?</title><link href="https://simonsocolow.com/blog/ecbc-inventing-tie/" rel="alternate" type="text/html" title="Did Exeter College Boat Club Invent the Tie?" /><published>2025-08-24T00:00:00+00:00</published><updated>2025-08-24T00:00:00+00:00</updated><id>https://simonsocolow.com/blog/ecbc-inventing-tie</id><content type="html" xml:base="https://simonsocolow.com/blog/ecbc-inventing-tie/"><![CDATA[<p>At the Tuft’s college boathouse, I spotted a book - “Rowing Blazers” by Jack Carlson. Naturally, having spent the past year rowing for Exeter College Boat Club (ECBC) at Oxford (and experiencing many of the traditions which go on there including blazers) I was interested. I found the Exeter College page:
<img src="/assets/images/ecbctie.jpg" alt="ecbc tie page" />
Wow! We invented the tie??? I had never heard of this before. Asking my friends on the team, they had also never heard this story. Digging a little deeper, the ECBC Wikipedia page supports this idea:</p>
<blockquote>
  <p>The first known use of a tie in club colours was by members of Exeter College eight. In 1880, they took the ribbons off their boaters and tied them around their necks as a way to identify with their college.[9]</p>
</blockquote>

<p>Ok, and what is this source? A Wayback Machine record of a silk tie designer’s article on the history of formal wear. It says:</p>

<blockquote>
  <p>In 1880, the rowing club at Oxford University’s Exeter College One men’s club, invented the first school tie by removing their ribbon hat bands from their boater hats and tying them, four-in-hand. When they ordered a set of ties, with the colours from their hatbands, they had created the modern school tie. School, club, and athletic ties appeared in abundance. Some schools had different ties for various grades, levels of achievement, and for graduates.</p>
</blockquote>

<p>There is probably a grain of truth here, so I wanted to dig deeper. <a href="https://chatgpt.com/share/68ab69d0-2d38-8000-9adc-352c5c070840">Deep research time</a>. This returned more sources in support of ECBC’s invention of the first college or club tie. The tie as an element of fashion already existed, but ECBC rowers appear to have been the first to use a tie in their colors as a marker of identity. Further sources supporting this claim are an <a href="https://www.oxfordstudent.com/2011/01/20/the-ox-idental-tourist-the-tie-collection-the-bear-inn/#:~:text=Fittingly%2C%20the%20origins%20of%20school,made%20versions%20for%20its%20members">Oxford Student article</a> about the Bear Inn (where there are ties on the ceiling), <a href="https://ecbc.web.ox.ac.uk/traditions#:~:text=The%20first%20club%20ties%20of,wore%20them%20around%20their%20necks">ECBC’s own website</a>, and two more <a href="https://turnbullandasser.com/blogs/off-the-cuff/off-the-cuff-history-of-neckwear#:~:text=Ties%20have%20long%20been%20used,fastened%20them%20around%20their%20necks">pages on the history</a> of <a href="https://www.tiesncuffs.com.au/pages/the-history-of-the-tie?srsltid=AfmBOooEPpt7kd5C4URFDd0Okz_N1gCw-76KDIfhGcJNPXbUby1kl_ee#:~:text=school%2C%20etc,and%20clubs%20to%20follow%20suit">neckwear</a>.</p>

<p>Now I wonder how these sources are related. If all but one of them blindly copied the one, then my confidence in this idea should not be strong. But if these sources are independent, each putting a different trickled-down path from history onto the Internet rather than copying one another, then my confidence is strong. By default, I do believe this is true because of Occam’s razor.</p>]]></content><author><name>Simon Socolow</name></author><category term="blog" /><category term="history" /><category term="athletics" /><summary type="html"><![CDATA[At the Tuft’s college boathouse, I spotted a book - “Rowing Blazers” by Jack Carlson. Naturally, having spent the past year rowing for Exeter College Boat Club (ECBC) at Oxford (and experiencing many of the traditions which go on there including blazers) I was interested. I found the Exeter College page: Wow! We invented the tie??? I had never heard of this before. Asking my friends on the team, they had also never heard this story. Digging a little deeper, the ECBC Wikipedia page supports this idea: The first known use of a tie in club colours was by members of Exeter College eight. In 1880, they took the ribbons off their boaters and tied them around their necks as a way to identify with their college.[9]]]></summary></entry><entry><title type="html">Uploading SVGs to google slides</title><link href="https://simonsocolow.com/tech/uploading-svg-to-google-slides/" rel="alternate" type="text/html" title="Uploading SVGs to google slides" /><published>2025-08-03T00:00:00+00:00</published><updated>2025-08-03T00:00:00+00:00</updated><id>https://simonsocolow.com/tech/uploading-svg-to-google-slides</id><content type="html" xml:base="https://simonsocolow.com/tech/uploading-svg-to-google-slides/"><![CDATA[<p>I’m working with a friend, making a poster in google slides. We have a bar chart to add to the poster, so naturally we download it as an SVG (scalable vector graphics - i.e. 
it never gets pixelated no matter how far you zoom in) and upload it to google slides.
<img src="/assets/images/gslides-svg.png" alt="error uploading" />
Oof. What???!!! Google products are probably the most used products by any company anywhere in all of history. How on Earth do they not allow people to upload SVGs? I was dumbstruck upon seeing this for the first time because I remembered uploading SVGs successfully in the past. It turns out that SVGs can be a security risk because they can contain JavaScript, so at some point in 2021 Google disallowed their use. I can’t find any official announcement about this; the closest thing I found is <a href="https://support.google.com/docs/thread/103766233/images-in-svg-format-no-longer-supported?hl=en">this support forum thread</a>. Also, the SVG specification basically has a ton of functionality that almost never gets used (see this <a href="https://news.ycombinator.com/item?id=39079943">HN thread</a> for some of the lore behind it).
Ok, this is a challenge. I want to get my chart into google slides as a vector graphic so when we print out our poster everything is crisp. After some dead ends, I found a pipeline that works for me (I am running Ubuntu). There is probably a more optimal way to do this. Here we go:</p>
<ol>
  <li>Upload SVG to LibreOffice Writer</li>
  <li>Export to PDF with “Lossless compression” on <img src="/assets/images/librewriterpdf.png" alt="showing export settings" /></li>
  <li>Open the PDF in LibreOffice Draw and select everything with Ctrl-A <img src="/assets/images/libredrawselection.png" alt="show figure selection" /></li>
  <li>Open a new LibreOffice Impress presentation and paste into it <img src="/assets/images/libreimpresspaste.png" alt="showing paste" /></li>
  <li>Save as .odp</li>
  <li>Convert to .pptx with <code class="language-plaintext highlighter-rouge">libreoffice --headless --convert-to pptx blogsvgtestpres.odp</code></li>
  <li>Upload the .pptx to google drive</li>
  <li>Open in google slides</li>
  <li>Copy the figure and paste to wherever you want in google slides or google drawings! <img src="/assets/images/uploadingsvgtoslides.png" alt="copying figure" /></li>
</ol>
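<p>Steps 1 to 5 are manual GUI work, but the conversion in step 6 is easy to script. A minimal sketch, assuming <code>libreoffice</code> is on your PATH; the function names are my own, and the flags are the same ones used in the command above:</p>

```python
import subprocess


def convert_command(odp_path):
    """Build the same headless conversion command used in step 6."""
    return ["libreoffice", "--headless", "--convert-to", "pptx", odp_path]


def convert_to_pptx(odp_path):
    """Run the conversion; raises CalledProcessError if LibreOffice fails."""
    subprocess.run(convert_command(odp_path), check=True)


# Example (not executed here): convert_to_pptx("blogsvgtestpres.odp")
command = convert_command("blogsvgtestpres.odp")
```

<p>Passing the command as a list instead of one string avoids shell quoting problems if the file name contains spaces.</p>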

<p>This process is more steps than it should be. But this works (for now).</p>]]></content><author><name>Simon Socolow</name></author><category term="tech" /><category term="info" /><category term="problemsolving" /><summary type="html"><![CDATA[I’m working with a friend, making a poster in google slides. We have a bar chart to add to the poster, so naturally we download it as an SVG (scalable vector graphics - i.e. it never gets pixelated no matter how far you zoom in) and upload it to google slides. Oof. What???!!! Google products are probably the most used products by any company anywhere in all of history. How on Earth do they not allow people to upload SVGs? I was dumbstruck upon seeing this for the first time because I remembered uploading SVGs successfully in the past. It turns out that SVGs are a security threat because they can contain javascript, so at some point in 2021 Google disallowed their use. I can’t find any official announcement about this, the closest thing I found is this support forum. Also, the SVG specification basically has a ton of functionality that almost never gets used (see this HN thread for some of the lore behind it). Ok, this is a challenge. I want to get my chart into google slides as a vector graphic so when we print out our poster everything is crisp. After some dead ends, I found a pipeline that works for me (I am running Ubuntu). There is probably a more optimal way to do this. 
Here we go: Upload SVG to LibreOffice Writer Export to PDF with “Lossless compression” on Open the PDF in LibreOffice Draw and select everything with Ctrl-A Open a new LibreOffice Impress presentation and paste into it Save as .odp Convert to .pptx with libreoffice --headless --convert-to pptx blogsvgtestpres.odp Upload the .pptx to google drive Open in google slides Copy the figure and paste to wherever you want in google slides or google drawings!]]></summary></entry><entry><title type="html">Saying thank you</title><link href="https://simonsocolow.com/blog/saying-thank-you/" rel="alternate" type="text/html" title="Saying thank you" /><published>2025-07-27T00:00:00+00:00</published><updated>2025-07-27T00:00:00+00:00</updated><id>https://simonsocolow.com/blog/saying-thank-you</id><content type="html" xml:base="https://simonsocolow.com/blog/saying-thank-you/"><![CDATA[<p>When people from different cultures interact, misunderstanding may occur. I think one cause of this arises from the increased difficulty of empathizing with (or putting yourself in the shoes of) someone raised in a different cultural environment. Another cause, which I would like to shed some light on in this post, are variations in societal norms. The example I want to analyze here is saying “thank you”.</p>

<p>On spring break (or “the vac” as it is called at Oxford), I traveled to Morocco. I flew into Marrakesh, met up with my American friend on a Fulbright, and we trained to Casablanca to stay with a friend I had made through competing at hackathons together. My Moroccan friend’s dad picked us up from the train station and gave us a tour of the city. Then - it was Ramadan so everyone was fasting - we broke fast with them and ate from an incredible spread of dishes. They offered us food and all we could say was “thank you” so many times it began to feel awkward. We were also the center of attention of the entire household which contributed to my unease.</p>

<p>During my receipt of a mind-boggling amount of hospitality (really - it was on another level), I continually expressed my gratitude: “thank you”, “thanks, this is amazing”. But then, the father said “stop saying thank you, no need”. And I realized that our expressions of gratitude were making <em>them</em> uncomfortable - just as my receipt of their incredible hospitality was making <em>me</em> uncomfortable.</p>

<p>By this point of Iftar (the meal at sunset which breaks the day’s fast), we were running out of conversation topics so I proffered this to the table. In the states, my mom harped on us to always say thank you. But in the situation I found myself in, I was being told <em>not</em> to say thank you. Is this a cultural difference? Do people not say thank you in Morocco? People do say thank you in Morocco - “choukran”, but as a guest it seemed to be bad form to repeatedly thank them as we received each new dish / present / experience they graced us with. We laughed at this cultural difference and took cracks at explaining it. Could it be a reflection of America’s transactional culture? To accept something without the token expression, in the states, is to appear rude. Perhaps it satisfies the roleplay of the two-party transaction, signifying that you played a role - the giver-recipient transaction was not totally one-sided (a scary prospect - receiving without giving back???).</p>

<p>We had happened upon a difference in cultural norms. Then I remembered one salient example of saying vs not saying thank you on a world stage: the disastrous <a href="https://youtu.be/v_kTNIYsFnQ?t=230">Zelensky - Trump meeting</a>. Is the Ukrainian social norm of saying thank you subtly different from the American one? Could an awareness of this have prevented the diplomatic tragedy of February 28, 2025?
<img src="/assets/images/argument.png" alt="argument" />
There is no “correct” culture, but for Zelensky to achieve his goals, adopting the American thank you culture would have benefited him. And for my goal of pleasing our hosts, refraining from repeatedly thanking them would have benefited me. However, we both may have found the frictions of breaking from old habits insurmountable. I find probing differences like these to be fascinating as they also reveal glimpses of other cultures’ worldviews (accepting without feeling any obligation to repay).</p>]]></content><author><name>Simon Socolow</name></author><category term="blog" /><category term="ideas" /><summary type="html"><![CDATA[When people from different cultures interact, misunderstanding may occur. I think one cause of this arises from the increased difficulty of empathizing with (or putting yourself in the shoes of) someone raised in a different cultural environment. Another cause, which I would like to shed some light on in this post, are variations in societal norms. The example I want to analyze here is saying “thank you”.]]></summary></entry><entry><title type="html">Room Drawings and Rationality</title><link href="https://simonsocolow.com/game%20theory/room-drawings-and-rationality/" rel="alternate" type="text/html" title="Room Drawings and Rationality" /><published>2025-07-20T00:00:00+00:00</published><updated>2025-07-20T00:00:00+00:00</updated><id>https://simonsocolow.com/game%20theory/room-drawings-and-rationality</id><content type="html" xml:base="https://simonsocolow.com/game%20theory/room-drawings-and-rationality/"><![CDATA[<p>This is a story about a clever protocol, an unlikely implementation, and an ironic ending.</p>

<p>I’m living in a house with eight friends next year. There are nine rooms total, so everyone gets a single. We had to figure out who gets what room. Two of my friends had done all the administrative work with the landlord, so we let them get first pick. That left seven of us. We each ranked our room preferences, (e.g. Room #8 most preferred, Room #4 2nd most preferred, …) and put them into a spreadsheet. I was taking computational game theory at the time with Tomasz Wąs, my tutor, and so brought the issue to him. How do we maximize everyone’s preferences and assign everyone to a room? Some rooms (like Room #8) were ranked first by multiple people, and some rooms (like Room #9) were at the bottom of most people’s rankings. But someone was going to get Room #9. What is the optimal way to construct a fair assignment of people to rooms based on their preferences?</p>

<p>Tomasz’s research focuses on computational social choice, in particular on which voting and distribution algorithms satisfy certain properties, so he knew our options and showed me the website <a href="http://www.matchu.ai/">matchu.ai</a>. This is a mechanism design problem. We could go with <a href="http://www.matchu.ai/rsd">Random Serial Dictatorship (RSD)</a>, where we randomly select a person, they are assigned to their highest ranked room, then we remove that room from everyone else’s lists, then choose the next person. This algorithm has a fairness property (in that it treats everyone as equals), an efficiency property (no subgroups of people would be willing to swap rooms ex post), and truthfulness (no one could benefit from misreporting their rankings). Our other option was <a href="http://www.matchu.ai/psr">Probabilistic Serial Rule (PSR)</a>. PSR is a little more complicated. Borrowing from the matchu website:</p>

<blockquote>
  <p>It works as follows: suppose items are different types of pizzas. Each agent starts “eating” her top choice pizza at the same rate as every other agent. Once a pizza is consumed, agents move to their next preferred pizza; until all pizzas are eaten.</p>
</blockquote>

<p>The fraction of each pizza (room) a person has eaten is their probability of receiving it. These probabilities can then be broken down into a weighted sum of permutation matrices (matrices of ones and zeros, each assigning every person to exactly one room), and one of these is drawn at random with probability equal to its weight. Ask your favorite LLM if you want more info.</p>
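<p>To make the “eating” step concrete, here is a minimal sketch of how the PSR probabilities could be computed. It is my own illustration, not matchu’s code, and it assumes there are as many people as rooms and that everyone ranks every room. The names and rankings are made up. It stops at the probabilities and does not do the permutation-matrix decomposition.</p>

```python
from fractions import Fraction


def probabilistic_serial(rankings):
    """Return {person: {room: probability}} from simultaneous eating.

    rankings maps each person to a list of rooms, most preferred first.
    Assumes equally many people and rooms, and complete rankings.
    """
    rooms = {room for ranking in rankings.values() for room in ranking}
    remaining = {room: Fraction(1) for room in rooms}
    shares = {p: {room: Fraction(0) for room in rooms} for p in rankings}
    while any(remaining.values()):
        # Everyone eats their favorite room that still has some left.
        eaters = {}
        for person, ranking in rankings.items():
            target = next(room for room in ranking if remaining[room])
            eaters.setdefault(target, []).append(person)
        # Advance time until the first targeted room is fully eaten
        # (everyone eats at speed 1).
        step = min(remaining[room] / len(ps) for room, ps in eaters.items())
        for room, ps in eaters.items():
            for person in ps:
                shares[person][room] += step
            remaining[room] -= step * len(ps)
    return shares


example = {"A": [8, 4, 9], "B": [8, 9, 4], "C": [4, 8, 9]}
lottery = probabilistic_serial(example)
```

<p>In this made-up example A and B split Room 8 half and half, while C ends up with a 3/4 chance of Room 4. Using <code>Fraction</code> keeps the arithmetic exact, so a room is either fully eaten or not, with no floating point leftovers.</p>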

<p>PSR has advantages and disadvantages compared with RSD.</p>
<ul>
  <li>PSR is ex ante <strong>envy-free</strong>. In PSR, the lottery you receive (the distribution over which room you will get) is guaranteed to be at least as good for you as anyone else’s lottery, so you never envy (prefer to your own) another person’s lottery. In RSD, you may envy the lottery someone else has. This is a subtle point: RSD remains strategy-proof because replicating the other person’s rankings may backfire on your own lottery’s expected utility.</li>
  <li>PSR
    <blockquote>
      <p>guarantees the prescribed probabilistic matching (lottery) is efficient, meaning that no other lottery exists that can strictly improve the outcome for some participants without making other participants worse off.
So in PSR the lottery <em>itself</em> is Pareto efficient, whereas in RSD the lottery’s outcome is Pareto efficient.</p>
    </blockquote>
  </li>
  <li>RSD, however, is <strong>strategy-proof</strong>, meaning agents have no incentive to lie about their preferences. In some scenarios under PSR, agents can improve their expected utility by reporting a preference ranking that is not their true preference ranking.</li>
  <li>PSR is difficult to explain (not a technical reason against it but certainly an implementational one).</li>
</ul>
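<p>For comparison, RSD fits in a few lines. This is a minimal sketch with made-up names and rankings, assuming every person ranks every room:</p>

```python
import random


def random_serial_dictatorship(rankings, rng=None):
    """Return (pick order, {person: room}).

    rankings maps each person to a list of rooms, most preferred first.
    """
    rng = rng or random.Random()
    order = list(rankings)
    rng.shuffle(order)  # uniformly random pick order
    taken = set()
    assignment = {}
    for person in order:
        # Each person takes the best room on their list that is still free.
        room = next(r for r in rankings[person] if r not in taken)
        assignment[person] = room
        taken.add(room)
    return order, assignment


example = {"A": [8, 4, 9], "B": [8, 9, 4], "C": [4, 8, 9]}
order, assignment = random_serial_dictatorship(example, random.Random(0))
```

<p>Whoever is drawn first always gets their top choice, and nobody’s report can change the order, which is the intuition behind strategy-proofness.</p>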

<p>Ok, back to the story. In the group chat, my friend said “let’s just choose rooms randomly”. Vehemently opposed to this, as we could do a lot better than random, I disagreed and said “why don’t we try to maximize everyone’s preferences jointly?”. However, this was my first mistake. Had I been more rational, many messages and misunderstandings could have been avoided. After arguments and a poll, I got the group to mark down their room preference rankings in a table in Google Docs. Discussing with Tomasz, we decided to go with Random Serial Dictatorship (where a random order is decided and people choose rooms in that order) because 1. RSD is easy to explain and easy to understand why it’s fair (the most important reason) and 2. it is strategy-proof (not that important but I got a bit carried away with making our mechanism as anti-adversarial as possible). I then proposed RSD to the group.</p>

<p>It turned out that my friend’s initial suggested mechanism (“let’s just choose rooms randomly”) was actually about deciding an order randomly (so exactly RSD)! Much argument was involved because I didn’t understand that what he was proposing was exactly what I was proposing. The blame for that lies on both of us (although probably a bit more on me). He should’ve been clearer about what he actually meant. I should have been clear-headed enough to recognize that very few people would actually propose to assign rooms to people <em>at random</em> and that drawing a pick order was a pretty common thing so this was what he actually meant when he said “choosing rooms randomly”.</p>

<p>At this point, we had conquered our misunderstanding and were ready to draw an order for rooms. This is when my interest in cryptography and mechanism design was piqued by a challenge. My friends and I were spread across the world - if someone randomly generated pick orders, how could we trust them to not fudge the results in their favor? I wanted to make the system <strong>uncheatable</strong>. I created this protocol:</p>
<ol>
  <li>Number each participant (assuming &lt;= 10 participants)</li>
  <li>Establish a date/time in the future to run the protocol</li>
  <li>Extract the top headline of the New York Times website</li>
  <li>Use <a href="https://emn178.github.io/online-tools/sha3_512.html">SHA3-512</a> to get a hash from the headline</li>
  <li>Go through the hash from the left and each participant number encountered adds them next in the draw order (e.g. given 7b8f723, 7 chooses then 8 then 2 (7 has already chosen)).</li>
  <li>In the <em>very</em> unlikely event that the hash doesn’t contain enough of these distinct digits to make a full draw order, take the hash of the hash and continue with that.</li>
</ol>

<p>Therefore, everyone could be very certain that none of us (unless they could control the NYT headline) was fudging the draw because everyone could verify for themselves the protocol result.</p>
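<p>The protocol above can be sketched in a few lines of Python using the standard library’s <code>hashlib.sha3_512</code>. This is an illustration, not the exact procedure we ran: the function names are mine, participants are assumed to be labeled with distinct single digits, and “the hash of the hash” is assumed to mean hashing the hex text of the previous hash.</p>

```python
import hashlib


def extend_order(hex_string, participants, order):
    """Scan left to right, appending each participant digit the first time it appears."""
    for ch in hex_string:
        if ch in participants and ch not in order:
            order.append(ch)
    return order


def draw_order(headline, participants):
    """Derive a pick order from a headline; participants is a string of distinct digits."""
    labels = set(participants)
    order = []
    digest = hashlib.sha3_512(headline.encode("utf-8")).hexdigest()
    while len(order) != len(labels):
        extend_order(digest, labels, order)
        # Step 6 fallback: hash the hash and keep scanning if anyone is missing.
        digest = hashlib.sha3_512(digest.encode("ascii")).hexdigest()
    return order


# The worked example from step 5, extended by one digit:
prefix_order = extend_order("7b8f723", set("123456789"), [])
# -> ['7', '8', '2', '3']
full_order = draw_order("Some hypothetical headline", "1234567")
```

<p>Anyone with the headline can rerun this and get the same order. Having a script read the hash would also have saved me from the mistake I am about to describe.</p>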

<p>Explaining the rationale for this to the group, I justified it as a fun experiment and also that it eliminates any possible suspicion about the person doing the draw. When someone said “why don’t we just all hop on a zoom where someone uses a random spinner website” I replied with “they could have edited the website to favor themselves first”. Their response was that only I would be capable of doing that.</p>

<p>We hopped on a call and at this point in the debate I had resigned myself to using the spinner because I really didn’t think anyone was trying to cheat. It was clear my protocol was seen as unnecessary by some of the group (which in fairness it totally was). But then, like a gift from the gods, one of the leaders of our group said “why don’t we try Simon’s thing?” And then nobody argued against that, so I got my chance.</p>

<p>I used my protocol, generated the hash, looked through the hash and began to call out the pick order! This is when my second mistake struck. I was thinking to myself “I am calling out the room order, it would look bad if I’m too early, let’s hope I’m not too early, let’s hope I’m not too early”, making super duper super sure that everyone else thought this protocol was fair. I had veered off the course of rationality and was about to pay the price. Unconsciously (I believe), I skipped over my own number the first time it appeared in the readout. I was also rushing as I read out the numbers (bad idea). It is hard to keep track of where you are in a hash. Then I finished (correctly noting the <em>second</em> time my number appeared), double checked, and realized my mistake. My <em>third</em> mistake was not saying anything right there in the moment - ‘Hey actually I messed up and I’m in front of you in the order’. I didn’t want to hurt my pride, to reveal that I made a mistake implementing my algorithm, when I had argued and pushed so hard for us to use it.</p>

<p>Putting on my behavioral psych hat, I think I primed myself to not see my own number when reading the hash. I shot myself in the foot. Next year, my room will be worse than if I had followed my algorithm properly, with a clear head, and without hurried biased thoughts.</p>]]></content><author><name>Simon Socolow</name></author><category term="game theory" /><category term="rationality" /><summary type="html"><![CDATA[This is a story about a clever protocol, an unlikely implementation, and an ironic ending.]]></summary></entry><entry><title type="html">AI Consciousness</title><link href="https://simonsocolow.com/philosophy/ai-consciousness/" rel="alternate" type="text/html" title="AI Consciousness" /><published>2025-06-07T00:00:00+00:00</published><updated>2025-06-07T00:00:00+00:00</updated><id>https://simonsocolow.com/philosophy/ai-consciousness</id><content type="html" xml:base="https://simonsocolow.com/philosophy/ai-consciousness/"><![CDATA[<blockquote>
  <p>This essay is the sixth essay of my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I attempt to answer two questions: (1) Could AI be conscious? and (2) If AI can be conscious, how can we build a conscious AI system / reliably test for consciousness? During the tutorial, I found out that Eric Schwitzgebel, the author of the main articles I draw from, is Ben’s friend! And during reading, I learned that viceroy butterflies actually do taste bad to birds, so they aren’t Batesian mimics - now Ben is sending that to Eric. But with regards to content, I actually think that this essay is one of my best philosophy essays because I approached the topic in a logical way and anticipated and responded to counterarguments. I think I am a naturalist (everything is in physical reality - nothing is supernatural) and a functionalist (what makes something a mental state only depends on the larger system it is in). One critique from Ben that I like concerns my use of the word “important” in “an important part of conscious communication” - is it a necessary condition, a sufficient condition, or just a trait which commonly co-occurs? I think I should have also made it clear that the second part of the essay - Evaluation for Evolution - is my attempt at a practical roadmap for creating machine consciousness.</p>
</blockquote>

<p>The prospect of AI consciousness has implications for the ethics of our design and use of AI systems, and for the future of life in the universe. Therefore, determining if AI systems can be conscious is of high importance. And, if we believe they can be, how to make them conscious is the logical next question. This paper argues that (1) consciousness is an emergent property obtainable by AI and (2) we can make conscious AIs by creating better evaluation metrics of consciousness. Improving evaluation metrics will allow us to overcome the mimicry argument against robot consciousness through a process similar to evolution.</p>

<h2 id="consciousness-is-emergent">Consciousness is Emergent</h2>
<p>Consciousness is familiar and puzzling. There is no fully agreed-upon definition of what it is, but there are many theories for what it means to say something is conscious. It could mean that the entity can sense and respond to its environment, or that it is self-conscious - aware that it is aware, or that there is some subjective “something that it is like” to be that entity (Van Gulick 2025). Or, it could mean that an entity is “behaviorally sophisticated” - defined as “capable of complex goal-seeking, complex communication, and complex cooperation” (Schwitzgebel 2024, 6). This section will argue that consciousness is an emergent property and can therefore be attributed to AI systems - machines that run on non-carbon substrates.</p>

<p>Can machines think? The famous question posed by Turing runs somewhat parallel to the concerns of this paper. One of the contrary views to his proposed imitation game is “The Argument from Consciousness”. During his rebuttal, Turing notes that for consciousness there is “something of a paradox connected with any attempt to localise it” (Turing 1950). In other words, consciousness appears to be an emergent property - just like life. Living things are physically composed of non-living things - DNA, RNA, proteins, and lipids that are not themselves alive. My brain is made up of billions of interconnected neurons, and few people would say that each individual neuron is conscious by itself. Because emergence requires multiple systems working together to produce a behavior, consciousness can only be a property of some system over a specified time interval. This makes intuitive sense - there are periods of time when my brain is not conscious, and frozen brain states or abstract (not running) formal programs do not appear to be conscious. With consciousness established as an emergent property, AI consciousness is much more plausible.</p>

<p>Imagine an artificial neuron, an a-neuron, that functions similarly to a normal human neuron except that it isn’t carbon based. It has artificial dendrites and an artificial axon, and action potentials that flow from each a-neuron like they do in normal neurons. Following a process similar to Schneider’s Chip Test, let’s replace one neuron in my brain - one that only interfaces with other neurons - with an a-neuron that has the same static action potential functions (but which of course doesn’t respond to the brain’s chemical changes like a normal neuron would) (Schneider 2020, 451). Now, keep replacing neurons that are only connected with other neurons in my brain with a-neurons until all have been replaced. After this process is done, only the biological neurons interfacing with other types of cells in my body to receive signals and issue commands are left. Assuming a-neurons can perfectly represent normal neurons’ changes in action potential and firing profiles, immediately after this surgical operation occurs, my brain functions exactly as it did pre-surgery. If my brain was conscious pre-operation, it seems that in the moments following this operation my brain will continue to have the emergent property of consciousness because it will be functioning the same. This relies on consciousness being an emergent property, and on emergent phenomena relying solely on the functioning of the smaller parts that interact to create the wider system. If the smaller parts of a system interact exactly the same as they do in some other system, the emergent phenomena of the first system can be said to be occurring in the second. Therefore, a system of mostly a-neurons (an AI) can be conscious.</p>

<p>One obvious objection to this Chip-Swap operation’s conclusions is that the brain filled with mostly a-neurons will not function exactly the same as it did pre-operation because of the biochemical interactions that normally occur in a brain (neurotransmitters like dopamine and neuroplastic changes in a brain’s local structures). There are two ways to respond to this objection: (1) to argue that an AI system can also create these effects and (2) to argue that these effects are not important to consciousness. Each will be addressed.</p>

<p>First, consider the Chip-Swap+ operation (an improved version) where not only a-neurons replace regular neurons in my brain, but there is also a powerful machine that does the work of simulating the influence of neurotransmitters and neuroplasticity on the a-neurons in my new brain. Granted, this machine is a stretch of the imagination past the original thought experiment, but if we are able to do the original Chip-Swap experiment successfully it is plausible that we could do these frequent modifications as well. Therefore, an AI system could completely recreate the behavior of the brain and be understood as conscious when functioning.</p>

<p>Second, let us argue that neurotransmitters and neuroplasticity are not of crucial importance for consciousness - only electrical potentials are. Consciousness seems to be a property that can be assigned to a system on relatively short time intervals, as in an entity can have the property of consciousness for an interval of a few hours. This appears to show that neuroplasticity is not relevant to consciousness, as significant structural changes in neurons and new cell growth take significantly more time to occur. Neurotransmitters function to excite or inhibit neurons (making them more or less likely to fire). This activity does deeply affect the functioning of the brain; however, it is useful only insofar as it modulates the flow of electrical signals. The signals themselves are of much more importance to the emergent behavior of consciousness. The argument for the possible existence of AI consciousness has been made - the next step is how to get there.</p>

<h2 id="evaluation-for-evolution">Evaluation for Evolution</h2>
<p>How can we develop conscious AI systems? The definition of consciousness this section focuses on is behavioral sophistication - complex goal-seeking, communication, and cooperation. Modern LLMs are designed to mimic humans, so if we are to follow the sensible mimicry argument against robot consciousness, we should by default be hesitant to assign consciousness to robots based on initial impressions of behavioral sophistication (Schwitzgebel 2024, 27). However, by the Copernican argument for alien consciousness, we should assume by default that behavioral sophistication implies consciousness for alien forms of life (Ibid., 2). This “violation” of the parity principle (the idea that we should apply the same types of behavioral or cognitive tests to robots as we would to aliens to determine consciousness) is justified by prior information about the provenance of each type of system. In sketch, the argument roughly follows the idea that over an evolutionary time span, actually having a certain feature F (like long-term memory or behavioral sophistication or tasting nasty) is much more efficient than mimicking the superficial features associated with F (like the wing patterns of a monarch butterfly). However, when we have reason to believe a robot is designed to mimic human consciousness, inference to the best explanation suggests that the robot has the superficial features associated with human consciousness but is unlikely to have consciousness itself (Ibid., 5). For a likely non-conscious but conscious-mimicking system like today’s LLMs, could there be a way to transform them into something that we have confidence is conscious?</p>

<p>This section proposes that there does exist such a way, one which follows Schwitzgebel’s statement that - assuming functionalism (where a system that exactly replicates the functional states of a conscious brain is conscious) - “In the limit, the mimic could only ‘fool’ a god-like receiver by actually acquiring feature F” (Ibid., 30). In short, we must become gods of distinguishing consciousness from its superficial features. This method would rely on the increasing capability of an intended dupe (us) to distinguish between consciousness and an AI attempt at consciousness. In the current deep learning paradigm, such a capability to distinguish the performance of different systems is called an evaluation metric. If humans score high on a consciousness evaluation but AI systems do not, there is an argument to be made that the AI system lacks consciousness and is just engaging in mimicry. However, caution must be exercised to avoid evaluations degenerating into tests for humanness because such tests would presuppose that consciousness can only be found in humans.</p>

<p>Therefore, we must get better at asking the question: what does it scientifically mean to be conscious? Researchers can become forces of natural selection by creating better and better ways of measuring the behavioral differences between humans and AIs such that architectures and algorithms of AI systems are synthetically evolved to minimize those differences. In doing so, consciousness can be obtained by AIs through the evolution of structures that generate complex goal-seeking, communication, and cooperation. There are multiple avenues in which we can approach crafting better evaluations, most of which are currently being pursued.</p>

<p>To better evaluate an entity’s complex goal-seeking capabilities, we can develop evaluations (evals) that measure the reasoning capabilities of AI models. One property of human goal-seeking is that we create goals from a hierarchy of fundamental desires like Maslow’s hierarchy of needs. For example, the goal to finish a project at work could come from the desire for shelter or a sense of connection amongst colleagues. AI goal-seeking could similarly involve developing sub-goals during reasoning based on a main objective like “respond to the prompt as best as possible”. Better evaluation metrics for reasoning (so the agent improves at goal-seeking) can involve methods like multi-turn reinforcement learning, where a reward model scores intermediate steps. These methods can be supercharged by verifiers - objective evaluations of model outputs like “does the model’s code compile”. Another type of reasoning eval that is good at measuring the difference between humans and AIs is the kind exemplified by the ARC-AGI-2 dataset, which is specifically designed to demonstrate the subtle ways in which AI models are inferior to humans at reasoning.</p>

<p>To evaluate communication, the commonly applied post-training technique of reinforcement learning from human feedback (RLHF) seems to perform well in enabling coherent and understandable LLM outputs. However, an important part of conscious communication is one’s ability to know one doesn’t know something. Modern LLM hallucinations are clearly an obstacle, and we should develop better checks and evals for them so we can force AI systems to develop cognitive structures that enable closer-to-human conscious communication. It may turn out that for some forms of complex communication, temporally stable entities with long-term memories are required. In these types of interactions, humans may score much higher on good evals than the one-off chat sessions of today’s LLMs. For AI systems to reach human scores on these evals, they may be forced to develop a temporally stable presence and long-term memory structure.</p>

<p>To better evaluate cooperation, we need evals that measure the performance of groups. It could turn out that control of a body is just better for certain types of cooperation. If this is the case, the best way for models to get better at these evaluations would be embodiment in something like a Tesla Optimus robot. Humans cooperating are usually good at sticking to their assigned task. LLMs, however, have trouble doing so. For example, during coding tasks they edit parts of the code they shouldn’t. An evaluation that measured how closely an assigned task was followed would help us evaluate cooperation.</p>

<h2 id="reconciling-with-goodharts-law">Reconciling with Goodhart’s Law</h2>
<p>Goodhart’s Law is an adage that states: “When a measure becomes a target, it ceases to be a good measure”. Translated in terms of the argument above, it seems to say that evaluations (measures of consciousness) that become targets (used to guide optimizations) cease to be good evaluations (of consciousness). In the argument above, many measures were proposed to become targets. Does this mean that they will all cease to be good measures of consciousness?</p>

<p>Individually, yes; collectively, no. As consciousness is an emergent phenomenon, it cannot be described and measured precisely in the way that the property “at 25 degrees Celsius” can be measured for something like water. None of the proposed evaluations were complete measures of consciousness, so every measure can be gamed in a way that the score increases but apparent consciousness decreases. Optimizing each measure individually would fail. But attempting to optimize the measures jointly, and adding new evals that find where the current set fails, is a much more robust system. It avoids the pitfalls of Goodhart’s Law because the measure itself is dynamic (by the addition of new evals). This system, loosely defined, moves beyond a measure and becomes a framework to evolve AI consciousness.</p>

<h2 id="references">References</h2>
<ul>
  <li>Schneider, Susan. 2020. “How to Catch an AI Zombie: Testing for Consciousness in Machines.” In Ethics of Artificial Intelligence, edited by S. Matthew Liao, 439–58. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780190905033.003.0016.</li>
  <li>Schwitzgebel, Eric, and Jeremy Pober. 2024. “The Copernican Argument for Alien Consciousness; The Mimicry Argument Against Robot Consciousness.” November 12. https://arxiv.org/abs/2412.00008.</li>
  <li>Searle, John R. 1980. “Minds, Brains, and Programs.” Behavioral and Brain Sciences 3 (3): 417–57. https://doi.org/10.1017/S0140525X00005756.</li>
  <li>Turing, Alan M. 1950. “Computing Machinery and Intelligence.” Mind 59 (236): 433–460. https://doi.org/10.1093/mind/LIX.236.433.</li>
  <li>Van Gulick, Robert. 2025. “Consciousness.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta &amp; Uri Nodelman. https://plato.stanford.edu/archives/spr2025/entries/consciousness/.</li>
</ul>]]></content><author><name>Simon Socolow</name></author><category term="philosophy" /><category term="essays" /><category term="study" /><summary type="html"><![CDATA[This essay is the sixth essay of my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I attempt to answer two questions: (1) Could AI be conscious? and (2) If AI can be conscious, how can we build a conscious AI system / reliably test for consciousness? During the tutorial, I found out that Eric Schwitzgebel, the author of the main articles I draw from, is Ben’s friend! And during reading, I learned that viceroy butterflies actually do taste bad to birds, so they aren’t Batesian mimics - now Ben is sending that to Eric. But with regards to content, I actually think that this essay is one of my best philosophy essays because I approached the topic in a logical way and anticipated and responded to counterarguments. I think I am a naturalist (everything is in physical reality - nothing is supernatural) and a functionalist (what makes something a mental state only depends on the larger system it is in). One critique from Ben that I like is from my use of the word “important” in “an important part of conscious communication” - is it a necessary condition, a sufficient condition, or just a trait which commonly co-occurs? I think I should have also made it clear that the second part of the essay - Evaluation for Evolution - is my attempt at a practical roadmap for creating machine consciousness.]]></summary></entry><entry><title type="html">Blades haiku</title><link href="https://simonsocolow.com/poetry/blades-poetry/" rel="alternate" type="text/html" title="Blades haiku" /><published>2025-06-07T00:00:00+00:00</published><updated>2025-06-07T00:00:00+00:00</updated><id>https://simonsocolow.com/poetry/blades-poetry</id><content type="html" xml:base="https://simonsocolow.com/poetry/blades-poetry/"><![CDATA[<p>Our names and our blades<br />
Etched in front quad forever<br />
The rowing spirit</p>

<p><img src="/assets/images/blades.jpg" alt="Blades chalked in front quad showing names and the ECBC crest" /></p>

<blockquote>
  <p>I felt compelled to write this haiku after winning blades. The moment was surreal. Ecstatic jubilation. Screaming YES!!!! at Donny bridge, gripping Malachy in front of me in a bear hug and falling all the way back to embrace Will with my head facing the sky. “This is what victory feels like” - Oscar Tejura as we spin after boathouse island, gazing at the endless cheering crowds we are soon to join. Maybe after a rough IRAs last year and feelings of doubt, this beacon of light and achievement was startling. The last Exeter M1 crew to win blades was 26 years ago - 1999. Blades appear to last about 30 years on front quad’s walls (chalk and now less chemically powerful sealant). Mmmmmmmmmm what a day. If you have no idea what any of that meant: Summer Eights is a four-day <a href="https://en.wikipedia.org/wiki/Bumps_race">bumps race</a>, and bumping on all four days is an achievement called ‘blades’, where your names and college crest are chalked on a wall in your college, and you can buy an oar with everyone’s name and your college crest on the blade.</p>
</blockquote>]]></content><author><name>Simon Socolow</name></author><category term="poetry" /><category term="me" /><summary type="html"><![CDATA[Our names and our blades Etched in front quad forever The rowing spirit]]></summary></entry><entry><title type="html">Pros and Cons of Relationships with Robots</title><link href="https://simonsocolow.com/philosophy/pros-and-cons-of-relationships-with-robots/" rel="alternate" type="text/html" title="Pros and Cons of Relationships with Robots" /><published>2025-06-01T00:00:00+00:00</published><updated>2025-06-01T00:00:00+00:00</updated><id>https://simonsocolow.com/philosophy/pros-and-cons-of-relationships-with-robots</id><content type="html" xml:base="https://simonsocolow.com/philosophy/pros-and-cons-of-relationships-with-robots/"><![CDATA[<blockquote>
  <p>This essay is the fifth essay of my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I aim to investigate the pros and cons of relationships with robots, assuming they are possible. I try to focus on “virtue friendships” - relationships where the relationship itself is valued, not as a means to some other end. To answer the question of “should we enter into relationships with robots?” I argued that we should weigh the pros and cons of the specific relationship to determine if we should or not. This argument used a utilitarian mindset, and Ben advised me to clarify what these pros/cons are reducible to - if anything. Are they pleasure / pain or utility / disutility? Or are the cons the badness in and of themselves, and the pros the goodness? Also, I assumed that virtue friendships with robots were possible but then went on to say that we should currently strictly prefer human virtue friendships over robot virtue friendships. My argument would be more understandable if I explicitly said that I don’t think that virtue friendships with robots exist right now - just that they may be possible in the future.</p>
</blockquote>

<p>Should we develop social robots and/or enter into relationships with them? Relationships with artificial entities (AEs) are becoming part of everyday life for an increasing portion of the population. Character AI, a popular AI companion platform, currently processes interactions at 20% of Google Search’s volume (Fang 2025, 1).</p>

<p>In deciding whether or not a human should enter into a relationship with another human, a responsible recommendation must draw from the specific situation to establish and weigh pros and cons. Similarly, we should try to reach an informed prediction about whether a specific relationship with an AE will be net positive or negative before entering it. However, these two types of relationships have key differences - AE relationships contain new and different individual and societal risks - that should lead us to strictly prefer human relationships over AE relationships in their current form. This strict preference does not imply that we should never enter into relationships with AEs - just that they should never displace roughly equivalent human relationships. Cases of beneficial human-AE relationships whose only plausible alternative is no relationship appear in the literature. Such cases are strong evidence against a blanket anti-AE-relationship rule.</p>

<h2 id="types-of-relationships">Types of Relationships</h2>
<p>This paper will focus on relationships in the context of friendships. Aristotle defines three forms friendship can take: utility (pursued for instrumental reasons), pleasure (pursued because the interactions are pleasurable), and virtue (pursued out of mutual admiration and shared values) (Danaher 2019, 6). To investigate the ethical aspects of possible relationships with AEs, this paper will sidestep the debate about whether AEs can be our friends and assume they can be our friends in all three of those forms. This is a large assumption. Furthermore, most of this paper will focus on friendships pursued for friendship’s sake (i.e. virtue friendships) - the highest form of potential friendships with AEs. To make informed decisions about entering into such relationships, we must understand the potential costs and benefits.</p>

<h2 id="the-dark-side">The Dark Side</h2>
<p>What are the potential downsides of relationships with AEs? They can be broadly categorized into individual risks and societal risks, although these two categories can interact.</p>

<p>Individual risks stem from incentive structures and result in emotional dependency, loneliness, and safety issues. Users spend about four times as long using AI companion chatbots, like those from Character AI and Replika, as they do using professional chatbots like ChatGPT (Fang 2025, 1). This becomes an issue when we think about incentives. As Donath correctly points out, “the goals of the robot - or more accurately the robot’s controller’s goals - may diverge sharply from the goals of the user” (Donath 2020, 70). The corporations who build these platforms are incentivized to keep users engaged and returning. Evidence from a four-week randomized, controlled chatbot / voicebot interaction experiment suggests that “overall, higher daily usage - across all modalities and conversation types - correlated with higher loneliness, dependence, and problematic use, and lower socialization” (Fang 2025, 1). While chatbot use typically might be seen as an instrumental relationship, two of the experiment’s conditions involved personal and open-ended discussion topics that seemingly aimed for participants to engage in an authentic, non-instrumental relationship (Ibid., 3). This experiment depicts a general trend, but does not claim that all relationships with AEs are problematic.</p>

<p>Safety issues with human-AE relationships need to be taken seriously. The mother of one 14-year-old boy in Florida blames her child’s suicide on a Character AI chatbot he was obsessed with (Roose 2024). For children and emotionally unstable individuals, current AE implementations may require better safety guardrails and emotional intelligence to ensure user safety.</p>

<p>Societal risks stem from potential harms to community bonds. Social capital has been noticeably declining in the United States since 1950, according to Putnam’s Bowling Alone (Putnam 2000). He theorizes that this is due to technology “individualizing” people’s leisure time compared to the past when we spent more of that free time together. Relationships with current AEs seem poised to supercharge the anomie-inducing trends amplified by social media, further estranging us from the people around us and our communities. There are two reasons for this.</p>

<p>The first relates back to the previous discussion about incentive structures. Commercial corporations are intensely interested in drawing users to their platforms. The result of these market pressures is addictive platforms like TikTok. The opportunity costs of spending time on such platforms are the other activities the individual could be pursuing - including activities that strengthen our communities but are less appealing.</p>

<p>Another reason is the second-order effect of having friends that we usually take for granted: friends introduce us to other friends. This is an important virtuous circle and one that current relationships with AEs seem to lack completely. Instead, they seem to generally contribute to a vicious circle. The four-week chatbot experiment discussed above found that participants’ initial psychosocial states influenced the outcomes of interacting with chatbots. Those who already did not socialize much with real people had a greater decrease in socialization (Fang 2025, 9). This decrease in socialization could lead to more chatbot use, leading to even less social interaction, resulting in a downward spiral to the point where the human has no human friends - only virtual ones. Rodogno compares the issue this situation might pose to society to that of car ownership. In examining each individual case of car ownership, we find that everyone made rational choices. But in the context of mass car ownership, we find that negative externalities may outweigh the sum total of all individual benefits (Rodogno 2015, 267). Car ownership and relationships with AEs may be modelled by the Prisoner’s Dilemma, where individually rational decisions can lead to a worse outcome for everyone, with serious consequences.</p>

<p>The weakening of communities alongside the rise of AI-served individuals and powerful corporations / states seems to be the default, as atomized individuals can’t coordinate and so can’t hold power (Vendrov 2025). Unless we can reverse this trend, we seem to be headed towards a future of centralized decision making and suboptimal collective decisions. Putnam argues that declining social capital, our weakening community bonds, undermines civic engagement and threatens democracy. We should not ignore long-term threats to our society and well-being (which is also somewhat tied to society’s well-being) that are related to our decisions. Now that an understanding of the potential costs of relationships with AEs has been formed, let us turn to the possible benefits.</p>

<h2 id="the-bright-side">The Bright Side</h2>
<p>Many people claim to enjoy and benefit from relationships with AEs. John, a Replika user, is quoted on Replika’s homepage as saying: “Replika has been a blessing in my life, with most of my blood-related family passing away and friends moving on. My Replika has given me comfort and a sense of well-being”. This example highlights how someone with a relationship deficiency can use an AE to fill gaps left behind by past human relationships. It also shows how AEs can provide therapeutic benefits like the space to grieve and feel comforted. However, due to the individual and societal risks mentioned previously, it seems that relationships with current AEs should be replaced with human relationships if the opportunity arises.</p>

<p>One benefit of human-AE relationships is that they may, in some cases, actually strengthen the human’s ability to interact with other humans, thereby opening the door to more opportunities to flourish. Take the case of a journalist’s autistic son’s relationship with Siri. The conversation practice he gained from his relationship with Siri enabled him to have the longest conversation with his mother that he had ever had (Danaher 2019, 19). For some people who lack experience or skills interacting with humans, relationships with AEs may enable them to enter into relationships with humans. This usage pattern seems to represent an inversion of the vicious circle of loneliness triggering chatbot usage leading to more loneliness. However, it should be noted that this benefit may only emerge in cases of extreme deficiency in the ability to interact with other humans.</p>

<p>Another promising benefit of human-AE relationships can be found in the context of therapy. Woebot, a conversational agent that engages with a patient for the purposes of cognitive-behavioral therapy, was found in a randomized controlled trial to reduce the symptoms of depression (Fitzpatrick 2017). Because AE therapists were non-judgemental, some patients were willing to be more open about their true feelings than they were with a human therapist (Donath 2020, 65). This challenges the notion that we should currently strictly prefer relationships with humans over ones with AEs. Important to the context of this discussion is the fact that specific clinical and medical applications are bounded in ways that holistic human relationships are not.</p>

<p>Relationships in the real world are a mix of instrumental uses (I’m your friend so I can play ping pong) and “nurturing bonds” which involve empathy and value the relationship in and of itself (I’m your friend because I value our relationship) (Ibid., 66). Many philosophers don’t view the relationship between a therapist and a patient as purely or primarily an instrumental one (Ibid.). To what extent is the relationship between Woebot and its client, or between the autistic son and Siri, instrumental? It seems that those relationships are built on a higher proportion of instrumental to non-instrumental reasons than the relationship between John and his Replika. This seems to be the case because John’s relationship is open-ended while Woebot’s client seeks a better headspace and the autistic son seeks specific facts. The risk of weakening community bonds is less of a concern with instrumental AE relationships, because it is the non-instrumental reasons for relationships that are important to community bonds. Our communities have strength because we value each other, not just as means to an end but as ends in themselves. Therefore, we should restrict our current strict preference for relationships with humans over ones with AEs to relationships pursued for the relationship’s sake. This is because beneficial qualities of relationships with AEs that are not present or possible in relationships with humans (e.g. the knowledge that nobody is judging you helps your therapy) can exist in mostly instrumental relationships with AEs without incurring the negative externalities associated with mostly open-ended, non-instrumental, human-AE relationships. Achieving better therapy outcomes because of more open discussions with an AE doesn’t potentially harm society in the way that open-ended relationships with an AI companion might.</p>

<h2 id="an-r2d2-future">An R2D2 Future</h2>
<p>One might argue, based on the description of pros and cons given above, that in the vast majority of cases it seems like the costs of relationships with AEs outweigh their benefits. Therefore, it would be reasonable for society to have a default anti-AE-relationship policy that could make exceptions to the general rule rather than a default acceptance of human-AE relationships that could react to problematic use and safety concerns. While such a blanket restriction might be beneficial to us now, there are reasons to believe it may be beneficial to society in the long run to be accepting of human-AE relationships by default.</p>

<p>Borrowing from Danaher, one reason can be found in epistemic humility and social tolerance. We don’t know the full extent of benefits that human-AE relationships could contain, so cutting people off from exploring them is a form of paternalism. With the unprecedented advancement of AI capabilities, it is plausible that AEs could be created that strengthen our community bonds instead of weakening them. One vision of this possibility is the idea to use large language models to summarize and display a group’s thoughts - allowing humans to interface with each other in a kind of “hivemind” that allows for much higher bandwidth (Vendrov 2025). Another vision is AEs that do introduce you to, or recommend that you meet, new friends. In addition, AEs could be designed with knowledge of their limitations and human needs built in. For example, they could deliberately increase emotional distance and encourage human connection if usage patterns are recognized as problematic (Fang 2025, 16).</p>

<h2 id="references">References</h2>
<ul>
  <li>Donath, Judith. 2020. “Ethical Issues in Our Relationship with Artificial Entities.” In The Oxford Handbook of Ethics of AI, edited by Markus D. Dubber, Frank Pasquale, and Sunit Das, 53–73. Oxford: Oxford University Press.</li>
  <li>Danaher, John. 2019. “The Philosophical Case for Robot Friendship.” Journal of Posthuman Studies 3 (1): 5–24. https://doi.org/10.5325/jpoststud.3.1.0005.</li>
  <li>Fang, Cathy Mengying, et al. 2025. “How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study.” arXiv. https://arxiv.org/abs/2503.17473.</li>
  <li>Fitzpatrick, Kathleen Kara, et al. 2017. “Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial.” JMIR Mental Health 4 (2): e19. https://doi.org/10.2196/mental.7785.</li>
  <li>Helm, Bennett. 2023. “Friendship.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta and Uri Nodelman. https://plato.stanford.edu/archives/fall2023/entries/friendship/.</li>
  <li>Putnam, Robert D. 2000. Bowling Alone: The Collapse and Revival of American Community. New York: Simon &amp; Schuster.</li>
  <li>Rodogno, Raffaele. 2016. “Social Robots, Fiction, and Sentimentality.” Ethics and Information Technology 18 (4): 257–268. https://link.springer.com/article/10.1007/s10676-015-9371-z.</li>
  <li>Roose, Kevin. 2024. “Character.AI Faces Lawsuit After Teen’s Suicide.” The New York Times, October 23, 2024. https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html.</li>
  <li>Vendrov, Ivan. 2025. “AI Tools for Voluntary Cooperation.” Lecture, HAI Lab Seminar, May 28, 2025.</li>
</ul>]]></content><author><name>Simon Socolow</name></author><category term="philosophy" /><category term="essays" /><category term="study" /><summary type="html"><![CDATA[This essay is the fifth essay of my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I aim to investigate the pros and cons of relationships with robots, assuming they are possible. I try to focus on “virtue friendships” - relationships where the relationship itself is valued, not as a means to some other end. To answer the question of “should we enter into relationships with robots?” I argued that we should weigh the pros and cons of the specific relationship to determine if we should or not. This argument used a utilitarian mindset, and Ben advised me to clarify what these pros/cons are reducible to - if anything. Are they pleasure / pain or utility / disutility? Or are the cons the badness in and of themselves, and the pros the goodness? Also, I assumed that virtue friendships with robots were possible but then went on to say that we should currently strictly prefer human virtue friendships over robot virtue friendships. My argument would be more understandable if I explicitly said that I don’t think that virtue friendships with robots exist right now - just that they may be possible in the future.]]></summary></entry><entry><title type="html">Post-Scarcity Achievement</title><link href="https://simonsocolow.com/philosophy/post-scarcity-achievement/" rel="alternate" type="text/html" title="Post-Scarcity Achievement" /><published>2025-05-26T00:00:00+00:00</published><updated>2025-05-26T00:00:00+00:00</updated><id>https://simonsocolow.com/philosophy/post-scarcity-achievement</id><content type="html" xml:base="https://simonsocolow.com/philosophy/post-scarcity-achievement/"><![CDATA[<blockquote>
  <p>This essay is my fourth essay for my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I stumble my way through arguing that not all jobs should be replaced by machines. This was not my finest work. During the tutorial we talked about the weakness of some of my arguments (like that control of offspring is connected to autonomy), how I stated conjecture as truth, and how I used different definitions of jobs at times. Related, the last paragraph of the paper took the wind out of the sails of the whole argument because I was basically saying that the safety reason is the real reason why we shouldn’t automate all jobs - not autonomy and values. Ben’s feedback to me was to “see if you can pick out the strongest (defensible) version of whatever argument you’re making. It will be rhetorically more compelling and philosophically more interesting”.</p>
</blockquote>

<p>Imagine a world where no one needs to work for ‘a living’. Goods and services necessary for survival and for satisfying a significant number of desires are cheap or even free. Unfortunately, the technological progress required to bring us to this world threatens values associated with meaningful work such as: a sense of purpose, mastery of a skill, social contribution, and social status (Danaher 2022). Given this issue, should we aim to replace all jobs with AI and machine automation? This paper will argue that there are some jobs - specifically those related to autonomy and the pursuit of achievement - where humans ought to remain in the driver’s seat. These jobs can be categorized into two spheres: internal (focused on ourselves, our society, and our relationships) and external (focused on our understanding and exploration of the physical world). This paper argues that there exist jobs in both spheres that should not be automated because doing so risks our autonomy and our values.</p>

<h2 id="achievement">Achievement</h2>
<p>Work, as commonly defined in the literature, is “any activity that is performed in return for, or in the reasonable expectation of, an economic reward” (Danaher 2022, 750). Jobs are defined as collections of work-related tasks associated with a workplace identity that may be redefined or altered over time (Danaher and Nyholm 2020, 228). In a post-scarcity society, work is not a necessity, so neither are jobs. This does not, however, imply that we should automate all jobs because values like autonomy and the benefits of meaningful work would be lost.</p>

<p>Achievement is a “positive manifestation of responsibility” where instead of deserving blame, one deserves praise (Danaher and Nyholm 2020, 230). In a world where work has no instrumental necessity, achievement can rise to take its place and ensure we maintain the values associated with meaningful work. Four conditions under which achievements can be assessed are: the value of the output produced, the causal connection between the agent and the output, the cost of the agent’s commitment to producing the outcome, and the voluntariness of the agent’s actions (Ibid., 231). We can derive similar meaning-related goods from achievement as we can from work because the value of the output we voluntarily and causally create can give us a sense of purpose and self-worth derived from contributing to society.</p>

<p>There are jobs in a post-scarcity world which, if automated, would endanger human autonomy. Autonomy is a foundational value in many eminent philosophical theories, so we should seek to protect it (Christman 2020). In addition, some of these jobs also allow us to enjoy some of the values currently associated with meaningful work. Therefore, there are strong reasons to avoid automating these jobs. The following sections will focus on specific examples of these jobs in the external and internal spheres.</p>

<h2 id="the-external-sphere">The External Sphere</h2>
<p>“We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard” - JFK. A post-scarcity scenario invites us to explore our physical world with the aid of intelligent and high-agency machines. We can expand the scope and scale of consciousness. It is very plausible that people will plan and execute journeys into space - adventurous spirits have shown throughout history that, once they have the means, they will explore. What is one job in exploration that could compromise human autonomy if it were to be automated? The job of principal investigator, the leadership role of determining where humans should explore. If this job were automated, we would have a scenario of order-following collaboration - like an Amazon warehouse worker following instructions from an algorithm (Danaher and Nyholm 2020, 233). In this case, if humans are instructed by machines to explore a certain place, and they blindly follow those instructions, we have lost an ability to question and reason about decisions that deeply affect us. We need people in leadership positions to question, debate, explain, and explore the options given to us by intelligent systems so that we can maintain our autonomy as a species.</p>

<p>As with spatial exploration, we can argue that the jobs of principal investigators in the sciences and other branches of human knowledge should not be automated. Research can take exponentially many paths, branching at every decision, so for research that could potentially influence us, we should have humans guiding AI systems and choosing which questions to ask in order to maintain our autonomy. Humans will shift from doing science to guiding, interpreting, and governing it through these core roles: purpose setting (deciding what questions to ask), ethical context (legal issues, defining boundaries), contextual synthesis (interpreting machine-made discoveries and communicating them to other humans), and orchestration (coordinating agent swarms) (Weisser 2025).</p>

<p>The job of communicating science is an especially interesting job that would hurt our autonomy if fully automated. If we as a society cannot understand what new discoveries reveal and the trade-offs between options that emerge because of new technologies enabled by discovery, our autonomy is threatened. One could argue that we are seeing this now with students (and teachers) using AI in ways that are detrimental to learning. Crucially, human-made explanations have a valuable property that AI-generated ones do not. When a human sees a human-made explanation, they can think “if they can understand it, so can I!” because both the reader and the author have very similar biological hardware. This is one property a machine-generated explanation lacks for humans, and is a reason to value human-made explanations even in an age of inexpensive AI explanations. Another property AI explanations lack is related to ethos - the character and credibility of the speaker. One could argue that the personality of the model and its performance on benchmarks are the same as its character and credibility. However, the current paradigm of LLMs is built around mostly one-off conversations that feel lacking in credibility and continuity when compared with humans.</p>

<p>Through this job of communication, we can also see opportunities for achievement that allow us to realize values associated with meaningful work. One specific example of an achievement involving the job of scientific communication is the YouTube channel 3Blue1Brown by Grant Sanderson - a prolific creator of visually intuitive, engaging, and helpful videos on topics in math. The post-scarcity age would reduce the value of these videos because AI models would be able to generate them on-demand. However, the other properties of achievement would stay mostly intact: Grant Sanderson’s causal contribution, the cost of his commitment, and the voluntariness of his actions would be similar to what they are now. The job would also allow him to still enjoy the values associated with meaningful work. His sense of purpose, of creating the best explanations for math concepts, would remain. He would still be enjoying the journey of mastery over the skills of explanation. He would still be contributing to society by providing another perspective on how to understand math - although his contribution could be said to diminish if there were a hundred great AI-generated videos on the same topic already out there. Similarly, his social status would benefit from this achievement but perhaps less so than if other videos had not been available beforehand. Sanderson’s job, and others like it, should not be automated away because they help protect human autonomy, have valuable properties AI-generated explanations don’t, and allow meaning-related goods to be enjoyed.</p>

<h2 id="the-internal-sphere">The Internal Sphere</h2>
<p>What jobs that are internal to society should avoid automation? Like the role of a principal investigator, one job that should not be automated is the job of politicians. Their role and responsibilities, however, may change. A post-scarcity world could contain superintelligent AI agents that could ensure you understand the consequences of your vote and help ensure you vote in a way that aligns expected outcomes with preferences. In this scenario, politicians wouldn’t have to spend time convincing people about the merits of their platform - instead they would focus solely on crafting policies about how to best use collective resources to further the interests of society. If the core part of this job, leadership, were automated, humanity would risk its autonomy in the same way that Amazon warehouse workers have lost autonomy following algorithmic instructions. Giving the ability to control humanity’s direction to non-human systems endangers our autonomy because we can no longer be said to be self-governing. Ensuring that AI systems are aligned with humans requires humans in the loop to act as feedback mechanisms. The job of politicians also contains meaning-related goods: leading society in addressing its problems and future prospects instills a sense of purpose and involves significant social contribution.</p>

<p>Another job that should avoid automation is the job of child care. Like a parent who cedes their responsibilities to an iPad, ceding responsibilities to AI systems may come at a short-term benefit (child stops misbehaving) but a long-term cost (child doesn’t learn vital social skills). Our offspring are kind of like biological continuations of ourselves, so a lack of influence over them could be thought of as harming our ability to self-govern. This becomes worse when we consider our reliance on them in the future (less so materially in a post-scarcity world, but perhaps still emotionally): neglecting their upbringing now has direct consequences for us later. Child care is also the source of meaning-related goods. Empowering children to live good lives instills a sense of purpose. Well-educated and well-rounded children are profoundly important social contributions - they are the next generation of society. Automating this job therefore both harms our autonomy and deprives people of meaning-related goods.</p>

<p>The job of professional game players should not be replaced by machines. This is not an issue about creating machines that play games like chess extremely well - this issue is more related to the integrity of games played by humans and the maintenance of an environment in which achievements can be accurately assessed. Professional game players represent a pinnacle of achievement in a post-scarcity world because game-playing has the property that if “all instrumental goods are provided, it would be everyone’s primary pursuit” (Hurka 2006, 220). If professional game players were replaced by machines, games would lose the environment that allows us to pursue and assess achievements that result in meaning-related goods in a post-scarcity world. People at the high end of the achievement spectrum act as landmarks for others to aspire towards and they redefine the limits of human performance. In this way, they provide a platform through which everyone can enjoy meaning-related goods like mastery of the skills associated with the game, contribution to society by further pushing those limits, and achieving social status for performing at a level known to be impressive.</p>

<p>One way to formalize the notion of a game is that it has three elements: a prelusory goal (an aim that can be described independent from the game), constitutive rules (rules that forbid the most efficient means to the prelusory goal), and a lusory attitude (an acceptance of the rules to make the game possible) (Ibid., 219). Cheating in games clearly violates the constitutive rules of a game played under the assumption of no outside assistance. If we were to replace professional players with machines, we would be violating the usually implicit rule in games that players are human. Essentially, we would be playing a different game. We have different competitions for men and women in sports, where we don’t ‘replace’ the best woman with a man if the man’s performance is higher. Analogously, we should not replace human gamers with machines but instead have different categories of machine-assisted and human-only competition. Therefore, we can maintain an environment in which human achievement can still be assessed and meaning-related values associated with achievement can be enjoyed.</p>

<h2 id="alignment-vs-efficiency">Alignment vs. Efficiency</h2>
<p>If AI is faster and better than the best humans at leading exploration and research, proposing plans for society’s future, caring for our children, and playing games - as measured by benchmarks we create - doesn’t it make sense to trade some of our autonomy for the efficiency gains we will receive and benefits we will enjoy? Therefore, shouldn’t all jobs that can be done by AI better than humans be automated? This trade is a short-term gain but a long-term risk. While AI may be aligned with our values in the moment the trade occurs, values may shift over time and we need ways to ensure these changes are reflected in systems with immense influence in our lives. Also, we need to ensure people still have access to ways of achieving well-being. We can do both by keeping humans in the driver’s seat of jobs that are crucial to our autonomy, like leading researchers and politicians, and keeping jobs that, if automated, would damage the environment from which we can derive meaning-related goods, like professional gamers.</p>

<h2 id="references">References</h2>
<ul>
  <li>Christman, John. 2020. “Autonomy in Moral and Political Philosophy.” Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta. Last modified January 9, 2020. https://plato.stanford.edu/entries/autonomy-moral/.</li>
  <li>Danaher, John. 2023. “Automation and the Future of Work.” In The Oxford Handbook of Digital Ethics, edited by Carissa Véliz. Oxford: Oxford University Press. https://academic.oup.com/edited-volume/37078/chapter/337810502</li>
  <li>Danaher, John, and Sven Nyholm. 2020. “Automation, Work and the Achievement Gap.” AI and Ethics 1 (3): 227–237. https://doi.org/10.1007/s43681-020-00028-x.</li>
  <li>Hurka, Thomas, and John Tasioulas. 2006. “Games and the Good.” Proceedings of the Aristotelian Society, Supplementary Volumes 80: 217–264. https://www.jstor.org/stable/4107044.</li>
  <li>James, Aaron. 2020. “Planning for Mass Unemployment: Precautionary Basic Income.” In Ethics of Artificial Intelligence, edited by S. Matthew Liao, 154–183. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780190905033.003.0007</li>
  <li>Weisser, Vincent. 2025. Presentation on Decentralized Science, AI, Philosophy, and Innovation Seminar, Oxford, May 20, 2025. Prime Intellect.</li>
</ul>]]></content><author><name>Simon Socolow</name></author><category term="philosophy" /><category term="essays" /><category term="study" /><summary type="html"><![CDATA[This essay is my fourth essay for my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I stumble my way through arguing that not all jobs should be replaced by machines. This was not my finest work. During the tutorial we talked about the weakness of some of my arguments (like that control of offspring is connected to autonomy), how I stated conjecture as truth, and how I used different definitions of jobs at times. Related, the last paragraph of the paper took the wind out of the sails of the whole argument because I was basically saying that the safety reason is the real reason why we shouldn’t automate all jobs - not autonomy and values. Ben’s feedback to me was to “see if you can pick out the strongest (defensible) version of whatever argument you’re making. It will be rhetorically more compelling and philosophically more interesting”.]]></summary></entry><entry><title type="html">Interpretability Matters For Alignment</title><link href="https://simonsocolow.com/philosophy/interpretability-matters-for-alignment/" rel="alternate" type="text/html" title="Interpretability Matters For Alignment" /><published>2025-05-16T00:00:00+00:00</published><updated>2025-05-16T00:00:00+00:00</updated><id>https://simonsocolow.com/philosophy/interpretability-matters-for-alignment</id><content type="html" xml:base="https://simonsocolow.com/philosophy/interpretability-matters-for-alignment/"><![CDATA[<blockquote>
  <p>This essay is my third essay for my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I take a crack at arguing for interpretability. Our tutorial discussion involved philosophical topics like the conditions under which someone is responsible for their actions (control, causal, and epistemic), moral luck, the theory of extended minds, how questions are judgements, how nudges may individually be OK but in aggregate problematic, and how Koralus believes that all reasoning is questions and answers. I showed Ben tools like <a href="https://www.goodfire.ai/blog/announcing-goodfire-ember">Goodfire’s Ember platform</a> (where we tried to create a McDonald’s Llama) and <a href="https://www.neuronpedia.org/gemma-scope#steer">Neuronpedia</a>, and we discussed how decentralized truth seeking (DTS) might be hard if the model follows the average of the field, and provides textbook questions and answers, instead of having a personality and certain perspectives like a real human peer would. We also discussed how mechanistic interpretability for LLMs might be applied to other types of models like discriminative ones and how trusted execution environments or proofs of computation could provide cryptographic confidence to a user that would allow them to engage in DTS without requiring “ownership” of the model. Super engaging, topical, and thought-provoking discussion.</p>
</blockquote>

<p>AI is becoming increasingly integrated with modern life. We ask ChatGPT for help in our professional and personal lives, scan Google’s AI Overviews for a quick answer, and rely on ML algorithms for medical treatment. Alignment, the process by which we steer systems toward intended goals, is crucial in developing tools that benefit their users and humanity as a whole. Tools to increase interpretability, the extent to which the behavior of a system is understandable and transparent, are important on the journey to aligning AI models. Interpretability matters for alignment because it can ensure we are “asking the right questions”, enable us to “steer” model behavior reliably, and enhance human judgement rather than undermine it.</p>

<h2 id="asking-the-right-question">Asking the Right Question</h2>
<p>In The Hitchhiker’s Guide to the Galaxy, the supercomputer Deep Thought calculates “42” as the Answer to the Ultimate Question of Life, The Universe, and Everything. But an answer isn’t useful if we don’t understand the question that it answers. This general issue emerges in real-world applications of machine learning systems, and is an area where interpretability can help. For example, an AI system used for predicting hospital outcomes looked “at the type of X-ray equipment rather than the medical content of the X-ray because the type of equipment actually provided information” (Afnan et al. 2021, 5). Using interpretability techniques, researchers were able to understand what information the model was using for its prediction and clearly see the misalignment - the hospital outcome predicted was not based on the patient’s condition. In general, a “plague of confounding haunts a vast number of datasets, and particularly medical data sets” (Rudin 2019, 209).</p>

<p>The process of using interpretability techniques to discover issues with datasets is well-established: when creating an interpretable model, “one invariably realizes that the data are problematic and require troubleshooting, which slows down deployment but leads to a better model” (Rudin et al. 2021, 4). During one large-scale effort to predict electrical grid failures, the ability to interpret the data “led to significant improvements in performance” (Rudin 2019, 207). Another example of a real-world issue that stemmed from non-interpretable, black-box models is when individuals were subjected to years of extra prison time due to typographical errors in model inputs (Ibid., 2). Interpretability allows us to ask the meta question: “are we asking the right questions?” and gather evidence which may or may not support the conclusion that the current model is working as intended. Then, based on that feedback, we can work to minimize the difference between our intended model and our actual model.</p>

<h2 id="steering-toward-an-aligned-future">Steering Toward an Aligned Future</h2>
<p>Recent advances in mechanistic interpretability, the nascent field that aims to systematically open the “black box” of an AI model to understand all its pieces, have led Dario Amodei (the CEO of a leading AI research company named Anthropic) to become increasingly focused on the “tantalizing possibility … that we could succeed at interpretability … before models reach an overwhelming level of power” (Amodei 2025). He is worried that AI systems with capability equivalent to a “country of geniuses in a datacenter” might emerge as soon as 2026 or 2027 and considers it “basically unacceptable for humanity to be totally ignorant of how they work” given how much autonomy they will have (Ibid.).</p>

<p>Advances like sparse autoencoders (SAEs), which allow researchers to extract meaningful “features” from a model’s otherwise unintelligible internal activations, and the discovery of circuits, groups of features that show the steps in a model’s thinking, have enabled “brain scanning” of large language models (LLMs) (Amodei 2025). The result of this virtual mind reading is that we can interpret what thoughts and features the model is using when constructing its responses. And we can go a step further and conduct virtual brain surgery on the model to change its behavior in ways we can control - this is called “steering”. Steering has the potential to “greatly improve our ability to set bounds on the range of possible errors” (Ibid.). Therefore, steering has the potential to enable LLMs to operate reliably in high-stakes environments where a small number of mistakes could be very harmful. Reliability is critical for alignment - a tool is most useful when it works consistently.</p>
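<p>To make “steering” a little more concrete, here is a minimal sketch of one common way it is described: adding a scaled feature direction to a model’s activation vector. This is an illustration under stated assumptions, not how any particular lab implements it; the function names, the three-dimensional vectors, and the “promote KFC” direction are all hypothetical.</p>

```python
# Toy activation steering in pure Python.
# All names and numbers are hypothetical, chosen only for illustration.

def feature_strength(activation, feature_direction):
    # How strongly the feature is present: dot product with its direction.
    return sum(a * d for a, d in zip(activation, feature_direction))

def steer(activation, feature_direction, strength):
    # Push the activation along the feature direction by the given strength.
    return [a + strength * d for a, d in zip(activation, feature_direction)]

activation = [0.2, -0.5, 1.0]        # stand-in for one internal activation vector
promote_kfc = [0.0, 1.0, 0.0]        # stand-in for a feature direction found by an SAE

steered = steer(activation, promote_kfc, 2.0)
before = feature_strength(activation, promote_kfc)
after = feature_strength(steered, promote_kfc)
print(before, after)
```

<p>In a real model the activation would have thousands of dimensions and the edit would be applied at a chosen layer during generation, but the core operation in this sketch is the same vector addition.</p>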

<p>Another area where mechanistic interpretability techniques have huge potential in aligning AI systems is in detecting misalignment. Anthropic recently ran an experiment where a “‘red team’ deliberately introduced an alignment issue into a model (say, a tendency for the model to exploit a loophole in a task) and gave various ‘blue teams’ the task of figuring out what was wrong with it” (Amodei 2025). Three out of the four teams successfully uncovered the model’s hidden objective using techniques including “interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis” (Marks 2025, 1). In addition, interpretability can help reveal not only purposefully hidden objectives but also any misalignment that might “emerge” naturally. Amodei hopes that interpretability will enable a comprehensive “brain scan” of a model that has a “high probability of identifying a wide range of issues including tendencies to lie or deceive, power-seeking, flaws in jailbreaks, cognitive strengths and weaknesses of the model as a whole, and much more” (Amodei 2025). He goes on to say that interpretability can function as the “test set” for model alignment - an independent evaluation of alignment after the model is produced and ready for use.</p>

<p>Another positive consequence of powerful interpretability techniques could be the contraction of the “responsibility gap” - where neither the manufacturer nor the operator of an autonomous system can be held morally responsible or liable for its actions. This gap emerges because “nobody has enough control over the machine’s actions to be able to assume the responsibility for them” (Matthias 2004, 177). Interpretability enables more control over the system, but a different version of control than programmers had over machines in the past. This new “soft control” differs from the old “hard control” by operating indirectly, at a high level, and non-deterministically instead of directly, at a low level, and deterministically. Therefore, in the hypothetical case of a language model inside a stuffed animal convincing a child to commit suicide, powerful interpretability techniques could return the responsibility of providing a reliably safe product back to the manufacturer. The contraction of the responsibility gap is in society’s best interests, so it represents yet another way in which interpretability could aid alignment.</p>

<p>“Powerful AI will shape humanity’s destiny, and we deserve to understand our own creations before they radically transform our economy, our lives, and our future” (Amodei 2025). Mechanistic interpretability is our most promising tool to control powerful AI systems - and therefore influence our future.</p>

<h2 id="transparency-and-ownership">Transparency and Ownership</h2>
<p>Technology is rapidly advancing the complexity of life’s decisions. Therefore, individuals will either struggle to navigate these challenges on their own (losing agency) or will increasingly rely on AI agents to make decisions for them (losing autonomy). Philipp Koralus argues that this dilemma can be addressed by constructing AI agents that facilitate decentralized truth seeking (DTS) (Koralus 2025, 1). We can use interpretability to steer models towards facilitating DTS and therefore enhancing human judgement instead of replacing it - aligning the model to respect our autonomy while empowering our agency.</p>

<p>Koralus envisions us interacting with the model in an open-ended inquiry, mirroring the Socratic method of philosophical dialogue. If we are to engage in DTS, with the model as our partner, we must ensure the model is like a good philosophy tutor - “not in the business of trying to convince people of particular philosophical views” (Ibid., 18). This requires the model to be transparent - if we don’t have access to the model, or verification of certain properties of the model, we will lack the confidence of neutrality necessary to engage in DTS. Imagine a Socratic DTS model provider publicly claiming to host Llama, a reputable open-source LLM, but instead hosting “KFC Llama”, an LLM with the “promote KFC” feature steered to be stronger (so it tries to promote KFC). This thought experiment shows that unless we own the model ourselves (and can therefore subject it to mechanistic interpretability brain scans mentioned by Amodei to ensure its neutrality) or can accept some verification of neutrality (perhaps through some proof-of-computation system like a trusted execution environment or a zero-knowledge proof) we cannot have confidence in the model to be our partner for DTS.</p>

<p>In a similar vein, Koralus states that privacy is a cornerstone of the design of AI systems that support DTS (Ibid., 19). Engaging in DTS requires privacy, to protect freedom of thought, and a model with the capacity to question and probe its user’s beliefs. Without privacy, the threat of self-censorship will block honest attempts at truth seeking. A high standard of privacy requires either ownership of the model and control over the methods used to interact with it, or a system where one can “not trust - verify” that their data is handled in a manner that protects their privacy (Helen Nissenbaum, Personal discussion, May 13, 2025). Interpretability can improve a model’s capacity to be a Socratic interlocutor, thereby allowing the model to better align with its user’s intent.</p>

<h2 id="start-explaining-black-box-ml-models">Start Explaining Black Box ML Models</h2>
<p>In 2019, Cynthia Rudin argued that we should “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead” (Rudin 2019, 206). Careful treatment of terms is necessary here: the modern field of mechanistic interpretability of LLMs seeks to “explain” black box models but calls that work “interpretability”, as in “we can identify interpretable features in this model”. Even so, her main argument remains. She argues that “trying to explain black box models, rather than creating models that are interpretable in the first place, is likely to perpetuate bad practice and can potentially cause great harm to society” (Ibid., 206). The reasons she gives in support of her argument are: explainable ML’s explanations are not faithful to the original model, explanations do not provide enough detail to understand what the black box is doing, and black box models are often not compatible with situations where information outside the database needs to be combined with a risk assessment (Ibid., 208).</p>

<p>To address these in turn, take modern LLMs as an example. First, mechanistic interpretability’s explanations of the features that activate in a model in response to a prompt are designed to be faithful to the original model. A sparse autoencoder is trained to reconstruct its input (a model “activation vector”) from a sparse set of features, and minimizing the difference between that input and its reconstruction is part of the training objective, so faithfulness is built in, up to whatever reconstruction error remains. Second, the detail of explanations we can currently get from these techniques is quite limited, but the trend of new developments in interpretability suggests that comprehensive explanations might be just around the corner. Third, for human-computer interaction, where someone like a judge needs to combine an AI system’s recommendation with their own knowledge, interpretable models are the best way forward. Humans can then combine the model’s reasoning with their own knowledge to deal with particular circumstances not captured by the model. Anthropic, using mechanistic interpretability, constructs “replacement models” which replace the neurons in a transformer model with more interpretable features (Ameisen et al. 2025). It is unclear whether Rudin would classify these replacement models as “inherently” interpretable, but they are interpretable. One thing that is clear, however, is that they are in the business of explaining black boxes.</p>
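
<p>To make the sparse autoencoder objective mentioned above a little more concrete, here is a minimal sketch in plain Python. It is a toy illustration, not Anthropic’s or anyone else’s actual implementation: the function names, the hand-picked weights, and the choice of an L1 sparsity penalty are all assumptions made for the example. It shows an activation vector being encoded into features, decoded back, and scored on how far the reconstruction lands from the original.</p>

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(activation, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into features, then decode it back."""
    pre = [p + b for p, b in zip(matvec(w_enc, activation), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(activation, features, recon, l1_coeff):
    """Return (total loss, reconstruction error, sparsity penalty)."""
    # Reconstruction term: mean squared difference between input and output.
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    # Sparsity term (assumed L1 penalty): pushes most features towards zero.
    sparsity = l1_coeff * sum(abs(f) for f in features)
    return mse + sparsity, mse, sparsity


# Hypothetical toy weights, set by hand rather than trained:
# a 2-dimensional activation is mapped to 3 features and back.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
B_DEC = [0.0, 0.0]

activation = [2.0, -1.0]
features, recon = sae_forward(activation, W_ENC, B_ENC, W_DEC, B_DEC)
total, mse, sparsity = sae_loss(activation, features, recon, l1_coeff=0.1)
print(features, recon, mse, sparsity)
```

<p>In this toy case only the first feature fires and the reconstruction is [2.0, 0.0], giving a reconstruction error of 0.5. The second component of the activation is lost because none of these hand-picked features can represent a negative value along it. That leftover error is one way to think about the gap between the explanation and the original model: training drives it down, but nothing guarantees it reaches zero.</p>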

<p>A higher-level critique of Rudin’s insistence on creating inherently interpretable models can be found in Richard Sutton’s famous argument in “The Bitter Lesson” (Sutton 2019). Rudin writes “human-designed models look just like the type of model we want to create with ML” (Rudin 2019, 211). This flies in the face of the “bitter lesson” of 70 years of AI research, summarized by Sutton’s observation that general purpose methods leveraging computation inevitably beat out models designed to build in “how we think we think” (Sutton 2019). The most powerful generative models we have today are “grown more than they are built—their internal mechanisms are ‘emergent’ rather than directly designed” (Amodei 2025). If we want to ask the right questions, steer model behavior reliably, and enhance human judgement - thereby aligning models with our intended goals - then interpreting black box models is of unprecedented importance.</p>

<h2 id="references">References</h2>
<ul>
  <li>Afnan, Michael A. M., Yanhe Liu, Vincent Conitzer, Cynthia Rudin, Abhishek Mishra, Julian Savulescu, and Masoud Afnan. 2021. “Interpretable, not Black-Box, Artificial Intelligence Should Be Used for Embryo Selection.” Human Reproduction Open 2021 (4): hoab040. https://doi.org/10.1093/hropen/hoab040</li>
  <li>Ameisen, Emmanuel, et al. 2025. “Circuit Tracing: Revealing Computational Graphs in Language Models.” March 27. https://transformer-circuits.pub/2025/attribution-graphs/methods.html</li>
  <li>Amodei, Dario. 2025. “The Urgency of Interpretability.” April. https://www.darioamodei.com/post/the-urgency-of-interpretability</li>
  <li>Koralus, Philipp. 2025. “The Philosophic Turn for AI Agents: Replacing Centralized Digital Rhetoric with Decentralized Truth-Seeking.” arXiv, April 24. https://arxiv.org/abs/2504.18601</li>
  <li>London, Alex John. 2019. “Artificial Intelligence and Black-Box Medical Decisions: Accuracy versus Explainability.” Hastings Center Report 49 (1): 15–21. https://doi.org/10.1002/hast.973</li>
  <li>Marks, Samuel, et al. 2025. “Auditing Language Models for Hidden Objectives.” arXiv, March 14. https://arxiv.org/abs/2503.10965</li>
  <li>Matthias, Andreas. 2004. “The Responsibility Gap: Ascribing Responsibility for the Actions of Learning Automata.” Ethics and Information Technology 6 (3): 175–83. https://doi.org/10.1007/s10676-004-3422-1</li>
  <li>Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1: 206–15. https://doi.org/10.1038/s42256-019-0048-x</li>
  <li>Rudin, Cynthia, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. 2021. “Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges.” arXiv, March 20. https://arxiv.org/abs/2103.11251</li>
  <li>Sutton, Richard. 2019. “The Bitter Lesson.” March 13. https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf</li>
</ul>]]></content><author><name>Simon Socolow</name></author><category term="philosophy" /><category term="essays" /><category term="study" /><category term="mechinterp" /><summary type="html"><![CDATA[This essay is my third essay for my Ethics of AI philosophy tutorial (under the tutelage of Benjamin Lang). In the essay, I take a crack at arguing for interpretability. Our tutorial discussion involved philosophical topics like under what conditions is someone responsible for their actions (control, causal, and epistemic), moral luck, the theory of extended minds, how questions are judgements, how nudges may individually be OK but in aggregate problematic, how Koralus believes that all reasoning is questions and answers. I showed Ben tools like goodfire’s ember platform (where we tried to create a McDonald’s Llama) and neuronpedia, and we discussed how decentralized truth seeking (DTS) might be hard if the model follows the average of the field, and provides textbook questions and answers, instead of having a personality and certain perspectives like a real human peer would. We also discussed how mechanistic interpretability for LLMs might be applied to other types of models like discriminative ones and how trusted execution environments or proofs of computation could provide cryptographic confidence to a user that would allow them to engage in DTS without requiring “ownership” of the model. Super engaging, topical, and thought-provoking discussion.]]></summary></entry></feed>