Songs of Theatre and Robots – How AI Voices Cope in Theatre Settings

Marius-Alexandru Teodorescu*

Abstract

As synthetic voices become increasingly sophisticated, their presence in contemporary theatre raises urgent questions about authenticity, embodiment, and emotional labor. This research investigates how AI-generated vocal technologies are used to perform emotionally charged texts on stage, proposing a critical framework for assessing the affective and dramaturgical implications of disembodied voice performance. The study offers a comparative analysis of four leading AI voice tools—ElevenLabs, Descript Overdub, Resemble.ai, and Voice.ai—examining their technical capabilities, emotional nuance, and theatrical usability. Through practice-based research, it explores both live and recorded performances that integrate synthetic voices as characters, narrative agents, or scenographic elements, including digital theatre voiceovers, documentary revoicings, and posthuman monologues. Beyond the technical, the article addresses pressing ethical and aesthetic concerns: Who owns a voice? What are the implications of algorithmically reviving the voices of deceased actors? Can synthetic vocality evoke genuine empathy, or does it risk uncanny alienation? Engaging with voice theory (Dolar, Connor), affect studies, and posthuman dramaturgy, this paper argues that synthetic voice is not merely a tool, but a performative agent—capable of disrupting and expanding how theatre negotiates presence, labor, and authenticity in the age of AI.

Keywords: synthetic voice, AI in theatre, Disembodied performance, posthuman dramaturgy, emotional labor, voice technology, affect, authenticity

Introduction

In the last five years, commercial voice synthesis has moved with startling speed from experimental laboratories into everyday interfaces. Theatre has begun to appropriate these tools not merely as backstage utilities for announcements or dubbing, but as active agents in the composition of presence on stage. At the same time, scholarship on theatre and performance has only started to register the aesthetic and ethical consequences of treating synthetic voices as performers rather than as neutral media. This article proposes that contemporary AI voice platforms – ElevenLabs, Descript Overdub, Resemble, and Voice.ai – should be understood as performative agents whose affordances, glitches, and limitations are fundamentally dramaturgical questions rather than purely technical ones.

My argument emerges from my dual position as theatre director and researcher, organized around what I call the five splits of theatrical voice. For much of theatre history, voice functioned as the audible extension of a present, laboring body: breath turned into story inside the architectural instrument of the theatre. The introduction of the microphone in the late nineteenth century, the development of the mixing board in the mid-twentieth century, and recording and spatial audio technologies in the late twentieth century progressively separated what had seemed indivisible.

The last of these splits is introduced by AI voice production in performance. Theatre scholarship, however, still lacks detailed accounts of how specific AI platforms behave under theatrical pressure. Existing work on digital performance, sound studies, and voice theory has produced rich conceptual vocabularies for disembodiment, mediation, and posthuman dramaturgy. Yet there is comparatively little empirical research that compares commercial AI voice tools across different dramaturgical registers and languages, or that asks which platforms are “good” not in an abstract sense, but for particular kinds of plays, audiences, and ethical stakes.

While the article briefly revisits the earlier project Lear is Blind – a blindfolded adaptation of King Lear in which spatial audio and live processing already exploited the first four splits – the present focus is not on that production as such, but on the comparative behaviour of AI platforms when inserted into theatrical situations. Lear is Blind functions primarily as a pre-AI case study that unintentionally anticipated the fifth split: in one sequence, a performer who could not be present was replaced by pre-recorded lines triggered live, supported by tactile cues for the audience. When spectators failed to notice the substitution, it became clear that “authenticity” was being negotiated relationally, through multimodal cues rather than through verifiable biological origin. Synthetic voices radicalise this insight: they do not merely replay the absent actor, they can in principle replace the need for any originating human voice at all.

Against this backdrop, the central questions of the article are threefold.

First, what kinds of presence do contemporary AI voice platforms produce when evaluated under theatrical rather than purely technical criteria?

Second, how do their specific affordances and breakdowns map onto different dramaturgies and languages?

Third, what practical-ethical frameworks become necessary once these platforms are treated as quasi-performers whose labour, ownership, and legibility to audiences must be negotiated?

1. The Four Historical Splits of Theatrical Voice

The notion of splitting the theatrical voice begins from the apparently self-evident identity that dominated theatre for millennia: voice = body. In pre-electric theatres and in the architectural acoustics of Greek amphitheatres, the actor’s diaphragm functioned as the primary technology of projection (along with the mask’s amplification effect); vocal presence was inseparable from muscular labour and from the visible body that produced it. In theatre, the voice is a key instrument of meaning-making, despite Dolar’s assertion that “Hence we can put forward a provisional definition of the voice (in its linguistic aspect): it is what does not contribute to making sense” (24).

Each split names not merely a technical possibility but a new dramaturgical condition: a way in which sound design and direction can reconfigure the relation between bodies, architectures, and spectators.

1.1 Effort versus Effect

The first split emerges with the widespread use of the microphone in late-nineteenth and early-twentieth-century theatre. Before amplification, acoustic effect was tightly coupled to physical expenditure: to sound louder was to work harder, “your body paid the price.” Projection training codified this economy of effort; the actor’s authority in space depended on the disciplined control of breath and resonance.

With the microphone, this economy is broken. A whisper can now fill the room, while a shouted line can be tamed into intimate proximity. The relationship between vocal effort and acoustic effect becomes, in principle, arbitrary. From a dramaturgical perspective, this enables subtle work with exhaustion, fragility, or aging: a physically depleted character can retain sonic power, or conversely, a seemingly robust body can be rendered acoustically small. What I call the effort/effect split therefore designates the technological uncoupling of audible volume and timbral presence from visible bodily strain.

This split has become so naturalised that it often escapes notice; actors and audiences no longer register the disjunction as such. Yet it constitutes the first step in a longer process by which voice is detached from the constraints of the performing body.

1.2 Actual Voice versus Audible Voice

The second split depends on the mixing board and the development of real-time audio processing. Once the voice’s electrical signal can be routed through equalisation, compression, reverberation, and pitch-shifting, the audible voice heard by the audience no longer corresponds straightforwardly to the actual voice produced in the actor’s throat.

In practice, this allows for techniques such as live pitch-shifting to create multiple characters from a single performer: I pitch-shift an actor down an octave, and suddenly they’re playing their own father. From the audience’s perspective, a voice may be deepened, gendered, or texturised in ways that never actually existed in the air, but only in the mediated signal chain between microphone and loudspeaker.

Theoretically, this second split connects to Steven Connor’s account of ventriloquism as a cultural practice in which the voice is experienced as something “produced differently,” no longer a mere condition of being but an active, manipulable production. It also resonates with Mladen Dolar’s description of the voice as an “excess” in relation to speech and meaning, the element that seizes attention precisely because it seems to condense “quintessential humanity” even when poorly reproduced. In both cases, the voice appears as something that can be sculpted, intensified, or estranged from the body that ostensibly owns it.

For theatre practice, the actual/audible split becomes a compositional resource. Directors can design voices that no actor could physically produce, or that dislocate social markers such as age, gender, or species. The mixing board becomes an instrument of vocal dramaturgy, not just technical support.

1.3 Actual Location versus Perceived Location

The third split arises with multi-channel sound systems and what is now commonly called spatial audio. Here the disjunction is between the physical location of the speaking body and the perceived location of the voice. Surround sound allows a performer to stand in one place while their voice “walks the balcony,” or circulates around the auditorium.

In my own production Lear is Blind (MISE EN SCENE NGO, Cluj-Napoca, 2024), this split was central to the audience’s experience. This was a production created for blind audiences, in which sighted spectators were blindfolded and each held a tactile model of the stage; through carefully programmed speaker arrays, they could “see” the actors’ movements on their miniature stage solely through the trajectories of voices and sound effects. One performer, for instance, stood perfectly still while their voice “ran circles around the audience,” crafting the Fool as a figure who is “everywhere and nowhere.” This shifted the function of sound in the show, echoing Rea Dennis and Kate Hunter’s observation that “[the production] positions the role of the sonic in the theatre space beyond soundtrack or mood-setting device, re-casting it as agential partner with its own vibrant characteristics” (45).

*Lear is Blind* (MISE EN SCENE NGO, Cluj-Napoca, 2024). Blindfolded audience members holding tactile stage models during the performance’s spatial-audio sequences. Photo: MISE EN SCENE NGO

The location/perception split thus allows theatre-makers to reassign mobility: the voice becomes the element that travels, while the body may remain fixed or even absent from the visible playing space. For blind or blindfolded audiences, this split is not merely an effect but an entire mode of spatial cognition; for sighted audiences, it can destabilise the presumed anchoring of voice in a clearly localised body. These splits created an increasingly complex distinction between voice and speaker, producing varying degrees of acousmaticity. As defined by Dolar, “The acousmatic voice is simply a voice whose source one cannot see, a voice whose origin cannot be identified, a voice one cannot place” (70).

1.4 Time of Speaking versus Time of Hearing

The fourth split is produced by recording technologies and the widespread practice of integrating pre-recorded material into live performance. Here the difference lies between the moment at which a line is spoken and the moment at which it is heard. The rehearsal room can be “folded into” the performance through pre-recorded lines, loops, or soundscapes triggered live. “Yesterday’s rehearsal appears in tonight’s show,” and “the temporal link is broken.”

In Lear is Blind, this split emerged not only as an abstract possibility but as a practical solution: when one actor could not be present for all performances, her lines were recorded line by line and triggered from the booth, while fellow performers provided tactile contact at the appropriate moments to simulate her physical presence. The crucial observation is that nobody noticed. For the audience, the combination of recorded voice, live timing, and embodied touch sufficed to sustain the assumption of liveness.

This situation exemplifies what Muskaan describes as the relational and performative construction of authenticity, in which spectators infer the “realness” of a presence from affective and multimodal cues rather than from any direct access to biological origin (5893). The time-of-speaking/time-of-hearing split therefore opens a dramaturgical field in which temporal distance is aestheticised: voices can return as ghosts, memories, or recursive echoes of past performances.

DAW session architecture from *Lear is Blind* (MISE EN SCENE NGO, Cluj-Napoca, 2024), illustrating the layered organisation of prerecorded voice, music, spatial audio, and live-triggered cues used during performance. Screenshot by the author

2. The Fifth Split: Speaker versus Non-Person and the Performative Agent

The microphone, mixing board, spatial audio system and playback device all intervene in how a human voice is distributed and transformed. AI voice models intervene in who is understood to be speaking. This is what I call the fifth split: speaker versus non-person.

This new way of producing voice challenges the notion of performance articulated by Peggy Phelan, who argues that “Performance’s only life is in the present. Performance cannot be saved, recorded, documented, or otherwise participate in the circulation of representations of representations” (146–47). AI voice synthesis platforms such as ElevenLabs, Descript Overdub, Resemble and Voice.ai destabilise this division. They offer voice models: parametric entities that can be instructed to produce lines at any time, in multiple languages and affective registers, without fatigue, rehearsal or physical presence. These models may be trained on recordings of a specific actor, assembled from generic datasets, or marketed as “stock” synthetic voices. In all cases, what appears on stage is no longer the sound of “someone speaking” in the usual sense, but the output of a non-person – an artefact of machine learning procedures applied to vocal data.

The fifth split thus names the separation between the speaker as a dramaturgical function (the entity to whom we attribute lines, intentions, and sometimes responsibility) and the speaker as a legal or biological person. Synthetic voices can produce recognisably human prosody and emotion without being persons; conversely, human persons can find that their vocal traces have been abstracted into models that continue to “speak” long after recording has ended. This tension is particularly acute in theatre, where the politics of liveness, authorship and embodiment have historically depended on the assumption that performance involves the appearance of persons before others.

To conceptualise this new condition, I propose the term performative agent. A performative agent is an entity on stage that:

Produces performative effects – it delivers lines, cues, or sonic actions that shape the spectators’ experience and the unfolding of the performance;
Responds to some form of direction – its outputs can be modulated through textual prompts, parameters or triggers, even if this responsiveness is limited compared to a human actor’s;
Lacks legal and biological personhood – it is not a worker in the labour-law sense, cannot consent or refuse, and does not have a “private life” outside performance.

Performative agents include AI voice models, but may also encompass other algorithmic systems (e.g., generative lighting states or reactive soundscapes controlled by sensors). The key point is that they occupy a grey zone between actor and sound cue. Like actors, they can sustain characters, emotions and relationships over time; like cues, their behaviour is ultimately constrained by software and the person(s) who operate it.

Diagram of the five historical splits of theatrical voice proposed in this article. Diagram by the author

3. Methodology: A Comparative Torture Test of AI Voice Platforms

The empirical core of this article is a comparative torture test of four AI voice platforms widely used in creative industries: ElevenLabs, Descript Overdub, Resemble and Voice.ai. In 2021, Knight noted that “It’s still difficult to maintain the realism of a voice over the long stretches of time that might be required for an audiobook or podcast. And there’s little ability to control an AI voice’s performance in the same way a director can guide a human performer.” However, we found the situation has radically changed during the past four years. The goal was not to produce an exhaustive technical benchmark, but to observe how these platforms behave when subjected to theatrical pressures: demanding texts, multilingual requirements, emotionally charged scenes, and the practical constraints of rehearsal and small- to mid-scale production.

3.1 Research Design and Questions

The study was designed around three guiding questions:

How do contemporary AI voice platforms perform across distinct dramaturgical regimes?
Do they handle Shakespearean verse differently from psychological realism, absurdist dialogue or contemporary teen theatre?
How robust are these platforms when confronted with multilingual demands?
Can a single model plausibly sustain the same character across Romanian, English, French and Hungarian, or do certain languages consistently produce breakdowns, mispronunciations or flat prosody?
What platform-specific affordances and limitations emerge when we treat these tools as performative agents in rehearsal conditions?
How quickly can directors iterate, “direct” the synthetic voice through parameters and prompts, and integrate the results into existing sound setups?

The term “torture test” is used deliberately: rather than composing texts that flatter the platforms (short, neutral sentences in dominant languages), the experiment uses material that is structurally and affectively demanding, closer to what theatre practice actually requires.

3.2 Dramaturgical and Linguistic Corpus

The testing corpus comprises four dramaturgies, each represented by a scene or sequence already familiar from my directing work:

Shakespearean verse – a passage from King Lear, in which meter, rhetorical patterning and rapid emotional shifts pose challenges for timing and prosody.
Psychological realism – a scene from Hedda Gabler, focusing on subtext, hesitation and micro-modulations of tone rather than overt emotional display.
Absurd theatre – an extract from Eugène Ionesco’s Rhinocéros, where repetition, illogical escalation and tonal disjunctions test the platform’s ability to maintain coherence while delivering increasingly unhinged material.
Teen theatre – an original text written in contemporary colloquial language, including overlaps, incomplete sentences and slang, targeting audiences of adolescents and young adults, based on my production of “Bine, Bă” (Învață de mic NGO, Cluj-Napoca, 2025).

Each of these four dramaturgies was rendered in four languages: Romanian, English, French and Hungarian. These languages were chosen not to represent a global sample, but because they correspond to linguistic contexts in which I have worked as a director or collaborator and can therefore evaluate nuance, timing and accent with some precision. The combination yields a matrix of 16 text–language conditions (4 dramaturgies × 4 languages), each of which was run through each platform.

*Bine, Bă* (Învață de mic NGO, Cluj-Napoca, 2025), used as part of the teen theatre corpus in the AI voice platform tests. Photo: ÎNVAȚĂ DE MIC NGO

3.3 Platforms and Voice Configurations

The four platforms were selected because they represent different positions in the current ecosystem:

ElevenLabs as a leading cloud-based provider emphasizing lifelike prosody and multilingual support;
Descript Overdub and Resemble as tools integrated into broader audio-editing and dubbing workflows;
Voice.ai as a real-time voice-changing solution often used in gaming and streaming contexts.

For this experiment, we have chosen to only use stock voices. Using custom voices will be the base of a different study we intend to create in order to explore the expressive qualities of AI across a variety of platforms.

ElevenLabs voice-generation interface used in the Shakespearean verse test during the comparative platform evaluation. Screenshot by the author

3.4 Scoring Criteria and Evaluation Procedure

Each platform’s output for a given text–language condition was evaluated on a 0–5 scale along several criteria:

Intelligibility – clarity of articulation, absence of slurring or clipping, and recognisability of words for a fluent listener in that language.
Prosody and rhythm – sensitivity to punctuation, pauses, enjambment and meter (especially for Shakespeare), including appropriate variation in pitch and tempo.
Emotional modulation – capacity to render changes in affect (e.g., irony, anger, resignation) without resorting to caricature or remaining monotonous.
Multilingual robustness – handling of non-English phonetics, names and syntax; presence or absence of strong “English accent” colouring in other languages.
Responsiveness to direction – how much the output could be improved by adjusting platform-specific settings (e.g., “stability,” “style,” “emotion,” “speed”) or rephrasing prompts, without re-recording or external processing.

3.5 Procedure and Rehearsal-Like Conditions

To approximate rehearsal conditions, the tests were conducted under constraints typical of small- and mid-scale productions:

Limited time per scene – each text–language–platform combination was allocated a modest window for configuration and iteration, reflecting the reality that sound teams rarely have days to fine-tune a single cue.
Standard hardware and internet connections – no dedicated servers or specialised studio equipment were used beyond a reasonably powerful laptop and headphones.
No external post-processing – except where the platform itself offers integrated effects, the outputs were not subsequently polished in separate digital audio workstations. This isolates the platform’s own capabilities as a performative agent.

4. Findings: Platform Behaviour Across Dramaturgies and Languages

The torture test revealed not one hierarchy of “best to worst,” but a set of distinctive behavioural profiles—each platform exhibiting strengths and weaknesses that align differently with dramaturgical forms and languages. What follows synthesises these tendencies into a comparative analysis structured around the four dramaturgies (Shakespearean verse, psychological realism, absurd theatre, teen theatre) and the four languages tested (RO/EN/FR/HU). The aim is not to universalise these results but to articulate practitioners’ knowledge: what a director or sound designer might reasonably expect when deploying each tool as a performative agent in rehearsal or performance.

Our test came to prove again the rapidly developing quality of AI voices. With platforms such as ElevenLabs, our findings align with studies using fake-media in scams, such as the one done by Barrington et al., who found that “When the audio clip contained a real voice, participants were correct on average 67.4% of the time… for AI clones, 60.8% of the time.” Our experience is similar: some of the voices we generated could not really be distinguished by living actors.

Shakespearean verse proved the most revealing stress test. ElevenLabs was by far the strongest in English, handling meter and rhetorical flow with surprising accuracy, though it carried a noticeable “English substrate” when generating Romanian or Hungarian verse. Descript Overdub and Resemble struggled with metrical precision—Overdub produced prosodically flat lines, while Resemble was inconsistent, occasionally effective in short passages but unreliable in extended scenes. Voice.ai was largely unusable for verse due to its unstable, “underwater” timbre, suitable only for deliberately non-naturalistic or monstrous characters.

In psychological realism, Descript performed best, offering clear, restrained, documentary-like delivery. ElevenLabs could be effective when carefully moderated, but its default emotionality tended toward melodrama. Resemble remained erratic, and Voice.ai could not sustain naturalistic intimacy except in concepts embracing emotional numbness or estrangement.

Absurd theatre inverted these rankings. ElevenLabs’ polish flattened the jaggedness essential to Ionesco, while Descript’s coherence resisted fragmentation entirely. By contrast, Resemble’s glitches—stutters, pitch shifts, micro-delays—became unexpectedly productive, and Voice.ai’s unstable formants created voices well-suited to characters dissolving into chaos or animality.

In teen theatre, ElevenLabs handled English colloquial speech convincingly but faltered in Romanian and Hungarian slang. Descript’s neutrality worked for exposition but not for spontaneous banter. Resemble occasionally produced energetic, youthful tones, while Voice.ai remained credible only in deliberately stylised or “augmented” youth voices. Across all categories, English consistently performed best, French somewhat better than Romanian or Hungarian, and Hungarian produced the most frequent breakdowns.

4.1 Dramaturgical Implications

These findings suggest several key insights for theatre practice:

There is no “best platform”—only platforms that align with particular dramaturgies.
- Want polished verse? ElevenLabs.
- Want believable realism across languages? ElevenLabs, used with restrained parameters.
- Want glitchy posthuman absurdity? Voice.ai or Resemble.
Glitch is not failure—it is an aesthetic resource.
Voice.ai’s underwater resonance or Resemble’s inconsistency, can all be dramaturgically useful depending on context. As in the case of Net Art pioneers such as JODI (jodi.org, 1995) and Mark Napier (“Shredder,” 1998), whose work foregrounded browser glitches and rendering errors as aesthetic substance, the further refining of these tools will probably suppress glitches as aesthetic effects.
Multilingual productions must select tools strategically.
No platform performed well across all four languages; mixed-language work may require hybrid strategies.
Directability matters.
Platforms that respond well to parameter manipulation (especially ElevenLabs) function more like performers that can be directed.
Synthetic voices acquire character through limitations.
Weaknesses—accent bleed, formant instability, flatness—become the parameters through which a performative agent gains identity.

5. Ethics in Practice: Ownership, Consent, and Audience Address

The comparative torture test foregrounds an ethical problem that cannot be bracketed as “technical detail”: when AI voice platforms are treated as performative agents, they reorganise existing relations of ownership, labour, and responsibility in theatre. The fifth split—between speaker and non-person—does not eliminate the human from the scene; it redistributes human labour and vulnerability in ways that are frequently opaque. An ethics in practice must therefore be articulated not as an abstract code but as a set of situated decisions taken in rehearsal rooms, sound booths, and front-of-house communication.

5.1 Ownership and the Fragmented Voice

AI voice models operate on the premise that the voice can be captured, discretised and recombined as data. For theatre, this intensifies a long-standing ambiguity: who owns a performance? With synthetic voice, the question extends to who owns the model that can now perform indefinitely.

Traditional contracts assume that an actor’s labour is bounded in time and space: a rehearsal period, a run, possibly a recording for archival purposes. When a custom voice is trained on an actor’s recordings, the model extends this labour into the future in ways that are difficult to delimit. Even if a production team uses the model only for one show, the data typically resides on the platform’s servers, governed by terms of service that are rarely negotiated at the scale of independent or small institutional theatre.

From the perspective of performative agency, the synthetic voice is not simply “the actor’s voice.” It is the product of:

the actor’s vocal and interpretive work;
the sound designer’s recording and selection;
the platform’s proprietary architectures and training procedures;
the director’s textual and parametric decisions.

Any ethics of ownership must therefore contend with distributed authorship. Unqualified references to “my” synthetic voice risk erasing the infrastructural and corporate dimensions of its production; conversely, treating the platform as sole owner erases the actor whose data underlies the model. In practical terms, this suggests that contracts and crediting practices should distinguish between voice data, voice model, and voice performance, making explicit who controls each and under what conditions the model may be reused, retrained, or deleted.

5.2 Consent, Scope, and Data Governance

Consent in AI voice work cannot be reduced to the act of signing a release form. The fifth split implies that the actor’s voice may continue to “perform” without their awareness or presence. In this context, meaningful consent must at least include:

Scope – Clear articulation of what the recordings and the resulting model may be used for: this production only; future productions by the same company; broader research; commercial licensing, etc.
Duration and revocability – For how long may the model be stored and used? Can the actor later request its deletion, and what does deletion mean when datasets and backups are involved?
Granularity – Separate consent for using the model in performances versus marketing materials, trailers, or training of further models.
Risk awareness – Explanation, in accessible language, of potential misuse (e.g. voice impersonation, deepfakes) and of what protections the theatre company and platform put in place.

In rehearsal practice, consent must also be iterative, not a one-off event. As artistic needs evolve—for instance, extending a model to new languages or using it to generate entirely new lines not recorded by the actor—directors and sound designers should treat these as fresh consent questions, not as automatically covered by earlier agreements.

Where stock or “generic” synthetic voices are used, consent questions shift toward platform accountability and institutional choice: which companies are we implicitly endorsing and financially supporting through our subscriptions, and what are their policies on dataset provenance and opt-out mechanisms? For a sector that has long foregrounded workers’ rights and inclusion, these choices are not ethically neutral.

Vocal processing rack used for Edgar’s voice in *Lear is Blind* (MISE EN SCENE NGO, Cluj-Napoca, 2024), combining live modulation, spatialisation, and realtime audio effects during performance. Screenshot by the author

5.3 Deception, Transparency, and Audience Address

One of the most charged questions in synthetic voice performance concerns audience deception. Should spectators be told that some voices on stage are artificial? If so, when and how? This set of questions is based on Philip Auslander’s view on Liveness: that liveness is not an ontologically defined condition but a historically variable effect of mediatization. It was the development of recording technologies that made it both possible and necessary to perceive existing representations as “live” (58).

The torture test, combined with the earlier Lear is Blind experience, suggests that audiences are often unable to distinguish reliably between live and recorded or synthetic voice, especially when multiple sensory channels (touch, spatialised audio, live bodies) work together to sustain an impression of liveness. This raises at least three distinct ethical concerns:

Epistemic fairness – Do spectators have a right to know when they are listening to a non-person? If theatre is a space of negotiated fiction, is withholding this information a violation of trust or a legitimate artistic device? Salter asks, pointedly, about “machines performing? This is the age-old ethics question of why and how” and, crucially, for whom, a set of questions that becomes urgent once AI voices take on sustained dramaturgical roles (Salter 276).
Labour visibility – When synthetic voices replace or supplement human actors, does lack of disclosure obscure the labour that has been displaced or repackaged? For instance, marketing a production as featuring a particular star performer while using their synthetic voice in their absence may cross a line from dramaturgical ambiguity into straightforward misrepresentation.
Audience vulnerability – For some spectators, especially in documentary or autobiographical theatre, the human voice carries specific promises of testimony and accountability. Introducing synthetic voice without disclosure in such contexts may harm those for whom the performance’s claims to truth are not merely aesthetic.

Rather than imposing a universal prescription (“always disclose” or “never disclose”), I propose a principle ofcontext-sensitive transparency. The more a performance claims to present real testimony, or to trade on the presence of specific performers, the stronger the obligation to indicate where and how synthetic voices are used—whether in programme notes, pre-show announcements, or post-show discussions. Conversely, in explicitly speculative or fantastical work, non-disclosure may be more defensible, provided that the use of synthetic voice does not exploit real persons’ likeness without consent.

6. Conclusion: Composing with Performative Agents

This article has argued that contemporary AI voice platforms are best understood not as neutral tools but as performative agents inserted into a long history of splitting the theatrical voice. The first four splits—between effort and effect, actual and audible voice, actual and perceived location, time of speaking and time of hearing—were already reshaping theatre’s sonic and ontological landscape throughout the twentieth century. AI voice synthesis completes and destabilises this trajectory by introducing a fifth split:speaker versus non-person.

In this sense, we can consider AI voices as a form of mediated events, following Connor’s definition that says, “A voice is not a condition, nor yet an attribute, but an event. It is less something that exists than something which occurs” (4).

For future research and practice, several directions suggest themselves. Empirically, the torture test could be extended to additional languages, genres (opera, musical theatre, documentary performance) and audience groups (e.g. blind, visually impaired, Deaf and hard-of-hearing spectators), exploring how synthetic voice intersects with accessibility. Conceptually, the notion of the performative agent could be developed in dialogue with labour studies and posthumanist theory, examining how legal and union frameworks might adapt to acknowledge AI systems as part of the performance ecology.

Above all, the presence of AI voices on stage should not be framed solely as a threat to human performance, nor as a simple efficiency gain. It should be understood as a new compositional condition: theatre now unfolds in a field where some of the voices addressing us are non-persons whose capacities and limits we can neither fully control nor fully ignore. To work responsibly in this field is to learn how to compose with these agents—technically, ethically, and dramaturgically—so that the seams of their presence become sites of thought rather than blind spots in our collective imagination of what performance can be.

Bibliography

Auslander, Philip. Liveness: Performance in a Mediatized Culture. 3rd ed., Routledge, 2023.

Barrington, Sarah, Emily A. Cooper, and Hany Farid. “People Are Poorly Equipped to Detect AI-Powered Voice Clones.” Scientific Reports, vol. 15, 2025, article 11004. DOI: 10.1038/s41598-025-94170-3.

Connor, Steven. Dumbstruck: A Cultural History of Ventriloquism. Oxford UP, 2000.

Dennis, Rea, and Kate Hunter. “Embodying a Posthuman Dramaturgy of Place: Distance, Sound and Silence in Australian Contemporary Performance.” Critical Stages/Scènes critiques, no. 28, Dec. 2023.

Dolar, Mladen. A Voice and Nothing More. MIT Press, 2006.

JODI. jodi.org. 1995.

Knight, Will. “AI Voice Actors Sound More Human Than Ever—and Are Ready to Hire.” MIT Technology Review, 9 July 2021. Accessed 11 Nov. 2025.

Muskaan, M. “Synthetic Voice and the Philosophy of Agency, Authenticity, and Ethics in AI-Mediated Speech.” AI and Ethics, vol. 5, 2025, pp. 5889–5908. DOI: 10.1007/s43681-025-00820-7.

Napier, Mark. Shredder. 1998.

Phelan, Peggy. Unmarked: The Politics of Performance. Routledge, 1993.

Salter, Chris. Entangled: Technology and the Transformation of Performance. MIT Press, 2010.

Works Consulted

Ahmed, Sara. The Cultural Politics of Emotion. Edinburgh UP, 2004.

Bay-Cheng, Sarah, Jennifer Parker-Starbuck, and David Z. Saltz, editors. Performance and Media: Taxonomies for a Changing Field. U of Michigan P, 2015.

Braidotti, Rosi. The Posthuman. Polity Press, 2013.

———. Posthuman Knowledge. Polity Press, 2019.

Couldry, Nick, and Ulises A. Mejias. The Costs of Connection: How Data Is Colonizing Human Life and Appropriating It for Capitalism. Stanford UP, 2019.

Cramer, Florian. “What Is ‘Post-digital’?” Postdigital Aesthetics, edited by David M. Berry and Michael Dieter, Palgrave Macmillan, 2015.

Dixon, Steve. Digital Performance: A History of New Media in Theater, Dance, Performance Art, and Installation. MIT Press, 2007.

Freud, Sigmund. “The Uncanny.” 1919. The Standard Edition of the Complete Psychological Works of Sigmund Freud, vol. XVII, translated and edited by James Strachey, Hogarth Press, 1955, pp. [insert page range].

Kaye, Nick. Multi-Media: Video, Installation, Performance. Routledge, 2007.

Massumi, Brian. Parables for the Virtual: Movement, Affect, Sensation. Duke UP, 2002.

Mori, Masahiro. “The Uncanny Valley.” Translated by Karl F. MacDorman and Norri Kageki, IEEE Robotics & Automation Magazine, vol. 19, no. 2, 2012, pp. 98–100.

“Navigating the Challenges and Opportunities of Synthetic Voices.” OpenAI, 29 Mar. 2024. Accessed 11 Nov. 2025.

Respeecher. “How Respeecher’s AI Voice Synthesis Technology Perfected Singing in Emilia Pérez.” Respeecher Case Studies, 23 June 2025. Accessed 11 Nov. 2025.

Székely, Éva, et al. “Will AI Shape the Way We Speak? The Emerging Sociolinguistic Influence of Synthetic Voices.” Proceedings of the 2025 International Workshop on Spoken Dialogue Systems (IWSDS), ACL Anthology, 2025, pp. 357–68.

Thomaidis, Konstantinos, and Ben Macpherson, editors. Voice Studies: Critical Approaches to Process, Performance and Experience. Routledge, 2015.

Zuboff, Shoshana. The Age of Surveillance Capitalism. PublicAffairs, 2019.

*Marius-Alexandru Teodorescu is a theatre director, researcher, and former Assistant Professor at the Faculty of Theatre and Film, Babeș-Bolyai University. He holds a PhD in Theatre and Performance Arts and specializes in digital dramaturgy, scenography, and AI-assisted creation. His work has been presented in France, Israel, and Canada, and he has published in journals such as Studia UBB Dramatica and Theatrical Colloquia, or at prestigious publishers such as Presa Universitară Clujeană and Palgrave Macmillan. He is also active as a cultural manager and project coordinator, with a strong interest in the aesthetics of technology in contemporary stage practice.