AI generates audio riddled with sexism, racism and copyright infringements
An analysis of 680,000 hours of sound from repositories used to train generative AI reveals offensive and unauthorized content, as has already been found with texts and images
From melodies and voice transcription to assistance for the visually impaired, generative audio artificial intelligence (AI) has made great strides, to the point where it can now create high-quality audio. Even so, the data used to train these systems contains overlooked biases, offensive language and copyrighted material, a study claims. A team of researchers has carried out an exhaustive review of 175 speech, music and sound datasets and, in preliminary work, warns that they contain biased material similar to what has been found in text and image databases.
For a year, scientists led by William Agnew of Carnegie Mellon University in the United States studied 680,000 hours of audio drawn from seven platforms and some 600 research projects, analyzing its content, biases and provenance. The team obtained transcripts of speech and song lyrics, most of them in English. The files included voice recordings (sentences read aloud by people) and musical pieces from platforms such as AudioSet and Free Music Archive, as well as two million 10-second YouTube clips.
The researchers believe that if stereotypes are not properly addressed, models trained on these audio datasets can learn patterns that “perpetuate or even accelerate” prejudice and distorted conceptions of reality. Julia Barnett, a PhD candidate in computer science at Northwestern University and a collaborator on the study, says people are largely unaware of these biases. “As a result, considering a dataset as a reflection of humanity without understanding its true composition will generate numerous negative effects down the road,” she says.
For Andrés Masegosa, an artificial intelligence expert and associate professor at Aalborg University in Denmark, there is nothing surprising about the biases: “This technology is able to extract patterns from a dataset and simply tries to replicate what already exists.” AI learns in much the same way humans do, he suggests. “If you expose a child to sexist behavior, he will reproduce that bias in a purely unconscious way,” says the academic, who was not involved in the research.
“There are many attempts to avoid bias, and what is clear is that the models lose capability. There is a debate in the field of AI that reflects the different visions each society has,” Masegosa adds. He acknowledges that the study is a large-scale audit and notes that examining datasets of this size is costly work.
Unlike text, audio requires far more storage, says Sauvik Das, a researcher at Carnegie Mellon University’s Human-Computer Interaction Institute who was involved in the research (an hour of uncompressed CD-quality audio occupies roughly 600 MB, while an hour of transcribed speech fits in a few dozen kilobytes). This means audio datasets also demand much more processing power to audit. “We need more data to have higher quality models,” he argues.
The voice is biometric data
The potential harm of generative audio technologies is not yet fully known. The scientists suggest that this type of content will have social and legal implications ranging from people's right of publicity to misinformation and intellectual property, especially when these systems are trained on data used without authorization. The study notes that at least 35% of the audio analyzed contained material protected by copyright.
The voice is tied to the right to one's own image, as it is part of a person's physical characteristics. Borja Adsuara, a lawyer specializing in digital law, points out that the voice raises the same problems as AI-generated text and images with regard to data protection and intellectual property. “The voice is biometric data and is specially protected, like a fingerprint or the iris of the eye. That protection can be violated if its use is not authorized,” the specialist explains.
Adsuara recalls the well-known controversy involving actress Scarlett Johansson, when in May 2024 Sky, one of the voices of OpenAI's ChatGPT, sounded similar to her own. AI has also been used to clone musicians' voices and simulate melodies they never performed, as happened with the Puerto Rican artist Bad Bunny and the Spanish artist Bad Gyal. “It not only infringes the image rights over the voice itself, but also the intellectual property rights over the performance. The problems are the same, and what generative artificial intelligence does is make it much easier to commit a crime or an infringement,” he explains.
Constanza Cabrera, El País, Spain