User:Angeliki/Grad project speech analysis

From XPUB & Lens-Based wiki

A. Re-humanizing voice samples

What do I want to make?

I aim to raise awareness of speech recognition and analysis tools that are altering how we perceive our own embodied voices. Beyond their experimental or commercial uses, they are also used to control our access to countries or institutions, and they affect how we behave with our bodies and with technology. These tools, which have recently come into very broad use, are trained on databases of 'real' voice samples. The samples come from different sources around the world [research projects, frequencies, radio], sometimes with the permission of the people donating their voice. The more accents a dataset contains and the bigger it becomes, the better the tool can be trained. At the same time, these tools (like automatic dialect analysis) are used by states [Germany] to verify refugees' claims of origin. This process can often go wrong because "Identifying the region of origin for anyone based on their speech is an extremely complex task" and depends on "a wide range of factors".[1]

How do I plan to make it?

I will do this by relating the two sides of the tool and rethinking/hacking it in a way that opens a conversation around this issue/radicalises these tools [maybe re-enact the collection of data by frequencies or reading texts]. To rethink these tools, I will look at other counter-practices and tools used to bring awareness to our voices and bodies, and use these tools in a way that lets people express their will.
I intend to open up this process of donating our voice samples and propose other ways to perceive it. I will open up the archive of voice samples to the public/broadcast the archive in physical space, or search for new samples through live scanning with antennas.

What is my timetable?

Interview with Mitra Azar on the disembodiment of the gaze and the appropriation of the new technologies that surveil us.

I want to focus on the topics that you raised in a Radio Web Macba podcast and relate them to the corresponding disembodiment of the voice [27:40 Gaze disembodiment, 29:40 Selfies: the first step of disembodiment, 31:00 P.O.V., F.P.V., drones, CCTV.... Is it possible to inject agency into a disembodied image?]
- Describe your empirical methodology further. How do you first try these tools [GPS, CCTV, drones] yourself?
- What do you mean by disembodiment? What do you think are its effects?
- I would like to discuss the disembodiment I observe regarding speech.
- How do you re-use these tools, injecting political agency into them?

Why do I want to make it?

I want to do this project because we are gradually yet rapidly donating our personal data to big organisations for the sake of the "public good" without thinking about it [ethics, politics on our bodies and social behaviour]. I also observe that all these new politics entering our private and collective spheres estrange [2] us from our relations with others and with our surroundings/dehumanize our lives. I am especially interested in the voice because it is a personal and unique element, closely related to our bodily conditions. For oral cultures the voice was a medium for spreading knowledge, in a way that differs a lot from writing cultures: "When auditory experiences are shared, histories too are shared, and not only from mouth to ear: they are perceived by and encoded in the body through the physicality of sound waves and passed on from one generation to another."[3] "should we be worried about the large-scale harvesting of our voiceprints?" "The companies behind this technology say that a voiceprint includes more than 100 unique physical and behavioural characteristics of each individual, such as length of the vocal tract, nasal passage, pitch, accent and so on. They claim it is as unique to an individual as a fingerprint, and that their systems even recognise people if they have a cold or sore throat."[4] "Your voice is yours alone – as unique to you as your fingerprints, eyeballs and DNA."[5]

Who can help me and how?

Michael, with the speech recognition software [training, hacking,...]


Relation to a larger context




Examples that reflect that topic

Voice samples for training speech analysis software (LDC). Tracing the samples. Using speech analysis software to verify voice samples.


what data: ordered samples or real samples (broadcast conversations, broadcast news, field recordings [air traffic, walking/background noise], meeting speech, microphone conversation, microphone speech, telephone conversations, telephone speech, transcribed speech, video)

examples of verification: diagnostic tools (for disease, depression), personal assistants (humanizing the software voice), verification of the claims of origin of refugees seeking asylum (Germany), banks


from where: universities (of linguistics) around the world, research projects or satellites, radio


extracts of descriptions of the samples: "Transcripts have been made of all recordings in this publication, manually time aligned to the phrasal level, annotated to identify boundaries between news stories, speaker turn boundaries and gender information about the speakers.", "The audio files are 8 KHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours.", "The Air Traffic Control Corpus (ATC0) is comprised of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained languaged, etc.) The audio data is composed of voice communication traffic between various controllers and pilots."
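To get a feel for what the second description means in practice, the sketch below works out how much raw data one such recording holds. It assumes a single (mono) channel and uncompressed storage, neither of which the description states, and the function name is my own.

```python
def pcm_size_bytes(sample_rate_hz, bits_per_sample, seconds, channels=1):
    """Size of uncompressed linear PCM audio, ignoring any file header.

    channels=1 is an assumption: the description does not say mono or stereo.
    """
    bytes_per_sample = bits_per_sample // 8
    return int(sample_rate_hz * bytes_per_sample * channels * seconds)


# "8 KHz, 16-bit linear sampled data ... for one to two hours"
for hours in (1, 2):
    size = pcm_size_bytes(8000, 16, hours * 3600)
    print(f"{hours} h of continuous monitoring: {size / 1_000_000:.1f} MB")
```

Because the recordings are made "without squelch or silence elimination", the size depends only on the duration, not on how much speech the tape actually contains.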


with permission from the users or not, in the case of real samples: a matter of privacy and of de-humanizing automated processes regarding control of the body

Some audio samples with their transcriptions:


(microphone) LDC93S1 0 46797 She had your dark suit in greasy wash water all year.
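The two numbers in the transcription above look like the first and last sample positions of the utterance in the audio file. A minimal sketch of reading such a line, assuming that interpretation and assuming a 16 kHz sampling rate (the rate is not stated on this page; the function name is hypothetical):

```python
def parse_transcription(line, sample_rate_hz=16000):
    """Split a '<start sample> <end sample> <text>' line (assumed layout).

    sample_rate_hz is an assumption used only to convert samples to seconds.
    """
    start, end, text = line.split(maxsplit=2)
    start, end = int(start), int(end)
    duration_s = (end - start) / sample_rate_hz
    return start, end, text, duration_s


start, end, text, duration = parse_transcription(
    "0 46797 She had your dark suit in greasy wash water all year."
)
print(f"{duration:.2f} s: {text}")
```

Read this way, the sentence becomes a span of samples with a duration attached, which is how the voice is handled once it enters the dataset.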


(broadcast conversation) por que al fin y al cabo el miedo de la mujer a la violencia del hombre es el espejo del miedo del hombre a la mujer sin miedo CMPB_M_32_01IVN_00004


(microphone conversation)

Interview 15
(A=Interviewer; B=Interviewee)
A: So we are recording.  Awesome.  So how long have you lived in Flint? (unclear)
B: 38 years.
A: Is that your whole life?  Wow you look really young.
B: Thank you!(...)

(air traffic)
((TAPE-HEADER "TAPE02; LOGAN, BOSTON ATCT; FINAL ONE, F1; 126.5 MHz; 26 JUNE 1991, 2012 TO 2212 UTC; TRANSCRIBER FR"))

((COMMENT 
   "CONTAINS TWO CONTROLLER CHANGES AND SOME PILOT CHANGES; NUMEROUS COMMENTS AND TRANSMISSIONS TO TWO FLIGHTS AT ONCE"))

((COMMENT 
   "ONE TRANSMISSION AT END OF ORIGINAL TAPE WAS CUT OFF IN THE OFFICIAL COPY; THIS WAS DELETED FROM THE TRANSCRIPT"))

((FROM NERA3788) (NUM L02F1-0001)
 (TO F1-1)
 (TEXT THOUSAND ONE NINETY WE (QUOTE LL) GIVE YOU THAT ON THE SPEED AND WE 
   (QUOTE RE) CLEARED FOR THE APPROACH AH NERA THIRTY SEVEN EIGHTY EIGHT WE 
   (QUOTE LL) HOLD SHORT OF TWO SEVEN)
 (TIMES 1.49 6.57))

((FROM F1-1) (NUM L02F1-0002)
 (TO NERA3788 GAA329)
 (TEXT THANKS BIZEX THREE TWENTY NINE TURN LEFT HEADING ONE CORRECTION ZERO 
   NINER ZERO)
 (TIMES 6.59 11.17)
 (COMMENT "CONTROLLER TALKED TO TWO AIRCRAFT IN SAME TRANSMISSION"))(...)
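The air traffic transcript is structured enough to be read by a program. The sketch below pulls the sender, the receivers and the time stamps out of the two transmissions shown above. It is only an illustration: the regular expression is my own guess at the layout based on this excerpt, not an official parser for the corpus, and I am assuming the TIMES values are seconds from the start of the tape.

```python
import re

# The two transmissions from the excerpt above.
TRANSCRIPT = """
((FROM NERA3788) (NUM L02F1-0001)
 (TO F1-1)
 (TEXT THOUSAND ONE NINETY WE (QUOTE LL) GIVE YOU THAT ON THE SPEED AND WE
   (QUOTE RE) CLEARED FOR THE APPROACH AH NERA THIRTY SEVEN EIGHTY EIGHT WE
   (QUOTE LL) HOLD SHORT OF TWO SEVEN)
 (TIMES 1.49 6.57))

((FROM F1-1) (NUM L02F1-0002)
 (TO NERA3788 GAA329)
 (TEXT THANKS BIZEX THREE TWENTY NINE TURN LEFT HEADING ONE CORRECTION ZERO
   NINER ZERO)
 (TIMES 6.59 11.17)
 (COMMENT "CONTROLLER TALKED TO TWO AIRCRAFT IN SAME TRANSMISSION"))
"""

# Assumed layout: each transmission has one FROM, one TO and one TIMES field,
# in that order. DOTALL lets ".*?" skip over the multi-line TEXT field.
PATTERN = re.compile(
    r"\(FROM (?P<sender>[^)]+)\).*?"
    r"\(TO (?P<receivers>[^)]+)\).*?"
    r"\(TIMES (?P<start>[\d.]+) (?P<end>[\d.]+)\)",
    re.DOTALL,
)


def list_transmissions(transcript):
    """Return one dict per transmission found in the transcript text."""
    rows = []
    for m in PATTERN.finditer(transcript):
        rows.append({
            "from": m["sender"],
            "to": m["receivers"].split(),
            "start": float(m["start"]),
            "end": float(m["end"]),
        })
    return rows


for row in list_transmissions(TRANSCRIPT):
    length = row["end"] - row["start"]  # assumed to be seconds
    print(f'{row["from"]} -> {", ".join(row["to"])}: {length:.2f}')
```

Listing the transmissions this way makes it easy to trace who speaks to whom and for how long, which is the kind of tracing of samples I am interested in.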

(telephone conversation/ giving directions on spot while walking)

Other examples of collecting voice samples for training

Mozilla's Common Voice: they ask users to donate their voice and also to check transcriptions made by the speech recognition software. They also use online databases with voice samples, as discussed in their forum. Tatoeba is one.
Tom refused our help.