I’m sitting in my office overlooking Mile End, listening to a 20-year-old recording of two people sitting in their kitchen, chatting over the sound of BBC Radio 3 about their friends, their weekend, what’s on TV, and how prim and proper Swiss people are.

It feels like being magically transported back to 1993, when the British National Corpus recruited 124 men and women of balanced ages and demographically assigned social classes and asked them to carry around tape recorders to capture the conversations they had with friends, family, neighbours and co-workers every day.

Over 700 hours of conversation were recorded and painstakingly transcribed and annotated to enable researchers to analyse an immense corpus of naturalistic language data (still representing only 10% of the total data in the BNC, which consists mostly of written books and journals and transcribed broadcasts). All this data has been used as a primary resource by computational linguists, natural language researchers, sociologists and all kinds of researchers measuring their models of language learning and production against the empirical evidence.

However, for the most part, only the text transcriptions of this data, rather than the audio itself, have been easily accessible to researchers until very recently. In the last year, the Oxford Phonetics Lab has produced a British National Corpus Spoken Audio Sampler, after digitising, cataloguing and analysing the mountain of audio cassettes that were hidden away in the British Library Sound Archive. They are soon going to make the entire “Audio BNC” available online to anyone who wants to listen to the original recordings on which so much research has been based, and the director, Professor John Coleman, kindly made selected recordings available to me as a beta tester.

Using Matthew Purver’s SCoRE BNC search tool, I’ve been able to do a full-text search of the Audio BNC, and find naturalistic examples of conversations on specific topics (I’m looking for people talking about art, design, fashion, architecture, or otherwise engaging in aesthetic discussions), and then just dip into their lives at those specific moments. It is fascinating. The sense of omnipotence is almost intoxicating, especially because sitting here, listening and reading along with the original 1990s transcriptions, I get a strong sense of how much has changed in terms of the knowledge production tools available to researchers since then.

The text transcriptions I’m reading are full of instances in which the transcriber marks the speech as <unclear>, omitting the names and references being made. Especially as I’m looking for people talking about art, I’ve found that almost all of the names of artists, musicians or other cultural references made by people in conversation are labelled <unclear>. That is understandable: how can the transcriber be expected to have a familiarity with relatively obscure painters from Swiss art history? With just a few contextual references, Google and Wikipedia make it trivial to identify about 90% of these <unclear> instances. Similarly, by pressing my Android phone to my headphone speakers and running Shazam, I’m able to identify what music they’re listening to on the radio in their kitchen while they chat.

One of the most powerful things about the Audio BNC being released today is the opportunity to apply contemporary search and analysis tools to finding instances of naturalistic conversation from a huge range of contexts and situations involving different professional, socio-demographic and cultural groups, and then drop in and listen to what’s going on. Having pored over the transcriptions of these people’s speech, it’s a fantastic revelation to hear their accents and intonations, and to get a sense of the detail of how they do the work of ‘being ordinary’ in the privacy of their homes and intimate relationships, then in public, then at the office.

It’s the ultimate fly-on-the-wall experience, and it feels like sitting in front of a new telescope, suddenly able to inspect in great detail specific areas of a previously vague and undifferentiated view of a distant galaxy.