User:Ruben/Prototyping/Sound and Voice: Difference between revisions

From XPUB & Lens-Based wiki
A project using voice recognition ([http://cmusphinx.sourceforge.net/ Pocketsphinx]) with Python. <ref name='tutorial'>https://mattze96.safe-ws.de/blog/?p=640</ref>
This script has undergone many iterations.
The first version merely extracted the spoken pieces.
[[File:VoiceDetection1.png|200px|thumbnail|right|ugly graph of the second version]]
The second version created an ugly graph to show how much was spoken in a given part of a film (according to the speech recognition, which often detects things that are not there).
[[File:VoiceDetection2.png|200px|thumbnail|right|A third version]]
A third version could detect the spoken language using Pocketsphinx. It then used ffmpeg and ImageMagick to extract frames from the film and append them into a single image. Wherever there is spoken text, this image is overlaid with a black gradient, so as to 'hide' the image.




<references></references>

Revision as of 01:03, 15 January 2015

A project using voice recognition (Pocketsphinx) with Python. [1]

This script has undergone many iterations.

The first version merely extracted the spoken pieces.

Figure: ugly graph of the second version

The second version created an ugly graph to show how much was spoken in a given part of a film (according to the speech recognition, which often detects things that are not there).
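
The counting behind such a graph could be sketched roughly as follows. This is only an illustration, not the original script: it assumes the recognizer's output has already been reduced to a list of (start, end) times in seconds, and the names `speech_per_window`, `segments`, `film_length` and `window` are made up for the example.

```python
import math


def speech_per_window(segments, film_length, window=60.0):
    """Return the seconds of detected speech in each window of the film.

    segments    -- hypothetical list of (start, end) times in seconds
    film_length -- total length of the film in seconds
    window      -- width of one part of the film in seconds (assumed value)
    """
    n_windows = math.ceil(film_length / window)
    totals = [0.0] * n_windows
    for start, end in segments:
        for i in range(n_windows):
            w_start = i * window
            w_end = w_start + window
            # Length of the part of this segment that falls inside window i.
            overlap = min(end, w_end) - max(start, w_start)
            if overlap > 0:
                totals[i] += overlap
    return totals


if __name__ == "__main__":
    # Made-up segments, only to show the shape of the data.
    example = [(10.0, 20.0), (50.0, 70.0), (130.0, 135.0)]
    for i, seconds in enumerate(speech_per_window(example, 180.0)):
        # One '#' per second of speech gives a crude text version of the graph.
        print(f"minute {i}: {'#' * int(seconds)}")
```

Because the recognizer often reports speech that is not there, the totals are only as reliable as the segments that go in.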

Figure: a third version

A third version could detect the spoken language using Pocketsphinx. It then used ffmpeg and ImageMagick to extract frames from the film and append them into a single image. Wherever there is spoken text, this image is overlaid with a black gradient, so as to 'hide' the image.
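
The frame extraction and appending steps might look something like the sketch below. It is a guess at the shape of the pipeline, not the script itself: the file names, the ten-second interval and the helper names are assumptions, and the commands are only built as argument lists, not run. The `fps` filter of ffmpeg and the `+append` operator of ImageMagick do exist; on ImageMagick 7 the program is usually called `magick` rather than `convert`.

```python
def ffmpeg_frames_cmd(film, out_pattern="frames/%04d.png", every=10):
    """Build an ffmpeg command that saves one frame every `every` seconds."""
    return ["ffmpeg", "-i", film, "-vf", f"fps=1/{every}", out_pattern]


def append_cmd(frame_files, out="strip.png"):
    """Build an ImageMagick command that joins the frames side by side."""
    return ["convert", *frame_files, "+append", out]


def frames_with_speech(n_frames, every, segments):
    """Flag each frame whose timestamp falls inside a speech segment.

    Assumes frame i was taken at roughly i * every seconds and that
    `segments` is a hypothetical list of (start, end) times in seconds.
    """
    flags = []
    for i in range(n_frames):
        t = i * every
        flags.append(any(start <= t < end for start, end in segments))
    return flags


if __name__ == "__main__":
    print(ffmpeg_frames_cmd("film.mp4"))
    print(append_cmd(["frames/0001.png", "frames/0002.png"]))
    print(frames_with_speech(4, 10, [(5, 15), (30, 31)]))
```

The flagged frames are the ones that would get the black gradient; each argument list could be handed to `subprocess.run` to actually run the tools.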