Ndiyo means yes in Swahili and hapana means no. The two words were chosen because they can be used to build very simple applications that require a yes/no response, such as an automatic telephone prompt system. Swahili is the national language of Kenya, so an interface that recognizes these words can be used in local applications.
Simple speech recognition
The first step is to obtain speech samples of ndiyo and hapana from a large sample of people throughout the country. This can be done using a simple microphone and a recording instrument. For this project, however, I took the generic pronunciations of the words from Google Translate: ndiyo, hapana (I know, sounds weird).
We can now plot the two audio files as periodogram power spectral density (PSD) plots using the fast Fourier transform (FFT) in MATLAB. The FFT is simply an algorithm that computes the discrete Fourier transform (DFT) more efficiently, reducing the number of operations from the order of N^2 to the order of N log N for a signal of N samples.
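To make the relationship between the DFT and the FFT concrete, here is a minimal sketch that evaluates the DFT sum directly and compares it with MATLAB's built-in `fft`. The test signal, its length and the variable names are assumptions chosen for illustration; they do not come from the recordings.

```matlab
% Compare a direct O(N^2) DFT with MATLAB's built-in fft on a short signal.
N = 64;                                       % assumed length, for illustration only
n = 0:N-1;
x = cos(2*pi*5*n/N) + 0.5*sin(2*pi*12*n/N);   % synthetic test signal (row vector)

% Direct DFT: X(k) = sum over n of x(n) * exp(-j*2*pi*k*n/N)
W = exp(-1j*2*pi*(n')*n/N);   % N-by-N matrix of complex exponentials
X_dft = W * x.';              % column vector of DFT coefficients

X_fft = fft(x).';             % the same coefficients computed by the FFT

max_diff = max(abs(X_dft - X_fft));
fprintf('largest difference: %g\n', max_diff);
```

The two results should agree to within floating-point rounding; only the amount of work differs. The same `fft` call is what the periodogram for the recordings is built on.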
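>> [signal,fs] = audioread('ndiyo.mp3');   % samples and sampling rate
>> N = length(signal);
>> xdft = fft(signal);
>> xdft = xdft(1:N/2+1);                   % keep the one-sided half (assumes N is even)
>> psdx = (1/(fs*N)) * abs(xdft).^2;       % scale to power per Hz
>> psdx(2:end-1) = 2*psdx(2:end-1);        % fold in negative frequencies, except DC and Nyquist
>> freq = 0:fs/N:fs/2;
>> plot(freq,10*log10(psdx))
>> grid on
>> title('Periodogram Using FFT')
>> xlabel('Frequency (Hz)')
>> ylabel('Power/Frequency (dB/Hz)')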
OR
>> [signal,fs] = audioread('ndiyo.mp3');
>> plot(psd(spectrum.periodogram,signal,'Fs',fs,'NFFT',length(signal)));
This gives the plots in figures 1 and 2.
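One way to sanity-check the periodogram scaling is to run it on a synthetic tone whose answer is known in advance. The sketch below is an illustration under stated assumptions: the 8 kHz sampling rate and the 1 kHz tone are made up, and the two-channel input only mimics what `audioread` returns for a stereo file (one column per channel). It also uses `floor(N/2)+1` so that odd-length signals work.

```matlab
% Periodogram of a synthetic stereo tone, as a check on the PSD scaling.
fs = 8000;                        % assumed sampling rate for the test
t = (0:fs-1)'/fs;                 % one second of samples
signal = [sin(2*pi*1000*t), sin(2*pi*1000*t)];   % two identical channels

x = mean(signal, 2);              % collapse the channels to mono
N = length(x);
half = floor(N/2) + 1;            % one-sided length, valid for odd and even N

xdft = fft(x);
xdft = xdft(1:half);
psdx = (1/(fs*N)) * abs(xdft).^2;
if mod(N,2) == 0
    psdx(2:end-1) = 2*psdx(2:end-1);   % even N: do not double DC or Nyquist
else
    psdx(2:end) = 2*psdx(2:end);       % odd N: there is no Nyquist bin
end
freq = (0:half-1)' * fs/N;

[~, idx] = max(psdx);
peak_freq = freq(idx);                 % frequency of the strongest bin
total_power = sum(psdx) * fs/N;        % integrate the PSD over frequency
fprintf('peak at %g Hz, total power %g\n', peak_freq, total_power);
```

The peak should land on the tone frequency, and the integrated PSD should match the mean-square value of a unit sine (one half), which confirms the `1/(fs*N)` scaling and the doubling step.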
According to the two plots above, the signal for ndiyo has more energy in the lower frequencies than that of hapana, and we can use this feature to differentiate the two signals. As the signals approach 4 kHz, however, they exhibit similar features and become harder to differentiate. Trial and error resulted in a range of 0 to 3620 Hz for the lower frequencies and 3620 to 11025 Hz for the higher frequencies. The feature itself is the ratio of the summed FFT magnitudes in the lower band to those in the higher band. A threshold value is necessary to separate the two words; it is obtained by calculating the feature for all of the audio samples and examining the histogram of the ndiyo and hapana values. I chose a threshold value of 12 as an example, but in practice this figure should be computed. The speech recognition algorithm is:
function output = ndiyo_hapana(x,fs)
% Simple algorithm for deciding whether the audio signal
% in vector x is the word 'ndiyo' or 'hapana'.
%   x      (vector)  speech signal
%   fs     (scalar)  sampling frequency in Hz
%   output (string)  'ndiyo' or 'hapana'
threshold = 12;            % threshold value
N = length(x);
k1 = round(N*3620/fs);     % FFT component corresponding to 3620 Hz
k2 = round(N*11025/fs);    % FFT component corresponding to 11025 Hz
X = abs(fft(x));
f = sum(X(1:k1))/sum(X(k1:k2));   % calculate feature (low band / high band)
if f < threshold
    output = 'ndiyo';      % if feature is below threshold, return 'ndiyo'
else
    output = 'hapana';     % otherwise, return 'hapana'
end
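To see how the band-ratio feature behaves, it can be evaluated on synthetic signals where the answer is known. The sketch below is only an illustration: the sampling rate of 22050 Hz (which makes 11025 Hz the Nyquist frequency) and the two test tones are assumptions, not properties of the recordings.

```matlab
% Synthetic check of the band-ratio feature used in ndiyo_hapana.
% Assumption: fs = 22050 Hz, so that 11025 Hz is the Nyquist frequency.
fs = 22050;
t = (0:fs-1)'/fs;                  % one second of samples
N = numel(t);

k1 = round(N*3620/fs);             % same band edges as the function
k2 = round(N*11025/fs);

% Two made-up test signals: one dominated by a 500 Hz tone,
% the other dominated by a 6000 Hz tone.
x_low  = sin(2*pi*500*t) + 0.1*sin(2*pi*6000*t);
x_high = 0.1*sin(2*pi*500*t) + sin(2*pi*6000*t);

X_low  = abs(fft(x_low));
X_high = abs(fft(x_high));

% Ratio of summed magnitudes: low band over high band.
f_low  = sum(X_low(1:k1))  / sum(X_low(k1:k2));
f_high = sum(X_high(1:k1)) / sum(X_high(k1:k2));

fprintf('low-dominated: f = %.2f, high-dominated: f = %.2f\n', f_low, f_high);
```

The signal dominated by the 500 Hz tone gives a ratio well above 1, and the one dominated by the 6000 Hz tone gives a ratio well below 1, so the feature tracks where the spectral energy sits relative to the 3620 Hz split.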
Using this algorithm, the output for the speech recognition function on the two audio files is:
>> [x,fs] = audioread('ndiyo.mp3');
>> ndiyo_hapana(x,fs)
ans = ndiyo
>> [x,fs] = audioread('hapana.mp3');
>> ndiyo_hapana(x,fs)
ans = hapana
This shows that the simple recognition algorithm successfully distinguished the two audio recordings.
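The threshold of 12 was hand-picked. As a rough sketch of how it could be computed instead, suppose the feature has already been calculated for a set of labelled recordings. The feature values below are made-up placeholders, not measurements, and picking the midpoint with the fewest misclassifications is just one simple option; inspecting the histograms by eye, as described earlier, is another.

```matlab
% Choose a threshold from labelled feature values.
% NOTE: these numbers are hypothetical placeholders, not measured data.
f_ndiyo  = [6.1 7.4 8.0 9.3 10.2];      % features of recordings labelled 'ndiyo'
f_hapana = [13.5 14.8 16.0 17.2 19.9];  % features of recordings labelled 'hapana'

% Candidate thresholds: midpoints between consecutive sorted feature values.
allf = sort([f_ndiyo f_hapana]);
candidates = (allf(1:end-1) + allf(2:end)) / 2;

% Count misclassifications under the rule "f < threshold means ndiyo".
errors = zeros(size(candidates));
for i = 1:numel(candidates)
    th = candidates(i);
    errors(i) = sum(f_ndiyo >= th) + sum(f_hapana < th);
end

[minErrors, best] = min(errors);        % first candidate with the fewest errors
threshold = candidates(best);
fprintf('threshold = %.2f (%d misclassified)\n', threshold, minErrors);
```

With real data the two groups will usually overlap, so the minimum error count will not be zero and the chosen threshold should be checked on recordings that were not used to pick it.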
REFERENCES AND SUPPORTING ARTICLES
- ‘Power Spectral Density Estimates Using FFT’ https://www.mathworks.com/help/signal/ug/power-spectral-density-estimates-using-fft.html
- ‘DSP Mini-Project: An Automatic Speaker Recognition System’ http://minhdo.ece.illinois.edu/teaching/speaker_recognition/speaker_recognition.html
- ‘Basic feature extraction and classification of audio files’ https://ccrma.stanford.edu/workshops/mir2011/Lab_1_2011.pdf
- ‘Enhance your DSP Course with These Interesting Projects’ http://www.asee.org/file_server/papers/attachment/file/0002/2611/Enhance_your_DSP_Course_with_these_Interesting_Projects.pdf