Convolution/Interpolation kernels and windowing in Digital Signal Processing


Hi, I’m currently getting into Digital Signal Processing: I’m planning to integrate into my app a tool that visualizes a Morse code input (captured through an open mic, so subject to all kinds of background noise such as human voices, ambient sounds etc.) as a spectrogram, and maybe, in the future, recognizes it by extracting audio features. I have a background in the continuous Fourier Transform from my Signal Theory class at university.

Now, the actual problem is that the internet doesn’t lack material on the subject, but frustratingly enough, everything I could find expects you to already have a solid knowledge of it, which I don’t. So, to be clear, my current task is the following:

**GOAL:** Allow the user to view a recorded audio file either as a waveform or a spectrogram, letting them smoothly (60+ fps) zoom in/out to increase/decrease the level of detail, while maintaining good quality in the selected visualization mode (**for waveform**: a smooth envelope that doesn’t discard relevant details about the pitches; **for spectrogram**: allow zooming the timescale while maintaining good image quality).

So, here’s a suggestion I got:

>In either domain, a good (visually appealing with minimal information loss) way to smoothly zoom into a signal is to use a Sinc-like (windowed Sinc of some width) interpolation kernel for the downsampling. A Sinc interpolation kernel in either domain acts as smoother, summarizing local information. In the time domain, a Sinc interpolator of the proper width acts as a low pass filter suitable for anti-aliasing.

So now, the thing is this:

* In the time domain, if I want to downsample an audio file, obtaining one downscaled sample for every h input samples (h stands for “hop” in my notation), I proceed as follows: I take a group of nearby samples, called a *window* (128 samples per window in my current implementation), perform some computation on that window to produce the next downsampled sample, then slide the window by h samples and repeat until the next window would exceed the original sample count.
* In the frequency domain, I have no clue how I’m supposed to apply windowing (I’m computing the spectrogram via STFT)
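For reference, here’s a minimal sketch of the time-domain sliding-window procedure I just described. The RMS reduction is just a placeholder for whatever per-window computation ends up being appropriate; the window size and hop are the ones from my current implementation:

```python
import numpy as np

def downsample_windows(samples, window_size=128, hop=32):
    """Slide a window of `window_size` samples forward by `hop` each step,
    reducing each window to one output sample (RMS here, as a placeholder
    for the actual per-window computation)."""
    out = []
    start = 0
    while start + window_size <= len(samples):
        win = samples[start:start + window_size]
        out.append(np.sqrt(np.mean(win ** 2)))  # RMS of the current window
        start += hop
    return np.array(out)

# 1 second of a 440 Hz tone sampled at 8 kHz
t = np.arange(8000) / 8000.0
envelope = downsample_windows(np.sin(2 * np.pi * 440 * t))
```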

So now, my questions are:

1. What the heck is an ***interpolation kernel*** anyway? Should I sample a sinc function centered on my window, apply a window function (say Blackman-Harris) to it, and then multiply my samples by the result (i.e. apply it locally)? Or should I convolve the whole original audio sample sequence with a window-multiplied sinc, and then take one sample from the resulting signal every h samples (i.e. apply it globally)?
2. In either case, since the tone of the Morse code can vary and isn’t known a priori, how do I choose an appropriate sinc *width*? And what is the sinc width defined as, anyway? Is it the cutoff frequency, as in `s(t) = 2w·sinc(2w·t)` with w its width/cutoff frequency? Or is it the number of samples I keep from the continuous-time sinc function?
3. What does applying such a kernel mean in the frequency domain? My frequency-domain representation of the signal is a complex-valued matrix with a row for each frequency bin and a column for each window of samples. Should I just multiply by the Fourier transform of the selected interpolation kernel?
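To make question 1 concrete, here’s my current understanding of the “global” interpretation, sketched under two assumptions of mine: I’m using numpy’s Blackman window as a stand-in for Blackman-Harris (which numpy doesn’t ship), and I’m guessing that for downsampling by h the cutoff should sit at the new Nyquist frequency, i.e. 0.5/h cycles per sample:

```python
import numpy as np

def windowed_sinc_kernel(cutoff, num_taps=129):
    """FIR low-pass kernel: an ideal sinc with normalized cutoff
    (cycles/sample, 0 < cutoff <= 0.5), tapered to finite length by a
    Blackman window, then normalized to unit DC gain."""
    n = np.arange(num_taps) - (num_taps - 1) / 2  # tap indices centered at 0
    h = 2 * cutoff * np.sinc(2 * cutoff * n)      # ideal low-pass impulse response
    h *= np.blackman(num_taps)                    # finite-length taper
    return h / h.sum()                            # unit gain at DC

def decimate(samples, hop, kernel):
    """'Global' application: convolve the whole signal with the
    anti-aliasing kernel once, then keep every hop-th sample."""
    filtered = np.convolve(samples, kernel, mode="same")
    return filtered[::hop]

hop = 4
kernel = windowed_sinc_kernel(cutoff=0.5 / hop)
x = np.sin(2 * np.pi * 0.05 * np.arange(1000))  # tone well below the new Nyquist
y = decimate(x, hop, kernel)
```

Is this, then, what is meant by an interpolation kernel, with the “local” interpretation being equivalent whenever the kernel is no longer than the window?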
