Sunday, April 12, 2020

New real-time deep learning Morse decoder


Introduction

I have done some experiments with deep learning models previously. This previous blog post  covers the new approach of building Morse decoder by training a CNN-LSTM-CTC model using audio that is converted to small image frames.

In this latest experiment I trained  a new Tensorflow based CNN-LSTM-CTC model  using 27.8 hours of Morse audio training set  (25,000 WAV files - each clip 4 seconds) and achieved character error rate of 1.5% and word accuracy of 97.2% after 2:29:19 training time. The training data corpus was created from ARRL Morse code practice files (text files).

New real-time deep learning Morse decoder

I wanted to see if this new model is capable of decoding audio in real-time so I wrote a simple Python script to listen microphone, create a spectrogram, detect the CW frequency automatically, and feed 128 x 32 images to the model to perform the decoding inference.

With some tuning of the various components and parameters I was able to put together a working prototype using standard Python libraries and the Tensorflow Morse decoder that is available as open source in Github.

I recorded this sample YouTube video below in order to document this experiment.

Starting from the top left I have FLDIGI  window open decoding CW at 30 WPM speed. On the top middle I have console window open printing the frame number, CW tone frequency followed by "infer_image:" and decoded text as well as the probability that the model assigns to this result.

On the top right I have the Spectrogram window that plots 4 seconds of the audio on a frequency scale.  The morse code is quite readable on this graph.

On the bottom left I have Audacity  playing a sample 30 WPM practice file from ARRL. Finally, on the bottom right I have the 128x32 image frame that I am feeding to the model.





Analysis

The full text at 30 WPM is here - I have highlighted the text section that is playing in the above video clip.

�  NOW 30 WPM  �  TEXT IS FROM JULY 2015 QST  PAGE 99 �

AGREEMENT WITH SOUTHCOM GRANTED ATLAS ACCESS TO THE SC 130S TECHNOLOGY.
THE ATLAS 180 ADAPTED THE MAN PACK RADIOS DESIGN FOR AMATEUR USE.  AN
ANALOG VFO FOR THE 160, 80, 40, AND 20 METER BANDS REPLACED THE SC 130S
STEP TUNED 2 12 MHZ SYNTHESIZER.  OUTPUT POWER INCREASED FROM 20 W TO 100
W.  AMONG THE 180S CHARMS WAS ITS SIZE.  IT MEASURED 9R5 X 9R5 X 3 INCHES.
THATS NOTHING SPECIAL TODAY, BUT IT WAS A TINY RIG IN 1974.  THE FULLY
SOLID STATE TRANSCEIVER FEATURED NO TUNE OPERATION.  THE VFOS 350 KHZ RANGE
REQUIRED TWO BAND SWITCH SEGMENTS TO COVER 75/80 METERS, BUT WAS AMPLE FOR
THE OTHER BANDS.  IN ORDER TO IMPROVE IMMUNITY TO OVERLOAD AND CROSS
MODULATION, THE 180S RECEIVER HAD NO RF AMPLIFIER STAGE THE ANTENNA INPUT
CIRCUIT FED THE RADIOS MIXER DIRECTLY.  A PAIR OF SUCCESSORS EARLY IN 1975,
ATLAS INTRODUCED THE 180S SUCCESSOR IN REALITY, A PAIR OF THEM.  THE NEW
210 COVERED 80 10 METERS, WHILE THE OTHERWISE IDENTICAL 215 COVERED 160 15
METERS HEREAFTER, WHEN THE 210 SERIES IS MENTIONED, THE 215 IS ALSO
IMPLIED.  BECAUSE THE 210 USED THE SAME VFO AND BAND SWITCH AS THE 180,
SQUEEZING IN FIVE BANDS SACRIFICED PART OF 80 METERS.  THAT BAND STARTED AT
�  END OF 30 WPM TEXT  �  QST DE W1AW  �

As can be seen from the YouTube video FLDIGI is able to copy this CW quite well.  The new deep learning Morse decoder is also able to decode the audio with probabilities ranging from 4% to over 90% during this period.

It has visible problems when the current image frame cuts the Morse character into parts. The scrolling  128x32 image that is produced from the spectrogram graph does not have any smarts  - it is just copied at every update cycle and fed into the infer_image() function. This means that a single Morse character is moving out of the frame but some part of the character can be still visible, causing incorrect decodes.

The decoder has also problems with some numbers even when fully visible in the 128x32 image frame.  The ARRL training material that I used to build the corpus for training has about 8.6% words that are numbers (such as bands, frequencies and years).  I believe that the current model doesn't have enough examples to decode all the numbers correctly.

The final problem is the lack of spaces between the words. The current model doesn't know about the "Space" character so it is just decoding what it has been trained on.


Software

The python script running the model is quite simple and listed below. I adapted the main Spectogram loop from this Github repo.  I used the following constants in mic_read.py.

RATE = 8000
FORMAT = pyaudio.paInt16 #conversion format for PyAudio stream
CHANNELS = 1 #microphone audio channels
CHUNK_SIZE = 8192 #number of samples to take per read
SAMPLE_LENGTH = int(CHUNK_SIZE*1000/RATE) #length of each sample in ms


specgram.py

"""
Created by Mauri Niininen (AG1LE)
Real time Morse decoder using CNN-LSTM-CTC Tensorflow model

adapted from https://github.com/ayared/Live-Specgram

"""
############### Import Libraries ###############
from matplotlib.mlab import specgram
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
import cv2


############### Import Modules ###############
import mic_read
from morse.MorseDecoder import  Config, Model, Batch, DecoderType


############### Constants ###############
SAMPLES_PER_FRAME = 4 #Number of mic reads concatenated within a single window
nfft = 256 # NFFT value for spectrogram
overlap = nfft-56 # overlap value for spectrogram
rate = mic_read.RATE #sampling rate


############### Call Morse decoder ###############
def infer_image(model, img):
    if img.shape == (128, 32):
        batch = Batch(None, [img])
        (recognized, probability) = model.inferBatch(batch, True)
        return img, recognized, probability
    else:
        print(f"ERROR: img shape:{img.shape}")

# Load the Tensorlow model 
config = Config('model.yaml')
model = Model(open("morseCharList.txt").read(), config, decoderType = DecoderType.BestPath, mustRestore=True)

stream,pa = mic_read.open_mic()


############### Functions ###############
"""
get_sample:
gets the audio data from the microphone
inputs: audio stream and PyAudio object
outputs: int16 array
"""
def get_sample(stream,pa):
    data = mic_read.get_data(stream,pa)
    return data
"""
get_specgram:
takes the FFT to create a spectrogram of the given audio signal
input: audio signal, sampling rate
output: 2D Spectrogram Array, Frequency Array, Bin Array
see matplotlib.mlab.specgram documentation for help
"""
def get_specgram(signal,rate):
    arr2D,freqs,bins = specgram(signal,window=np.blackman(nfft),  
                                Fs=rate, NFFT=nfft, noverlap=overlap,
                                pad_to=32*nfft   )
    return arr2D,freqs,bins

"""
update_fig:
updates the image, just adds on samples at the start until the maximum size is
reached, at which point it 'scrolls' horizontally by determining how much of the
data needs to stay, shifting it left, and appending the new data. 
inputs: iteration number
outputs: updated image
"""
def update_fig(n):
    data = get_sample(stream,pa)
    arr2D,freqs,bins = get_specgram(data,rate)
    
    im_data = im.get_array()
    if n < SAMPLES_PER_FRAME:
        im_data = np.hstack((im_data,arr2D))
        im.set_array(im_data)
    else:
        keep_block = arr2D.shape[1]*(SAMPLES_PER_FRAME - 1)
        im_data = np.delete(im_data,np.s_[:-keep_block],1)
        im_data = np.hstack((im_data,arr2D))
        im.set_array(im_data)

    # Get the image data array shape (Freq bins, Time Steps)
    shape = im_data.shape

    # Find the CW spectrum peak - look across all time steps
    f = int(np.argmax(im_data[:])/shape[1])

    # Create a 32x128 array centered to spectrum peak 
    if f > 16: 
        print(f"n:{n} f:{f}")
        img = cv2.resize(im_data[f-16:f+16][0:128], (128,32)) 
        if img.shape == (32,128):
            cv2.imwrite("dummy.png",img)
            img = cv2.transpose(img)
            img, recognized, probability = infer_image(model, img)
            if probability > 0.0000001:
                print(f"infer_image:{recognized} prob:{probability}")
    return im,

def main():
    
    global im
    ############### Initialize Plot ###############
    fig = plt.figure()
    """
    Launch the stream and the original spectrogram
    """
    stream,pa = mic_read.open_mic()
    data = get_sample(stream,pa)
    arr2D,freqs,bins = get_specgram(data,rate)
    """
    Setup the plot paramters
    """
    extent = (bins[0],bins[-1]*SAMPLES_PER_FRAME,freqs[-1],freqs[0])
    
    im = plt.imshow(arr2D,aspect='auto',extent = extent,interpolation="none",
                    cmap = 'Greys',norm = None) 

    plt.xlabel('Time (s)')
    plt.ylabel('Frequency (Hz)')
    plt.title('Real Time Spectogram')
    plt.gca().invert_yaxis()
    #plt.colorbar() #enable if you want to display a color bar

    ############### Animate ###############
    anim = animation.FuncAnimation(fig,update_fig,blit = True,
                                interval=mic_read.CHUNK_SIZE/1000)

                                
    try:
        plt.show()
    except:
        print("Plot Closed")

    ############### Terminate ###############
    stream.stop_stream()
    stream.close()
    pa.terminate()
    print("Program Terminated")

if __name__ == "__main__":
    main()

I did run this experiment on Macbook Pro (2.2 GHz Quad-Core Intel Core i7) and MacOS Catalina 10.15.3.  The Python version used was Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31)  [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin

Conclusions

This experiment demonstrates that it is possible to build a working real time Morse decoder based on deep learning Tensorflow model using a slow interpreted language like Python.  The approach taken here is quite simplistic and lacks some key functionality, such as alignment of decoded text to audio timeline.

It also shows that there are still more work to do in order to build a fully functioning, open source and high performance Morse decoder.  A better event driven software architecture would allow building a proper user interface with some controls, like audio filtering.   Such an architecture would enable also building server side decoders running based on audio feeds from WebSDR receivers etc.

Finally, the Tensorflow model in this experiment has a very small training set, only 27.8 hours of audio.  If you compare to commercial ASR (automatic speech recognition) engines they have been trained using over 1000X  more labeled audio training material.   To get better performance from deep learning models you need to have a lot of high quality labeled training material that matches with the typical sound environment the model will be used on.


73
Mauri AG1LE