Sunday, April 12, 2020

New real-time deep learning Morse decoder


Introduction

I have done some experiments with deep learning models previously. This previous blog post covers the approach of building a Morse decoder by training a CNN-LSTM-CTC model on audio that is converted to small image frames.

In this latest experiment I trained a new TensorFlow based CNN-LSTM-CTC model using a 27.8-hour Morse audio training set (25,000 WAV files, each clip 4 seconds long) and achieved a character error rate of 1.5% and a word accuracy of 97.2% after 2:29:19 of training time. The training data corpus was created from ARRL Morse code practice files (text files).
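
For reference, the character error rate is essentially the edit distance between the decoded text and the ground truth, divided by the length of the ground truth. A minimal sketch of that metric (my own illustration, not the model's evaluation code):

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_error_rate(references, hypotheses):
    """Total edit distance divided by the total number of reference characters."""
    errors = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)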

New real-time deep learning Morse decoder

I wanted to see if this new model is capable of decoding audio in real time, so I wrote a simple Python script that listens to the microphone, creates a spectrogram, detects the CW frequency automatically, and feeds 128 x 32 images to the model to perform the decoding inference.

With some tuning of the various components and parameters I was able to put together a working prototype using standard Python libraries and the TensorFlow Morse decoder that is available as open source on GitHub.

I recorded this sample YouTube video below in order to document this experiment.

Starting from the top left, I have an FLDIGI window open decoding CW at 30 WPM. In the top middle I have a console window printing the frame number and CW tone frequency, followed by "infer_image:" with the decoded text and the probability that the model assigns to this result.

On the top right I have the Spectrogram window that plots 4 seconds of the audio on a frequency scale. The Morse code is quite readable on this graph.

On the bottom left I have Audacity  playing a sample 30 WPM practice file from ARRL. Finally, on the bottom right I have the 128x32 image frame that I am feeding to the model.





Analysis

The full text at 30 WPM is here - I have highlighted the text section that is playing in the above video clip.

NOW 30 WPM - TEXT IS FROM JULY 2015 QST PAGE 99

AGREEMENT WITH SOUTHCOM GRANTED ATLAS ACCESS TO THE SC 130S TECHNOLOGY.
THE ATLAS 180 ADAPTED THE MAN PACK RADIOS DESIGN FOR AMATEUR USE.  AN
ANALOG VFO FOR THE 160, 80, 40, AND 20 METER BANDS REPLACED THE SC 130S
STEP TUNED 2 12 MHZ SYNTHESIZER.  OUTPUT POWER INCREASED FROM 20 W TO 100
W.  AMONG THE 180S CHARMS WAS ITS SIZE.  IT MEASURED 9R5 X 9R5 X 3 INCHES.
THATS NOTHING SPECIAL TODAY, BUT IT WAS A TINY RIG IN 1974.  THE FULLY
SOLID STATE TRANSCEIVER FEATURED NO TUNE OPERATION.  THE VFOS 350 KHZ RANGE
REQUIRED TWO BAND SWITCH SEGMENTS TO COVER 75/80 METERS, BUT WAS AMPLE FOR
THE OTHER BANDS.  IN ORDER TO IMPROVE IMMUNITY TO OVERLOAD AND CROSS
MODULATION, THE 180S RECEIVER HAD NO RF AMPLIFIER STAGE THE ANTENNA INPUT
CIRCUIT FED THE RADIOS MIXER DIRECTLY.  A PAIR OF SUCCESSORS EARLY IN 1975,
ATLAS INTRODUCED THE 180S SUCCESSOR IN REALITY, A PAIR OF THEM.  THE NEW
210 COVERED 80 10 METERS, WHILE THE OTHERWISE IDENTICAL 215 COVERED 160 15
METERS HEREAFTER, WHEN THE 210 SERIES IS MENTIONED, THE 215 IS ALSO
IMPLIED.  BECAUSE THE 210 USED THE SAME VFO AND BAND SWITCH AS THE 180,
SQUEEZING IN FIVE BANDS SACRIFICED PART OF 80 METERS.  THAT BAND STARTED AT
END OF 30 WPM TEXT - QST DE W1AW

As can be seen from the YouTube video, FLDIGI is able to copy this CW quite well. The new deep learning Morse decoder is also able to decode the audio, with probabilities ranging from 4% to over 90% during this period.

It has visible problems when the current image frame cuts a Morse character into parts. The scrolling 128x32 image that is produced from the spectrogram graph does not have any smarts - it is just copied at every update cycle and fed into the infer_image() function. This means that when a single Morse character is moving out of the frame, part of the character can still be visible, causing incorrect decodes.
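
One possible mitigation, which also comes up in the comments below, is to decode overlapping frames and merge consecutive results by trimming the characters that were decoded twice. A rough sketch (the merge_decodes helper is hypothetical, not part of the current script):

def merge_decodes(prev: str, new: str) -> str:
    """Append `new` to `prev`, dropping the longest overlap where the end of
    `prev` matches the start of `new`, i.e. characters that appear in both of
    two overlapping image frames."""
    for k in range(min(len(prev), len(new)), 0, -1):
        if prev.endswith(new[:k]):
            return prev + new[k:]
    return prev + new

# merge_decodes("CQ DE AG1", "G1LE K") -> "CQ DE AG1LE K"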

The decoder also has problems with some numbers, even when they are fully visible in the 128x32 image frame. In the ARRL training material that I used to build the training corpus, about 8.6% of the words are numbers (such as bands, frequencies and years). I believe that the current model doesn't have enough examples to decode all the numbers correctly.
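
A quick way to check a corpus statistic like that 8.6% figure (a sketch only; the directory path is hypothetical):

import re
from pathlib import Path

def numeric_word_fraction(text: str) -> float:
    """Fraction of whitespace-separated words that are pure digit strings,
    e.g. bands, frequencies and years such as 80, 7030 or 2015."""
    words = text.upper().split()
    numbers = [w for w in words if re.fullmatch(r"\d+", w)]
    return len(numbers) / len(words) if words else 0.0

corpus = " ".join(p.read_text() for p in Path("arrl_texts").glob("*.txt"))
print(f"{numeric_word_fraction(corpus):.1%}")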

The final problem is the lack of spaces between the words. The current model doesn't know about the "space" character, so it just decodes what it has been trained on.


Software

The Python script running the model is quite simple and is listed below. I adapted the main spectrogram loop from this GitHub repo. I used the following constants in mic_read.py.

RATE = 8000
FORMAT = pyaudio.paInt16 #conversion format for PyAudio stream
CHANNELS = 1 #microphone audio channels
CHUNK_SIZE = 8192 #number of samples to take per read
SAMPLE_LENGTH = int(CHUNK_SIZE*1000/RATE) #length of each sample in ms
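
The mic_read module itself comes from the adapted GitHub repo and is not listed here. A minimal sketch of what its open_mic() and get_data() functions could look like with PyAudio and the constants above (my approximation, not the repo's exact code):

import numpy as np
import pyaudio

def open_mic():
    """Open a PyAudio input stream using the constants defined above."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                     input=True, frames_per_buffer=CHUNK_SIZE)
    return stream, pa

def get_data(stream, pa):
    """Read one chunk from the microphone and return it as an int16 array."""
    raw = stream.read(CHUNK_SIZE, exception_on_overflow=False)
    return np.frombuffer(raw, dtype=np.int16)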


specgram.py

"""
Created by Mauri Niininen (AG1LE)
Real time Morse decoder using CNN-LSTM-CTC Tensorflow model

adapted from https://github.com/ayared/Live-Specgram

"""
############### Import Libraries ###############
from matplotlib.mlab import specgram
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
import cv2


############### Import Modules ###############
import mic_read
from morse.MorseDecoder import  Config, Model, Batch, DecoderType


############### Constants ###############
SAMPLES_PER_FRAME = 4 #Number of mic reads concatenated within a single window
nfft = 256 # NFFT value for spectrogram
overlap = nfft-56 # overlap value for spectrogram
rate = mic_read.RATE #sampling rate


############### Call Morse decoder ###############
def infer_image(model, img):
    if img.shape == (128, 32):
        batch = Batch(None, [img])
        (recognized, probability) = model.inferBatch(batch, True)
        return img, recognized, probability
    else:
        print(f"ERROR: img shape:{img.shape}")

# Load the Tensorlow model 
config = Config('model.yaml')
model = Model(open("morseCharList.txt").read(), config, decoderType = DecoderType.BestPath, mustRestore=True)

stream,pa = mic_read.open_mic()


############### Functions ###############
"""
get_sample:
gets the audio data from the microphone
inputs: audio stream and PyAudio object
outputs: int16 array
"""
def get_sample(stream,pa):
    data = mic_read.get_data(stream,pa)
    return data
"""
get_specgram:
takes the FFT to create a spectrogram of the given audio signal
input: audio signal, sampling rate
output: 2D Spectrogram Array, Frequency Array, Bin Array
see matplotlib.mlab.specgram documentation for help
"""
def get_specgram(signal,rate):
    arr2D,freqs,bins = specgram(signal,window=np.blackman(nfft),  
                                Fs=rate, NFFT=nfft, noverlap=overlap,
                                pad_to=32*nfft   )
    return arr2D,freqs,bins

"""
update_fig:
updates the image, just adds on samples at the start until the maximum size is
reached, at which point it 'scrolls' horizontally by determining how much of the
data needs to stay, shifting it left, and appending the new data. 
inputs: iteration number
outputs: updated image
"""
def update_fig(n):
    data = get_sample(stream,pa)
    arr2D,freqs,bins = get_specgram(data,rate)
    
    im_data = im.get_array()
    if n < SAMPLES_PER_FRAME:
        im_data = np.hstack((im_data,arr2D))
        im.set_array(im_data)
    else:
        keep_block = arr2D.shape[1]*(SAMPLES_PER_FRAME - 1)
        im_data = np.delete(im_data,np.s_[:-keep_block],1)
        im_data = np.hstack((im_data,arr2D))
        im.set_array(im_data)

    # Get the image data array shape (Freq bins, Time Steps)
    shape = im_data.shape

    # Find the CW spectrum peak - look across all time steps
    f = int(np.argmax(im_data[:])/shape[1])

    # Create a 32x128 array centered to spectrum peak 
    if f > 16: 
        print(f"n:{n} f:{f}")
        img = cv2.resize(im_data[f-16:f+16, :], (128, 32))
        if img.shape == (32,128):
            cv2.imwrite("dummy.png",img)
            img = cv2.transpose(img)
            img, recognized, probability = infer_image(model, img)
            if probability > 0.0000001:
                print(f"infer_image:{recognized} prob:{probability}")
    return im,

def main():
    
    global im
    ############### Initialize Plot ###############
    fig = plt.figure()
    """
    Launch the stream and the original spectrogram
    """
    stream,pa = mic_read.open_mic()
    data = get_sample(stream,pa)
    arr2D,freqs,bins = get_specgram(data,rate)
    """
    Set up the plot parameters
    """
    extent = (bins[0],bins[-1]*SAMPLES_PER_FRAME,freqs[-1],freqs[0])
    
    im = plt.imshow(arr2D,aspect='auto',extent = extent,interpolation="none",
                    cmap = 'Greys',norm = None) 

    plt.xlabel('Time (s)')
    plt.ylabel('Frequency (Hz)')
    plt.title('Real Time Spectrogram')
    plt.gca().invert_yaxis()
    #plt.colorbar() #enable if you want to display a color bar

    ############### Animate ###############
    anim = animation.FuncAnimation(fig,update_fig,blit = True,
                                interval=mic_read.CHUNK_SIZE/1000)

                                
    try:
        plt.show()
    except:
        print("Plot Closed")

    ############### Terminate ###############
    stream.stop_stream()
    stream.close()
    pa.terminate()
    print("Program Terminated")

if __name__ == "__main__":
    main()

I ran this experiment on a MacBook Pro (2.2 GHz Quad-Core Intel Core i7) running macOS Catalina 10.15.3. The Python version used was 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin.

Conclusions

This experiment demonstrates that it is possible to build a working real-time Morse decoder based on a deep learning TensorFlow model, even using a slow interpreted language like Python. The approach taken here is quite simplistic and lacks some key functionality, such as alignment of the decoded text to the audio timeline.

It also shows that there is still more work to do in order to build a fully functioning, open source, high performance Morse decoder. A better event-driven software architecture would allow building a proper user interface with some controls, such as audio filtering. Such an architecture would also enable server-side decoders running on audio feeds from WebSDR receivers and the like.
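
As a rough illustration of such a decoupling, here is a sketch using only the Python standard library (the queue layout is hypothetical; get_sample, get_specgram and infer_image refer to the script above):

import queue
import threading

audio_q = queue.Queue(maxsize=16)   # raw microphone chunks from the capture thread
text_q = queue.Queue()              # decoded text for a UI or a server endpoint

def capture_loop(stream, pa):
    """Producer: push microphone chunks onto the audio queue."""
    while True:
        audio_q.put(get_sample(stream, pa))

def decode_loop(model):
    """Consumer: turn audio chunks into spectrogram frames and decoded text."""
    while True:
        chunk = audio_q.get()
        arr2D, freqs, bins = get_specgram(chunk, rate)
        # ... assemble the 128x32 frame and call infer_image(model, img) here ...
        # text_q.put(recognized)

# threading.Thread(target=capture_loop, args=(stream, pa), daemon=True).start()
# threading.Thread(target=decode_loop, args=(model,), daemon=True).start()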

Finally, the TensorFlow model in this experiment has a very small training set, only 27.8 hours of audio. Commercial ASR (automatic speech recognition) engines, by comparison, have been trained using over 1000X more labeled audio material. To get better performance from deep learning models you need a lot of high quality labeled training material that matches the typical sound environment the model will be used in.


73
Mauri AG1LE





9 comments:

  1. Dear Mauri,

    First, thanks for your very interesting and promising work!
    I have a question: why use a 128x32 image, and why decode the entire word?

    I worked on a wideband CW 192 kHz I/Q decoder.
    What I get out of it is that it is largely possible to do:
    An FFT in real time.
    Detect peaks and therefore a probable CW emission.
    Recover an array of this FFT only on the "column" of the signal, i.e. the equivalent of an image of 1 px by x lines.
    In order to have signals that really take shape, you have to do a sliding FFT on two buffers to combine the fairly fine frequency resolution of a single CW signal, probably 100 Hz wide, with a time resolution of at least 10 ms per line (or the equivalent of one half "dit" per line).
    I think that with one last trick to separate the characters and analyze each of them, your decoder will be operational.
    Starting from human analysis, this is what we do to make our work easier...

    73 F4HTB

  2. Hi Olivier
    > I have a question: why use a 128x32 image, and why decode the entire word?

    The network dimensions of the model can be changed. I kept the 128x32 of the original model (see https://ag1le.blogspot.com/2019/02/training-computer-to-listen-and-decode.html) because it allows multiple characters (or even whole words at higher CW speeds) to fit in the 4 second time window. If you look at the decoder code, there are actually 3 different decoders - BestPath, BeamSearch and WordBeamSearch. The latter two apply a language model to improve detection of words in the corpus used for training. This corpus could include callsigns and other commonly used words in CW.

    > A FFT in real time.
    > Detect peaks and therefore a probable cw emission.

    This is exactly what this software is doing. See https://github.com/ag1le/deepmorse-decoder/blob/master/specgram.py#L71-L80 - this is where the signal is converted to a spectrogram using FFT. This line https://github.com/ag1le/deepmorse-decoder/blob/master/specgram.py#L142 detects the peak frequency of the signal, and this line https://github.com/ag1le/deepmorse-decoder/blob/master/specgram.py#L146 takes a +/- 16 Hz spectrum around the peak (so essentially a 32 Hz bandpass filter).

    > In order to have signals that really take shape, you have to do a sliding FFT on two buffers to combine the fairly fine frequency resolution of a single CW signal, probably 100 Hz wide, with a time resolution of at least 10 ms per line (or the equivalent of one half "dit" per line).

    Take a look at these lines: https://github.com/ag1le/deepmorse-decoder/blob/master/specgram.py#L25-L27 - I am taking overlapping FFT frames and also padding (see https://github.com/ag1le/deepmorse-decoder/blob/master/specgram.py#L78) to get the required spectral and temporal resolution.

    > I think that with one last trick to separate the characters and analyze each of the characters, your decoder will be operational.

    I did some work over the weekend - these lines https://github.com/ag1le/deepmorse-decoder/blob/master/specgram.py#L83-L109 look for near matches between consecutive frames and append new characters as they get decoded.

    Thanks for your thoughtful feedback.

    br
    Mauri AG1LE


  3. I just saw your answer by chance ... I don't know why I missed it.
    Thank you for your work, I will review this and follow your work closely.
    :) 73 F4HTB

  4. Dear Mauri,

    First of all, thank you for your article; I read it with great interest.

    I have one question for you. One of the problems with the real-time decoder that you mentioned in the text is when the current image frame cuts a Morse character into parts. Have you solved this problem by any chance? Or have you thought about how to solve it?

  5. Hi Unknown
    Thanks for your feedback. I have not solved the problem of the current image frame cutting a Morse character into parts yet. I have tested a few possible solutions: (1) make the image frame much longer so that one frame can contain many Morse characters; this way the overall decoding error rate is lower, as the problem appears less frequently. (2) Use overlapping image frames and text post-processing to find overlapping decoded characters and eliminate mis-decoded characters between the frames. These methods provide some improvement but don't eliminate the problem.

    Do you have some other / better ideas?

    Replies
    1. Thanks for your sincere reply. In addition to your solutions, I've also thought about training a model that recognizes the spaces between strings, but it's going to be too complicated.
      Speech recognition technologies are being commercialized these days, so there must be a simple solution. I'll let you know when I find out.

      Thank you

    2. Hi, I don't know if it could help, but maybe you could use overlapping frames: if your frame lasts 8 seconds, you could start the next one 2 seconds before the end of the previous and decode the overlapping part twice. This way you can be certain that a character not finished in the first frame will be found in the next. To avoid duplicate characters, you could store an absolute starting time: if you get 2 characters starting at the same time (and frequency, of course), you keep only the longer one.
