Sunday, February 10, 2019

Performance characteristics of the ML Morse Decoder





In my previous blog post I described a new Tensorflow based Machine Learning model that learns Morse code from annotated audio .WAV files with 8 kHz sample rate.

In order to evaluate the performance characteristic of  the decoding accuracy from noisy audio source files I created a set of training & validation materials with Signal-to-Noise Ratio from -22dB to +30 dB.   Target SNR_dB was created using the following Python code:

        # Desired linear SNR
        SNR_linear = 10.0**(SNR_dB/10.0)

        # Measure power of signal - assume zero mean 
        power = morsecode.var()

        # Calculate required noise power for desired SNR
        noise_power = power/SNR_linear

        # Generate noise with calculated power (mu=0, sigma=1)
        noise = np.sqrt(noise_power)*np.random.normal(0,1,len(morsecode))

        # Add noise to signal
        morsecode = noise + morsecode

These audio .WAV files contain random words with maximum 5 characters - 5000 samples at each SNR level with  95% used for training and 5% for validation. The Morse speed in each audio sample was randomly selected from 30 WPM or 25 WPM.

The training was performed until 5 consecutive epochs did not improve the character error rate. The duration of these training sessions varied from 15 - 45 minutes on Macbook Pro with2.2 GHz Intel Core i7 CPU. 

I captured and plotted the Character Error Rate (CER) and Signal-to-Noise Ratio (SNR) of each completed training and validation session.   The following graph shows that the Morse decoder performs quite well until about -12 dB SNR level and below that the decoding accuracy drops fairly dramatically.

CER vs. SNR graph





























To view how noisy these files are here are some random samples - first 4 seconds of 8KHz audio file is demodulated, filtered using  25Hz 3rd order Butterworth filter and decimated by 125 to fit into a (128,32) vector. These vectors is shown as grayscale images below:

-6 db SNR



-11 dB SNR


-13 dB SNR









-16 dB SNR








Conclusions

The Tensorflow model appears to perform quite well on decoding noisy audio files, at least when the training set and validation set have the same SNR level.  

The next experiments could include more variability with a much bigger training dataset that has a combination of different SNR, Morse speed and other variables.  The training duration depends on the amount of training data so it can take a while to perform these larger scale experiments on a home computer.  

73 de
Mauri AG1LE 



Saturday, February 2, 2019

Training a Computer to Listen and Decode Morse Code

Abstract

I trained  a Tensorflow based CNN-LSTM-CTC model  with 5.2 hours of Morse audio training set  (5000 files) and achieved character error rate of 0.1% and word accuracy of 99.5%  I tested the model with audio files containing various levels of noise and found the model to decode relatively accurately down to -3 dB SNR level. 

Introduction

 Decoding Morse code from audio signals is not a novel idea. The author has written many different software decoder implementations that use simplistic models to convert a sequence of "Dits" and "Dahs" to corresponding text.  When the audio signal is noise free and there is no interference,  these simplistic methods work fairly well and produce nearly error free decoding.  Figure 1. below shows "Hello World" with 35 dB signal-to-noise ratio that most conventional decoders don't have any problems decoding.

"Hello World" with 30 dB SNR 





Figure 2 below shows the same "Hello World" but with -12 dB signal-to-noise ratio using exactly same process as above to extract the demodulated envelope. Humans can still hear and even recognize the Morse code faintly in the noise. Computers equipped with these simplistic models have great difficulties decoding anything meaningful out of this signal.  In ham radio terms the difference of 47 dB corresponds roughly eight S units - human ears & brain can still decode S2 level signals whereas conventional software based Morse decoders produce mostly gibberish.

"Hello World" with -12 dB SNR 





New Approach - Machine Learning

I have been quite interested in Machine Learning (ML) technologies for a while.  From software development perspective ML is changing the paradigm how we are processing data.

In traditional programming we look at the input data and try to write a program that uses some processing steps to come up with the output data. Depending on the complexity of the problem software developer may need to spend quite a long time coming up with the correct algorithms to produce the right output data.  From Morse decoder perspective this is how most decoders work:  they take input audio data that contains the Morse signals and after many complex operations the correct decoded text appears on the screen. 

Machine Learning changes this paradigm. As a ML engineer you need to curate a dataset that has a representative selection of input data with corresponding output data (also known as label data).  The computer then applies a training algorithm to this dataset that eventually discovers the correct "program" - the ML model that provides the best matching  function that can infer the correct output, given the input data.

See Figure 3. that tries to depict this difference between traditional programming and the new approach with Machine Learning.
Programming vs. Machine Learning
























So what does this new approach mean in practice?  Instead of trying to figure out ever more complex software algorithms to improve your data processing and accuracy of decoding,  you can select from some standard machine learning algorithms that are available in open source packages like Tensorflow and focus on building a neural network model and curating a large dataset to train this model. The trained model can then be used to make the decoding from the input audio data. This is exactly what I did in the following experiment.

I took a Tensorflow implementation of Handwritten Text Recognition created by Harald Scheidl [3] that he has posted in Github as an open source project.  He has provided excellent documentation on how the model works as well as references to the IAM dataset that he is using for training the handwritten text recognition.

Why would a model created for  handwritten text recognition work for Morse code recognition?

It turns out that the Tensorflow standard learning algorithms used for handwriting recognition are very similar to ones used for speech recognition.

The figures  below are from Hannun, "Sequence Modeling with CTC", Distill, 2017. In the article Hannun [2] shows that the (x,y) coordinates of a pen stroke or pixels in image can be recognized as text, like the spectrogram of speech audio signals.  Morse code has similar properties as speech - the speed can vary a lot and hand-keyed code can have unique rhythm patterns that make it difficult to align signals to decoded text. The common theme is that we have some variable length input data that need to be aligned with variable length output data.  The algorithm that comes with Tensorflow is called Connectionist Temporal Classification (CTC) [1].


 

Morse Dataset

The Morse code audio file can be easily converted to a representation that is suitable as input data for these neural networks.  I am using single track (mono) WAV files with 8 kHz sampling frequency.

The following few lines of Python code takes 4 seconds sample from an existing WAV audio file, finds the signal peak frequency, de-modulates and decimates the data so that we get a (1,256) vector that we re-shape to (128, 32) and write into a PNG file.

def find_peak(fname):
    # Find the signal frequency and maximum value
    Fs, x = wavfile.read(fname)
    f,s = periodogram(x, Fs,'blackman',8192,'linear', False, scaling='spectrum')
    threshold = max(s)*0.9  # only 0.4 ... 1.0 of max value freq peaks included
    maxtab, mintab = peakdet(abs(s[0:int(len(s)/2-1)]), threshold,f[0:int(len(f)/2-1)] )

    return maxtab[0,0]

def demodulate(x, Fs, freq):
    # demodulate audio signal with known CW frequency 
    t = np.arange(len(x))/ float(Fs)
    mixed =  x*((1 + np.sin(2*np.pi*freq*t))/2 )

    #calculate envelope and low pass filter this demodulated signal
    #filter bandwidth impacts decoding accuracy significantly 
    #for high SNR signals 40 Hz is better, for low SNR 20Hz is better
    # 25Hz is a compromise - could this be made an adaptive value?
    low_cutoff = 25. # 25 Hz cut-off for lowpass
    wn = low_cutoff/ (Fs/2.)    
    b, a = butter(3, wn)  # 3rd order butterworth filter
    z = filtfilt(b, a, abs(mixed))
    
    # decimate and normalize
    decimate = int(Fs/64) # 8000 Hz / 64 = 125 Hz => 8 msec / sample 
    o = z[0::decimate]/max(z)
    return o

def process_audio_file(fname, x, y, tone):
    Fs, signal = wavfile.read(fname)
    dur = len(signal)/Fs
    o = demodulate(signal[(Fs*(x)):Fs*(x+y)], Fs, tone)
    return o, dur

filename = "error.wav"
tone = find_peak(filename)
o,dur = process_audio_file(filename,0,4, tone)
im = o[0::1].reshape(1,256)
im = im*256.

img = cv2.resize(im, (128, 32), interpolation = cv2.INTER_AREA)
cv2.imwrite("error.png",img)

Here is the resulting PNG image - it contains  "ERROR M". The labels are kept in a file that contains also the corresponding audio file name.

4 second audio sample converted to a (128,32) PNG file







It is very easy to produce a lot of training and validation data with this method. The important part is that each audio file must have accurate "labels" - this is the textual representation of the Morse audio file.

I created a small Python script to produce this kind of Morse training and validation dataset. With a few parameters you can generate as much  data as you want with different speed and noise levels.

Model

I used Harald's model to start the Morse decoding experiments. 

The model consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer. The illustration below gives an overview of the NN (green: operations, pink: data flowing through NN) and here follows a short description:
  • The input image is a gray-value image and has a size of 128x32
  • 5 CNN layers map the input image to a feature sequence of size 32x256
  • 2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32x80. Each matrix-element represents a score for one of the 80 characters at one of the 32 time-steps
  • The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
  • Batch size is set to 50
















It is not hard to imagine making some changes to the model to allow for longer audio clips to be decoded. Right now the limit is about 4 seconds audio converted to (128x32) input image.  Harald is actually providing details of a model that can handle larger input image (800x64) and output up to 100 characters strings.

Experiment

Here are parameters I used for this experiment:

  • 5000 samples, split into training and validation set: 95% training - 5% validation
  • Each sample has 2 random words, max word length is 5 characters
  • Morse speed randomly selected from  [20, 25, 30] words-per-minute  
  • Morse audio SNR: 40 dB 
  • batchSize: 100  
  • imgSize: [128,32] 
  • maxTextLen: 32
  • earlyStopping: 20 

Training time  was 1hr 51mins  on a Macbook Pro 2.2 GHz Intel Core i7
Training curves of character error rate, word accuracy and loss after 50 epochs were the following:


Training over 50 epochs














The best character error rate was 14.9% and word accuracy was 36.0%.  These are not great numbers - the reason was that I had training data containing 2 words in each sample - in many cases this was too many characters to fit in the 4 second time window, therefore the training algorithm did not see the second word in the training material in many cases. 

I did re-run the experiment with 5000 samples, but with just one word in each sample.  It took 54 mins  7 seconds to do this training.  New parameters are below:

model:
    # model constants
    batchSize: 100  
    imgSize: !!python/tuple [128,32] 
    maxTextLen: 32
    earlyStopping: 5

morse:
    fnTrain:    "morsewords.txt"
    fnAudio:    "audio/"
    count:      5000
    SNR_dB:     
      - 20
      - 30
      - 40
    f_code:     600
    Fs:         8000
    code_speed: 
      - 30
      - 25
      - 20
    length_N:   65000
    play_sound: False
    word_max_length: 5
    words_in_sample: 1

experiment:
    modelDir:   "model/"
    fnAccuracy: "model/accuracy.txt"
    fnTrain:    "model/morsewords.txt"
    fnInfer:    "model/test.png"
    fnCorpus:   "model/corpus.txt"
    fnCharList: "model/charList.txt"


Here is the outcome of that second training session:

Total training time was 0:54:07.857731
Character error rate:  0.1%. Word accuracy: 99.5%.

Training over 33 epochs



With a larger dataset the training will take longer. One possibility would be to use AWS cloud computing service to accelerate the training for a much larger dataset. 

Note that the model did not know anything about Morse code at the start. It did learn the character set, the structure of the Morse code and the words just by "listening" through the provided sample files. This is approximately 5.3 hours of Morse code audio materials with random words.   (5000 files * 95% * 4 sec/file = 19000 seconds).  

It would be great to get some comparative data on how quickly humans will learn to produce similar character error rate. 

Results

I created a small "helloword.wav" audio file with HELLO WORLD text at 25 WPM in different signal-to-noise ratios (-6, -3, +6, +50) dB to test the first model. 

Attempting to decode the content of the audio file I got the following results.  Given that the training was done with +40 dB samples I was quite surprised to see relatively good decoding accuracy. The model also provides probability how confident it is about the result. These values vary between 0.4% to 5.7%. 


File: -6 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:40:51.970393: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.00420194] texts:['HELL Q PE'] 
Recognized: "HELL Q PE"
Probability: 0.00420194

['HELL Q PE']

-6 dB HELLO WORLD











File: -3 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:36:32.838156: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.05750186] texts:['HELLO WOE'] 
Recognized: "HELLO WOE"
Probability: 0.0575019

['HELLO WOE']
-3 dB HELLO WORLD







File: +6 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:38:57.549928: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.03523131] texts:['HELLO WOT'] 
Recognized: "HELLO WOT"
Probability: 0.0352313
['HELLO WOT']

+6 dB HELLO WORLD





File: +50 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:42:55.403738: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
inferBatch: probs:[ 0.03296029] texts:['HELLO WOT'] 
Recognized: "HELLO WOT"
Probability: 0.0329603
['HELLO WOT']
+50 dB HELLO WORLD








In comparison, I took one file that was used in the training process. This file contains "HELLO HERO" text at +40 dB SNR. Here is what the decoder was able to decode - with much higher probability 51.8% 

File: +40 dB SNR 

python MorseDecoder.py -f audio/6e753ac57d4849ef87d5146e158610f0.wav
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:53:27.029448: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.51824665] texts:['HELLO HERO'] 
Recognized: "HELLO HERO"
Probability: 0.518247
['HELLO HERO']
+40 dB HELLO HERO

Conclusions

This is my first machine learning experiment where I used Morse audio files for both training and validation of the model.  The current model limitation is that only 4 second audio clips can be used.  However, it is very feasible to build a larger model that can decode longer audio clip with a single inference operation.  Also, it would be possible to feed a longer audio file in 4 second pieces to get decoding happening across the whole file.

This Morse decoder doesn't have a single line of code that would explicitly spell out the Morse codebook.  The model literally learned from the training data what Morse code is and how to decode it.  It represents a new paradigm in building decoders, and is using similar technology what companies like Google, Microsoft, Amazon and Apple are using for their speech recognition products.

I hope that this experiment demonstrates to the ham radio community how to build high quality, open source Morse decoders using a simple, standards based ML architecture.  With more computing capacity and larger training / validation datasets that contain accurate annotated (labeled) audio files  it is now feasible to build a decoder that will surpass the accuracy of conventional decoders (like the one in FLDIGI software).

73  de Mauri
AG1LE

Software and Instructions

The initial version of the software is available in Github - see here

Using from the command line:

python MorseDecoder.py -h
usage: MorseDecoder.py [-h] [--train] [--validate] [--generate] [-f FILE]

optional arguments:
  -h, --help  show this help message and exit
  --train     train the NN
  --validate  validate the NN
  --generate  generate a Morse dataset of random words
  -f FILE     input audio file


To get started you need to generate audio training material. The count variable in model.yaml config file tells how many samples will get generated. Default is 5000.

python MorseDecoder.py --generate


Next you need to perform the training. You need to have "audio/", "image/" and "model/" subdirectories on the folder you are running the program.

python MorseDecoder.py --train


Last this to do is to validate the model:

python MorseDecoder.py --validate

To have the model decode a file you should use:

python MorseDecoder.py -f audio/myfilename.wav 




Config file model.yaml  (first training session):
model:
    # model constants
    batchSize: 100  
    imgSize: !!python/tuple [128,32] 
    maxTextLen: 32
    earlyStopping: 20 

morse:
    fnTrain:    "morsewords.txt"
    fnAudio:    "audio/"
    count:      5000
    SNR_dB:     20
    f_code:     600
    Fs:         8000
    code_speed: 30
    length_N:   65000
    play_sound: False
    word_max_length: 5
    words_in_sample: 2

experiment:
    modelDir:   "model/"
    fnAccuracy: "model/accuracy.txt"
    fnTrain:    "model/morsewords.txt"
    fnInfer:    "model/test.png"
    fnCorpus:   "model/corpus.txt"
    fnCharList: "model/charList.txt"

Config file model.yaml  (second training session):
model:
    # model constants
    batchSize: 100  
    imgSize: !!python/tuple [128,32] 
    maxTextLen: 32
    earlyStopping: 5

morse:
    fnTrain:    "morsewords.txt"
    fnAudio:    "audio/"
    count:      5000
    SNR_dB:     
      - 20
      - 30
      - 40
    f_code:     600
    Fs:         8000
    code_speed: 
      - 30
      - 25
      - 20
    length_N:   65000
    play_sound: False
    word_max_length: 5
    words_in_sample: 1

experiment:
    modelDir:   "model/"
    fnAccuracy: "model/accuracy.txt"
    fnTrain:    "model/morsewords.txt"
    fnInfer:    "model/test.png"
    fnCorpus:   "model/corpus.txt"
    fnCharList: "model/charList.txt"

References

[1]  A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376. https://www.cs.toronto.edu/~graves/icml_2006.pdf
[2]  Hannun, "Sequence Modeling with CTC", Distill, 2017.  https://distill.pub/2017/ctc/
[3] Harald Scheidl "Handwritten Text Recognition with TensorFlow", https://github.com/githubharald/SimpleHTR