Abstract
I trained a TensorFlow-based CNN-LSTM-CTC model with 5.2 hours of Morse audio training data (5000 files) and achieved a character error rate of 0.1% and a word accuracy of 99.5%. I tested the model with audio files containing various levels of noise and found that it decodes relatively accurately down to a -3 dB SNR level.
Introduction
Decoding Morse code from audio signals is not a novel idea. The author has written many different software decoder implementations that use simplistic models to convert a sequence of "Dits" and "Dahs" to the corresponding text. When the audio signal is noise free and there is no interference, these simplistic methods work fairly well and produce nearly error-free decoding. Figure 1 below shows "Hello World" with a 35 dB signal-to-noise ratio, which most conventional decoders have no problem decoding.
Figure 1: "Hello World" with 35 dB SNR
Figure 2 below shows the same "Hello World" but with a -12 dB signal-to-noise ratio, using exactly the same process as above to extract the demodulated envelope. Humans can still hear and even recognize the Morse code faintly in the noise. Computers equipped with these simplistic models have great difficulty decoding anything meaningful out of this signal. In ham radio terms the difference of 47 dB corresponds to roughly eight S units (one S unit is about 6 dB) - human ears and brain can still decode S2-level signals, whereas conventional software-based Morse decoders produce mostly gibberish.
Figure 2: "Hello World" with -12 dB SNR
New Approach - Machine Learning
I have been quite interested in Machine Learning (ML) technologies for a while. From a software development perspective, ML is changing the paradigm of how we process data.
In traditional programming we look at the input data and try to write a program that uses some processing steps to come up with the output data. Depending on the complexity of the problem, the software developer may need to spend quite a long time coming up with the correct algorithms to produce the right output data. From a Morse decoder perspective this is how most decoders work: they take input audio data that contains the Morse signals, and after many complex operations the correct decoded text appears on the screen.
Machine Learning changes this paradigm. As an ML engineer you need to curate a dataset that has a representative selection of input data with corresponding output data (also known as label data). The computer then applies a training algorithm to this dataset that eventually discovers the correct "program" - the ML model that provides the best-matching function to infer the correct output, given the input data.
Figure 3 depicts this difference between traditional programming and the new Machine Learning approach.
Figure 3: Programming vs. Machine Learning
So what does this new approach mean in practice? Instead of trying to figure out ever more complex software algorithms to improve your data processing and decoding accuracy, you can select from standard machine learning algorithms that are available in open source packages like TensorFlow, and focus on building a neural network model and curating a large dataset to train that model. The trained model can then be used to decode the input audio data. This is exactly what I did in the following experiment.
I took a TensorFlow implementation of Handwritten Text Recognition created by Harald Scheidl [3], which he has posted on GitHub as an open source project. He provides excellent documentation on how the model works, as well as references to the IAM dataset that he uses for training the handwritten text recognition.
Why would a model created for handwritten text recognition work for Morse code recognition?
It turns out that the standard TensorFlow learning algorithms used for handwriting recognition are very similar to the ones used for speech recognition.
The figures below are from Hannun, "Sequence Modeling with CTC", Distill, 2017. In the article Hannun [2] shows that the (x, y) coordinates of a pen stroke or the pixels in an image can be recognized as text, just like the spectrogram of a speech audio signal. Morse code has similar properties to speech - the speed can vary a lot, and hand-keyed code can have unique rhythm patterns that make it difficult to align the signal to the decoded text. The common theme is that we have variable-length input data that needs to be aligned with variable-length output data. The algorithm that comes with TensorFlow is called Connectionist Temporal Classification (CTC) [1].
Morse Dataset
A Morse code audio file can easily be converted to a representation that is suitable as input data for these neural networks. I am using single-track (mono) WAV files with an 8 kHz sampling frequency.
The following few lines of Python code take a 4 second sample from an existing WAV audio file, find the signal peak frequency, then demodulate and decimate the data so that we get a (1, 256) vector, which is resized to (128, 32) and written into a PNG file.
import numpy as np
import cv2
from scipy.io import wavfile
from scipy.signal import periodogram, butter, filtfilt
# peakdet() is a separate peak-detection helper used by the project (not shown here)

def find_peak(fname):
    # Find the signal frequency and maximum value
    Fs, x = wavfile.read(fname)
    f, s = periodogram(x, Fs, 'blackman', 8192, 'linear', False, scaling='spectrum')
    threshold = max(s)*0.9  # only peaks within 0.9 ... 1.0 of max value included
    maxtab, mintab = peakdet(abs(s[0:int(len(s)/2-1)]), threshold, f[0:int(len(f)/2-1)])
    return maxtab[0, 0]

def demodulate(x, Fs, freq):
    # demodulate audio signal with known CW frequency
    t = np.arange(len(x))/float(Fs)
    mixed = x*((1 + np.sin(2*np.pi*freq*t))/2)
    # calculate envelope and low pass filter this demodulated signal
    # filter bandwidth impacts decoding accuracy significantly
    # for high SNR signals 40 Hz is better, for low SNR 20 Hz is better
    # 25 Hz is a compromise - could this be made an adaptive value?
    low_cutoff = 25.  # 25 Hz cut-off for lowpass
    wn = low_cutoff/(Fs/2.)
    b, a = butter(3, wn)  # 3rd order Butterworth filter
    z = filtfilt(b, a, abs(mixed))
    # decimate and normalize: keep every 125th sample => 8000 Hz / 125 = 64 Hz (~15.6 msec / sample)
    decimate = int(Fs/64)
    o = z[0::decimate]/max(z)
    return o

def process_audio_file(fname, x, y, tone):
    # take a y-second slice starting at x seconds and demodulate it
    Fs, signal = wavfile.read(fname)
    dur = len(signal)/Fs
    o = demodulate(signal[Fs*x:Fs*(x + y)], Fs, tone)
    return o, dur

filename = "error.wav"
tone = find_peak(filename)
o, dur = process_audio_file(filename, 0, 4, tone)
im = o.reshape(1, 256)  # 4 seconds at 64 Hz => 256 samples
im = im*256.
img = cv2.resize(im, (128, 32), interpolation=cv2.INTER_AREA)
cv2.imwrite("error.png", img)
Here is the resulting PNG image - it contains "ERROR M". The labels are kept in a file that also contains the corresponding audio file name.
Figure: 4 second audio sample converted to a (128,32) PNG file
It is very easy to produce a lot of training and validation data with this method. The important part is that each audio file must have accurate "labels" - this is the textual representation of the Morse audio file.
I created a small Python script to produce this kind of Morse training and validation dataset. With a few parameters you can generate as much data as you want with different speed and noise levels.
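As an illustration of the idea (this is not the actual generator script), here is a minimal sketch that renders a word with standard Morse timing - a dit lasts 1.2/WPM seconds, a dah three dits, with one-unit gaps inside a character and three-unit gaps between characters - and adds white Gaussian noise scaled to a target SNR. The codebook excerpt, file name and parameters are made up for the example.
import numpy as np
from scipy.io import wavfile

MORSE = {'H': '....', 'E': '.', 'L': '.-..', 'O': '---'}     # excerpt of the codebook

def keying_envelope(text, wpm, fs=8000):
    # on/off keying envelope (1 = tone, 0 = silence), one list entry per Morse unit
    units = []
    for ch in text:
        for sym in MORSE[ch]:
            units += [1]*(1 if sym == '.' else 3) + [0]      # element + 1 unit gap
        units += [0, 0]                                      # pad to 3 units between characters
    unit_len = int(1.2/wpm*fs)                               # samples per unit (PARIS standard)
    return np.repeat(units, unit_len)

def make_sample(text, wpm, snr_db, f_code=600, fs=8000, fname="sample.wav"):
    keying = keying_envelope(text, wpm, fs)
    t = np.arange(len(keying))/float(fs)
    tone = np.sin(2*np.pi*f_code*t)*keying
    noise = np.random.randn(len(tone))
    # scale noise so that 10*log10(P_signal/P_noise) equals snr_db
    noise *= np.sqrt(np.mean(tone**2)/(np.mean(noise**2)*10**(snr_db/10.0)))
    audio = tone + noise
    wavfile.write(fname, fs, (audio/np.max(np.abs(audio))*32767).astype(np.int16))

make_sample("HELLO", wpm=25, snr_db=20)
The labels file would then simply pair each generated WAV file name with its text.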
Model
I used Harald's model to start the Morse decoding experiments.
The model consists of 5 CNN layers, 2 RNN (LSTM) layers and a final CTC loss and decoding layer. The illustration below gives an overview of the NN (green: operations, pink: data flowing through the NN), and a short description follows; a rough code sketch of a comparable architecture appears after the list:
- The input image is a gray-value image and has a size of 128x32
- 5 CNN layers map the input image to a feature sequence of size 32x256
- 2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32x80. Each matrix-element represents a score for one of the 80 characters at one of the 32 time-steps
- The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
- Batch size is set to 50
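For readers who prefer code to diagrams, here is a minimal Keras-style sketch of a comparable architecture. This is not Harald's actual implementation (SimpleHTR is written with low-level TensorFlow 1.x ops); the filter counts, kernel sizes and pooling steps are assumptions chosen only to reproduce the tensor shapes listed above, and the CTC loss/decoding step is omitted.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 80                                    # character set + CTC blank

inputs = layers.Input(shape=(128, 32, 1))           # gray-value image: width x height x 1
x = inputs
# 5 CNN layers; pooling reduces 128x32 to a 32-step sequence of 256 features
for filters, pool in zip((32, 64, 128, 128, 256),
                         ((2, 2), (2, 2), (1, 2), (1, 2), (1, 2))):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=pool)(x)
x = layers.Reshape((32, 256))(x)                    # feature sequence of size 32x256
# 2 LSTM layers propagate information through the sequence
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
outputs = layers.Dense(NUM_CLASSES)(x)              # 32x80 score matrix for the CTC layer
model = tf.keras.Model(inputs, outputs)
model.summary()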
It is not hard to imagine making some changes to the model to allow longer audio clips to be decoded. Right now the limit is about 4 seconds of audio converted to a (128x32) input image. Harald actually provides details of a model that can handle a larger input image (800x64) and output strings of up to 100 characters.
Experiment
Here are the parameters I used for this experiment:
- 5000 samples, split into training and validation set: 95% training - 5% validation
- Each sample has 2 random words, max word length is 5 characters
- Morse speed randomly selected from [20, 25, 30] words-per-minute
- Morse audio SNR: 40 dB
- batchSize: 100
- imgSize: [128,32]
- maxTextLen: 32
- earlyStopping: 20
Training time was 1 hr 51 min on a MacBook Pro with a 2.2 GHz Intel Core i7.
Training curves of character error rate, word accuracy and loss after 50 epochs were the following:
Figure: Training over 50 epochs
The best character error rate was 14.9% and word accuracy was 36.0%. These are not great numbers. The reason was that each training sample contained 2 words, which in many cases was too many characters to fit in the 4 second time window, so the training algorithm often never saw the second word. (At 20 WPM a standard five-character word takes about 3 seconds, so two words frequently overflow a 4 second window.)
I re-ran the experiment with 5000 samples, but with just one word in each sample. This training took 54 minutes 7 seconds. The new parameters are below:
model:
  # model constants
  batchSize: 100
  imgSize: !!python/tuple [128,32]
  maxTextLen: 32
  earlyStopping: 5
morse:
  fnTrain: "morsewords.txt"
  fnAudio: "audio/"
  count: 5000
  SNR_dB:
    - 20
    - 30
    - 40
  f_code: 600
  Fs: 8000
  code_speed:
    - 30
    - 25
    - 20
  length_N: 65000
  play_sound: False
  word_max_length: 5
  words_in_sample: 1
experiment:
  modelDir: "model/"
  fnAccuracy: "model/accuracy.txt"
  fnTrain: "model/morsewords.txt"
  fnInfer: "model/test.png"
  fnCorpus: "model/corpus.txt"
  fnCharList: "model/charList.txt"
Here is the outcome of that second training session:
Total training time was 0:54:07.857731
Character error rate: 0.1%. Word accuracy: 99.5%.
Figure: Training over 33 epochs
With a larger dataset the training will take longer. One possibility would be to use an AWS cloud computing service to accelerate the training of a much larger dataset.
Note that the model did not know anything about Morse code at the start. It learned the character set, the structure of Morse code and the words just by "listening" through the provided sample files - approximately 5.3 hours of Morse audio material with random words (5000 files * 95% * 4 sec/file = 19000 seconds).
It would be great to get some comparative data on how quickly humans learn to reach a similar character error rate.
Results
I created a small "helloworld.wav" audio file with the text HELLO WORLD at 25 WPM at different signal-to-noise ratios (-6, -3, +6, +50 dB) to test the first model.
Attempting to decode the content of these audio files I got the following results. Given that the training was done with +40 dB samples, I was quite surprised to see relatively good decoding accuracy. The model also reports a probability of how confident it is about the result; these values varied between 0.4% and 5.7%.
File: -6 dB SNR
python MorseDecoder.py -f audio/helloworld.wav
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:40:51.970393: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.00420194] texts:['HELL Q PE']
Recognized: "HELL Q PE"
Probability: 0.00420194
['HELL Q PE']
Figure: -6 dB HELLO WORLD
File: -3 dB SNR
python MorseDecoder.py -f audio/helloworld.wav
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:36:32.838156: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.05750186] texts:['HELLO WOE']
Recognized: "HELLO WOE"
Probability: 0.0575019
['HELLO WOE']
Figure: -3 dB HELLO WORLD
File: +6 dB SNR
python MorseDecoder.py -f audio/helloworld.wav
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:38:57.549928: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.03523131] texts:['HELLO WOT']
Recognized: "HELLO WOT"
Probability: 0.0352313
['HELLO WOT']
Figure: +6 dB HELLO WORLD
File: +50 dB SNR
python MorseDecoder.py -f audio/helloworld.wav
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:42:55.403738: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
inferBatch: probs:[ 0.03296029] texts:['HELLO WOT']
Recognized: "HELLO WOT"
Probability: 0.0329603
['HELLO WOT']
Figure: +50 dB HELLO WORLD
For comparison, I took one file that was used in the training process. This file contains the text "HELLO HERO" at +40 dB SNR. Here is what the decoder produced - with a much higher probability of 51.8%:
File: +40 dB SNR
python MorseDecoder.py -f audio/6e753ac57d4849ef87d5146e158610f0.wav
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:53:27.029448: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.51824665] texts:['HELLO HERO']
Recognized: "HELLO HERO"
Probability: 0.518247
['HELLO HERO']
Figure: +40 dB HELLO HERO
Conclusions
This is my first machine learning experiment where I used Morse audio files for both training and validation of the model. The current model limitation is that only 4 second audio clips can be used. However, it is very feasible to build a larger model that can decode a longer audio clip with a single inference operation. It would also be possible to feed a longer audio file to the model in 4 second pieces to decode across the whole file, as sketched below.
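As a rough illustration of that chunking idea (not part of the current code), a long recording could be split into consecutive 4 second windows, each window demodulated with the demodulate() function shown earlier and passed to the model; decode_chunk below is a hypothetical stand-in for the per-window inference step.
from scipy.io import wavfile

def decode_long_file(fname, tone, decode_chunk, chunk_sec=4):
    # Split the recording into consecutive chunk_sec windows and decode each one
    Fs, signal = wavfile.read(fname)
    chunk = chunk_sec * Fs
    texts = []
    for start in range(0, len(signal) - chunk + 1, chunk):
        envelope = demodulate(signal[start:start + chunk], Fs, tone)   # from the earlier code
        texts.append(decode_chunk(envelope))     # hypothetical per-window model inference
    return " ".join(texts)
One caveat: characters that straddle a window boundary would be garbled, so overlapping windows or splitting at silent gaps would likely work better.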
This Morse decoder doesn't have a single line of code that explicitly spells out the Morse codebook. The model literally learned from the training data what Morse code is and how to decode it. It represents a new paradigm in building decoders, using technology similar to what companies like Google, Microsoft, Amazon and Apple use for their speech recognition products.
I hope that this experiment demonstrates to the ham radio community how to build high quality, open source Morse decoders using a simple, standards-based ML architecture. With more computing capacity and larger training and validation datasets containing accurately annotated (labeled) audio files, it is now feasible to build a decoder that will surpass the accuracy of conventional decoders (like the one in the FLDIGI software).
73 de Mauri
AG1LE
Software and Instructions
The initial version of the software is available on GitHub - see here.
Usage from the command line:
python MorseDecoder.py -h
usage: MorseDecoder.py [-h] [--train] [--validate] [--generate] [-f FILE]
optional arguments:
-h, --help show this help message and exit
--train train the NN
--validate validate the NN
--generate generate a Morse dataset of random words
-f FILE input audio file
To get started you need to generate the audio training material. The count variable in the model.yaml config file tells how many samples will be generated; the default is 5000.
python MorseDecoder.py --generate
Next you need to perform the training. You need to have "audio/", "image/" and "model/" subdirectories in the folder where you are running the program.
python MorseDecoder.py --train
The last thing to do is to validate the model:
python MorseDecoder.py --validate
To have the model decode a file you should use:
python MorseDecoder.py -f audio/myfilename.wav
Config file model.yaml (first training session):
model:
  # model constants
  batchSize: 100
  imgSize: !!python/tuple [128,32]
  maxTextLen: 32
  earlyStopping: 20
morse:
  fnTrain: "morsewords.txt"
  fnAudio: "audio/"
  count: 5000
  SNR_dB: 20
  f_code: 600
  Fs: 8000
  code_speed: 30
  length_N: 65000
  play_sound: False
  word_max_length: 5
  words_in_sample: 2
experiment:
  modelDir: "model/"
  fnAccuracy: "model/accuracy.txt"
  fnTrain: "model/morsewords.txt"
  fnInfer: "model/test.png"
  fnCorpus: "model/corpus.txt"
  fnCharList: "model/charList.txt"
Config file model.yaml (second training session):
model:
  # model constants
  batchSize: 100
  imgSize: !!python/tuple [128,32]
  maxTextLen: 32
  earlyStopping: 5
morse:
  fnTrain: "morsewords.txt"
  fnAudio: "audio/"
  count: 5000
  SNR_dB:
    - 20
    - 30
    - 40
  f_code: 600
  Fs: 8000
  code_speed:
    - 30
    - 25
    - 20
  length_N: 65000
  play_sound: False
  word_max_length: 5
  words_in_sample: 1
experiment:
  modelDir: "model/"
  fnAccuracy: "model/accuracy.txt"
  fnTrain: "model/morsewords.txt"
  fnInfer: "model/test.png"
  fnCorpus: "model/corpus.txt"
  fnCharList: "model/charList.txt"
References
[1] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 369–376. https://www.cs.toronto.edu/~graves/icml_2006.pdf
[2] Hannun, "Sequence Modeling with CTC", Distill, 2017. https://distill.pub/2017/ctc/
[3] Harald Scheidl "Handwritten Text Recognition with TensorFlow", https://github.com/githubharald/SimpleHTR