Sunday, July 4, 2021

Cloudberry Live - listen to your rig from anywhere with a Web UI using a Raspberry Pi

Overview 

I wanted a fully open source solution for listening to my radio(s) from my mobile phone and laptop over the web, using a low cost Raspberry Pi as the rig control server.  While there are many remote station solutions out there, I could not find one that would just work in a normal web browser (Chrome, Safari, etc.) without complicated network configuration exposing your internal WiFi network via a router.  I also wanted a solution that is really easy to install on a Raspberry Pi and easy to update as new features get added to the software.

I revisited the KX3 remote control project I did in February 2016 and started a new project, Cloudberry Live. Cloudberry Live brings several improvements, such as no need to install a Mumble client on your devices - you can listen to your radio(s) using a regular web browser.  I also upgraded my Amazon Alexa skill to stream audio to Amazon Echo devices and to control the frequency using voice commands.

Here is a short demo video showing how Cloudberry.live works:

Features 

  • Listen to your radio via web streaming from anywhere.
  • Web UI that works with mobile, tablet and laptop browsers (Chrome and Safari tested).
  • View the top 10 DX cluster spots and switch the radio to a spotted frequency with one click.
The software is currently in alpha stage - all the parts work as shown in the demo above but need refactoring and general clean-up.  The cloudberry.live proxy service currently uses a 3rd party open source proxy provider, jprq.  My plan is to host a reverse proxy myself in order to simplify the installation process.

The software is written using the Python Flask framework and bash scripts. Deployment to the Raspberry Pi is done with an Ansible playbook that configures the environment correctly.  I am using the NGINX web server to serve the web application.

The audio streaming portion uses the HTTP Live Streaming (HLS) protocol: ffmpeg captures audio from the ALSA port and encodes it in AAC format. A Python http.server on port 8000 serves the HLS traffic.  I have tested that Safari and Chrome browsers are able to stream HLS audio; Chrome requires the Play HLS M3u8 extension to be installed.
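As a sketch of this streaming chain (illustrative only - the ALSA device name hw:1,0 and the HLS segment parameters below are assumptions, not my exact production settings):

# Hypothetical sketch: capture ALSA audio with ffmpeg, encode to AAC,
# emit an HLS playlist, and serve it with Python's http.server.
import subprocess

# ALSA capture device is an assumption - check yours with `arecord -l`
ffmpeg = subprocess.Popen([
    "ffmpeg", "-f", "alsa", "-i", "hw:1,0",   # read audio from the ALSA device
    "-c:a", "aac", "-b:a", "64k",             # encode as AAC at 64 kbit/s
    "-f", "hls", "-hls_time", "2",            # 2-second HLS segments
    "-hls_list_size", "5",                    # rolling playlist of 5 segments
    "-hls_flags", "delete_segments",          # discard old segments
    "stream.m3u8",
])

# Serve stream.m3u8 and the segments over HTTP on port 8000
subprocess.run(["python3", "-m", "http.server", "8000"])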

The home screen is shown below.  It gives you the top 10 spots and a link to open the audio streaming window.  When you click a frequency link in the freq column, the server sends Hamlib commands to the radio to set the frequency and mode.  Only USB and LSB modes are supported in the current software version.
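The frequency-setting step could look roughly like the sketch below - a hypothetical Flask route that shells out to Hamlib's rigctl utility (the route name, serial device and passband value are assumptions, not the actual Cloudberry Live code):

# Hypothetical sketch: set frequency and mode via Hamlib rigctl.
# Model 3073 is Hamlib's rig ID for the Icom IC-7300.
import subprocess
from flask import Flask

app = Flask(__name__)
RIGCTL = ["rigctl", "-m", "3073", "-r", "/dev/ttyUSB0"]

@app.route("/qsy/<int:freq_hz>/<mode>")
def qsy(freq_hz, mode):
    if mode not in ("USB", "LSB"):                            # only USB and LSB for now
        return "unsupported mode", 400
    subprocess.run(RIGCTL + ["F", str(freq_hz)], check=True)  # set frequency in Hz
    subprocess.run(RIGCTL + ["M", mode, "2400"], check=True)  # set mode with 2.4 kHz passband
    return "OK"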



The Tune screen is shown below.  This is still work in progress and needs some polishing.  The Select Frequency control lets you enter the frequency using numbers. The VFO range bar lets you change the radio frequency by dragging the green selection bar.  The band selection buttons don't do anything at the moment.



The Configure Rig screen allows you to select your rig from the list of Hamlib supported radios. I am using an Icom IC-7300, which is currently the default setting.


The Search button on the menu bar lets you look up a call sign in the hamdb.org database. A pop-up window shows the station details:
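A lookup along these lines might be sketched as follows (the hamdb.org URL layout and response structure are assumptions based on the public API description):

# Hypothetical sketch: look up a call sign from the hamdb.org database.
import requests

def lookup(callsign):
    url = f"https://api.hamdb.org/{callsign}/json/cloudberry-demo"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # response is assumed to be nested as {"hamdb": {"callsign": {...}}}
    return resp.json()["hamdb"]["callsign"]

print(lookup("AG1LE"))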

Amazon Alexa Skill

I created a new Alexa Skill, Cloudberry Live (not published yet), that uses the web API interface for selecting the frequency based on DX cluster spots and HLS streaming for listening to your radio.  While the skill currently uses only my station, my goal is to implement some sort of registration process so that Alexa users would have more choices for listening to ham radio traffic from DX stations around the world using Cloudberry.live software.

This would also give people with disabilities an opportunity to enjoy listening to the HF bands using voice controlled, low cost ($20 - $35) smart speakers.  By keeping your radio (Raspberry Pi server) online you could help grow the ham community.

Installation

I have posted the software to Github in a private repo.  The software will have the following key features:
  • One step software installation to Raspberry Pi using Ansible playbooks.
  • Configure your radio using Hamlib 
  • Get your personalized Cloudberry.live weblink
I have been developing cloudberry.live on my MacBook Pro and pushing new versions to the Raspberry Pi server downstairs where my IC-7300 is located. A typical Ansible playbook update takes about 32 seconds (this includes restarting the services).  I can see the access and error logs on the server over SSH consoles - this makes debugging quite easy.

Questions? 

I am looking for collaborators to work with me on this project.  If you are interested in open source web development using the Python Flask framework, let me know by posting a comment below.


73  de 
Mauri AG1LE


Wednesday, March 31, 2021

New exciting Digital Mode CT8 for ham radio communications

April 1, 2021 

Overview 

CT8 is an exciting new digital mode designed for interactive ham radio communication where signals may be weak and fading, and openings may be short.

A beta release of CT8 offers sensitivity down to –48 dB on the AWGN channel, and DX contacts at 4 times the distance of FT8. An auto-sequencing feature offers the option to respond automatically to the first decoded reply to your CQ.

The best part of this new mode is that it is easy to learn to decode in your head, so no decoder software is needed. Alpha users of the CT8 mode report that learning to decode CT8 is ten times easier than Morse code.  For those who would rather use a computer, an open source Tensorflow based machine learning decoder is included in this beta release.

CT8 is based on a novel avian vocalization encoding scheme.  The character combinations were designed to be very easily recognizable, to leverage existing QSO practices in communication modes like CW.

Below is an example audio clip of how to establish a CT8 contact - the message format should be familiar to anybody who has listened to Morse code on the ham radio bands before.

Listen to the "CQ CQ DE AG1LE K"  - the audio has rich syllabic tonal and harmonic features that are very easy to recognize even under noisy band conditions. 

Fig 1. below shows the corresponding spectrogram. Notice the harmonic spectral features that ensure accurate symbol decoding and provide high sensitivity and tolerance against rapid fading, flutter and QRM.

Fig 1. CT8 spectrogram - CQ CQ CQ DE AG1LE K

The audio clip sample may sound a bit like a chicken.  This is actually a key feature of avian vocalization encoding.   

Scientific Background 

The idea behind the CT8 mode is not new.  A lot of research has been done on avian vocalizations over the past hundred years. Since the late 1990s digital signal processing software has been widely available, and vocal signals can be analyzed using sonograms and spectrograms on a personal computer.

In research article [1] Dr.  Nicholas Collias  described sound spectrograms of 21 of the 26 vocal signals in the extensive vocal repertoire of the African Village Weaver (Ploceus cucullatus). A spectrographic key to vocal signals helps make these signals comparable for different investigators. Short-distance contact calls are given in favorable situations and are generally characterized by low amplitude and great brevity of notes. Alarm cries are longer, louder, and often strident calls with much energy at high frequencies, whereas threat notes, also relatively long and harsh, emphasize lower frequencies. 

In a very interesting research article [2], Kevin G. McCracken and Frederick H. Sheldon conclude that the characters most subject to ecological convergence, and thus of least phylogenetic value, are first peak-energy frequency and frequency range, because sound penetration through vegetation depends largely on frequency. The most phylogenetically informative characters are number of syllables, syllable structure, and fundamental frequency, because these are more reflective of behavior and syringeal structure. The figure below gives details about heron phylogeny, corresponding spectrograms, vocal characters, and habitat distributions.

Habitat distributions suggest that avian species that inhabit open areas such as savannas, grasslands, and open marshes have higher peak-energy (J) frequencies (kHz) and broader frequency ranges (kHz) than do taxa inhabiting closed habitats such as forests. Number of syllables is the number most frequently produced. 

Ibises, tiger-herons, and boat-billed herons emit a rapid series of similar syllables; other heron vocalizations generally consist of singlets, doublets, or triplets. Syllabic structure may be tonal (i.e., pure whistled notes) or harmonic (i.e., possessing overtones; integral multiples of the base frequency). Fundamental frequency (kHz) is the base frequency of a syllable and is a function of syringeal morphology. 

These vocalization features can be used to train modern machine learning algorithms. In fact, in a series of studies published [3] between 2014 and 2016, Georgia Tech research engineer Wayne Daley and his colleagues exposed groups of six to 12 broiler chickens to moderately stressful situations - such as high temperatures, increased ammonia levels in the air and mild viral infections - and recorded their vocalizations with standard USB microphones. They then fed the audio into a machine learning program, training it to recognize the difference between the sounds of contented and distressed birds.  According to the Scientific American article [4], Carolynn “K-lynn” Smith, a biologist at Macquarie University in Australia and a leading expert on chicken vocalizations, says that although the studies published so far are small and preliminary, they are “a neat proof of concept” and “a really fascinating approach.”

What does CT8 stand for? 

Building on this solid scientific foundation, it is easy to imagine very effective communication protocols based on millions of years of evolution of various avian species. After all, birds are social animals and have very expressive and effective communication protocols, whether to warn others about an approaching predator or to invite flock members to join a feast in a corn field.

Humans have domesticated several avian species and have lived with species like the chicken (Gallus gallus domesticus) for over 8000 years.  Therefore the CT8 mode sounds inherently natural to humans and, based on extensive alpha testing performed by the development team, is much easier to learn to decode than Morse code.


CT8 stands for "Chicken Talk" version 8 -- after over a year of development effort, seven previous encoding versions tested under difficult band conditions, and hundreds of machine learning models trained, the software development team has finally been able to release the CT8 digital mode.

Encoding Scheme 

From a ham radio perspective, the frequency range of these avian vocalizations is below 4 kHz in most cases.  This makes it possible to use existing SSB or FM transceivers without any modifications, other than perhaps adjusting the filter bandwidth available in modern rigs.  The audio sampling rate used in this project was 8 kHz, so the original audio source files were re-sampled using a Linux command line tool:

sox input.wav -b 16 -c 1 output.wav rate 8000

The encoding scheme for the CT8 mode was created by collecting various free audio sources of chicken sounds and carefully assembling vowels, plosives, fricatives and nasals using this resource as the model. The free open source cross-platform audio software Audacity was used to extract vocalizations using the spectrogram view and to create labeled audio files.
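Audacity exports label tracks as plain tab-separated text (start time, end time, label), so the labels are easy to read back programmatically. A minimal sketch, with an illustrative file name:

# Sketch: parse an Audacity label track export
# (tab-separated lines: start_seconds<TAB>end_seconds<TAB>label)
def read_labels(fname):
    labels = []
    with open(fname) as f:
        for line in f:
            start, end, text = line.rstrip("\n").split("\t")
            labels.append((float(start), float(end), text))
    return labels

for start, end, text in read_labels("ct8_labels.txt"):
    print(f"{start:8.3f} {end:8.3f} {text}")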

Figure 3. below shows a sample audio file with assigned character labels. 

Fig 3. Labeled vocalizations using Audacity software

CT8 Software

The encoder software is written in C++ and Python and runs on Windows, macOS, and Linux.  The sample decoder will be made available on Github as open source software if there is enough interest in this novel communication mode from the ham radio community.

For the CT8 decoder, a machine learning based decoder was built on top of the open source Tensorflow framework.  The decoder was trained on short 4-second audio clips; in the experiments, a character error rate of 0.1% and a word accuracy of 99.5% were achieved.  With more real-world training material the ML model is expected to achieve even better decoding accuracy.

Future Enhancements

CT8 opens a new era for ham radio communication protocol development using biomimetic principles.  Adding new phonemes using the principles of ecological signals described in article [2] can open up things like a "DX mode" for long distance communication.  For example, the vocalizations of cetaceans (whales) could also be used to build a new phoneme map for DX contacts - some of the lowest frequency whale sounds can travel through the ocean as far as 10,000 miles without losing their energy.


73  de AG1LE 


PS. If you made it down here, I hope that you enjoyed this figment of my imagination and I wish you a very happy April 1st.


References

[1] Nicholas E. Collias,  Vocal Signals of the Village Weaver: A Spectrographic Key and the Communication Code

[2]  Kevin G. McCracken and Frederick H. Sheldon, Avian vocalizations and phylogenetic signal

[3] Wayne Daley, et al Identifying rale sounds in chickens using audio signals for early disease detection in poultry

[4] Scientific American, Ferris Jabr, Fowl Language: AI Decodes the Nuances of Chicken “Speech”


Sunday, April 12, 2020

New real-time deep learning Morse decoder


Introduction

I have done some experiments with deep learning models previously. This previous blog post covers a new approach to building a Morse decoder: training a CNN-LSTM-CTC model using audio that is converted to small image frames.

In this latest experiment I trained a new Tensorflow based CNN-LSTM-CTC model using a 27.8 hour Morse audio training set (25,000 WAV files - each clip 4 seconds) and achieved a character error rate of 1.5% and a word accuracy of 97.2% after 2:29:19 of training time. The training data corpus was created from ARRL Morse code practice files (text files).

New real-time deep learning Morse decoder

I wanted to see if this new model is capable of decoding audio in real time, so I wrote a simple Python script that listens to the microphone, creates a spectrogram, detects the CW frequency automatically, and feeds 128 x 32 images to the model to perform the decoding inference.

With some tuning of the various components and parameters I was able to put together a working prototype using standard Python libraries and the Tensorflow Morse decoder that is available as open source in Github.

I recorded this sample YouTube video below in order to document this experiment.

Starting from the top left, I have an FLDIGI window open decoding CW at 30 WPM. In the top middle I have a console window printing the frame number and CW tone frequency, followed by "infer_image:", the decoded text, and the probability that the model assigns to this result.

On the top right I have the Spectrogram window that plots 4 seconds of the audio on a frequency scale.  The Morse code is quite readable on this graph.

On the bottom left I have Audacity playing a sample 30 WPM practice file from ARRL. Finally, on the bottom right is the 128x32 image frame that I am feeding to the model.

Analysis

The full text at 30 WPM is here - I have highlighted the text section that is playing in the above video clip.

NOW 30 WPM - TEXT IS FROM JULY 2015 QST PAGE 99

AGREEMENT WITH SOUTHCOM GRANTED ATLAS ACCESS TO THE SC 130S TECHNOLOGY.
THE ATLAS 180 ADAPTED THE MAN PACK RADIOS DESIGN FOR AMATEUR USE.  AN
ANALOG VFO FOR THE 160, 80, 40, AND 20 METER BANDS REPLACED THE SC 130S
STEP TUNED 2 12 MHZ SYNTHESIZER.  OUTPUT POWER INCREASED FROM 20 W TO 100
W.  AMONG THE 180S CHARMS WAS ITS SIZE.  IT MEASURED 9R5 X 9R5 X 3 INCHES.
THATS NOTHING SPECIAL TODAY, BUT IT WAS A TINY RIG IN 1974.  THE FULLY
SOLID STATE TRANSCEIVER FEATURED NO TUNE OPERATION.  THE VFOS 350 KHZ RANGE
REQUIRED TWO BAND SWITCH SEGMENTS TO COVER 75/80 METERS, BUT WAS AMPLE FOR
THE OTHER BANDS.  IN ORDER TO IMPROVE IMMUNITY TO OVERLOAD AND CROSS
MODULATION, THE 180S RECEIVER HAD NO RF AMPLIFIER STAGE THE ANTENNA INPUT
CIRCUIT FED THE RADIOS MIXER DIRECTLY.  A PAIR OF SUCCESSORS EARLY IN 1975,
ATLAS INTRODUCED THE 180S SUCCESSOR IN REALITY, A PAIR OF THEM.  THE NEW
210 COVERED 80 10 METERS, WHILE THE OTHERWISE IDENTICAL 215 COVERED 160 15
METERS HEREAFTER, WHEN THE 210 SERIES IS MENTIONED, THE 215 IS ALSO
IMPLIED.  BECAUSE THE 210 USED THE SAME VFO AND BAND SWITCH AS THE 180,
SQUEEZING IN FIVE BANDS SACRIFICED PART OF 80 METERS.  THAT BAND STARTED AT
END OF 30 WPM TEXT - QST DE W1AW

As can be seen from the YouTube video, FLDIGI is able to copy this CW quite well.  The new deep learning Morse decoder is also able to decode the audio, with probabilities ranging from 4% to over 90% during this period.

It has visible problems when the current image frame cuts a Morse character into parts. The scrolling 128x32 image that is produced from the spectrogram graph does not have any smarts - it is just copied at every update cycle and fed into the infer_image() function. This means that when a single Morse character is moving out of the frame, part of the character can still be visible, causing incorrect decodes.

The decoder also has problems with some numbers, even when they are fully visible in the 128x32 image frame.  In the ARRL training material that I used to build the training corpus, about 8.6% of the words are numbers (such as bands, frequencies and years).  I believe that the current model doesn't have enough examples to decode all the numbers correctly.

The final problem is the lack of spaces between the words. The current model doesn't know about the "space" character, so it just decodes what it has been trained on.


Software

The Python script running the model is quite simple and is listed below. I adapted the main spectrogram loop from this Github repo.  I used the following constants in mic_read.py:

RATE = 8000
FORMAT = pyaudio.paInt16 #conversion format for PyAudio stream
CHANNELS = 1 #microphone audio channels
CHUNK_SIZE = 8192 #number of samples to take per read
SAMPLE_LENGTH = int(CHUNK_SIZE*1000/RATE) #length of each sample in ms


specgram.py

"""
Created by Mauri Niininen (AG1LE)
Real time Morse decoder using CNN-LSTM-CTC Tensorflow model

adapted from https://github.com/ayared/Live-Specgram

"""
############### Import Libraries ###############
from matplotlib.mlab import specgram
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
import cv2


############### Import Modules ###############
import mic_read
from morse.MorseDecoder import  Config, Model, Batch, DecoderType


############### Constants ###############
SAMPLES_PER_FRAME = 4 #Number of mic reads concatenated within a single window
nfft = 256 # NFFT value for spectrogram
overlap = nfft-56 # overlap value for spectrogram
rate = mic_read.RATE #sampling rate


############### Call Morse decoder ###############
def infer_image(model, img):
    if img.shape == (128, 32):
        batch = Batch(None, [img])
        (recognized, probability) = model.inferBatch(batch, True)
        return img, recognized, probability
    else:
        print(f"ERROR: img shape:{img.shape}")
        return img, None, 0.0

# Load the Tensorlow model 
config = Config('model.yaml')
model = Model(open("morseCharList.txt").read(), config, decoderType = DecoderType.BestPath, mustRestore=True)

stream,pa = mic_read.open_mic()


############### Functions ###############
"""
get_sample:
gets the audio data from the microphone
inputs: audio stream and PyAudio object
outputs: int16 array
"""
def get_sample(stream,pa):
    data = mic_read.get_data(stream,pa)
    return data
"""
get_specgram:
takes the FFT to create a spectrogram of the given audio signal
input: audio signal, sampling rate
output: 2D Spectrogram Array, Frequency Array, Bin Array
see matplotlib.mlab.specgram documentation for help
"""
def get_specgram(signal,rate):
    arr2D,freqs,bins = specgram(signal,window=np.blackman(nfft),  
                                Fs=rate, NFFT=nfft, noverlap=overlap,
                                pad_to=32*nfft   )
    return arr2D,freqs,bins

"""
update_fig:
updates the image, just adds on samples at the start until the maximum size is
reached, at which point it 'scrolls' horizontally by determining how much of the
data needs to stay, shifting it left, and appending the new data. 
inputs: iteration number
outputs: updated image
"""
def update_fig(n):
    data = get_sample(stream,pa)
    arr2D,freqs,bins = get_specgram(data,rate)
    
    im_data = im.get_array()
    if n < SAMPLES_PER_FRAME:
        im_data = np.hstack((im_data,arr2D))
        im.set_array(im_data)
    else:
        keep_block = arr2D.shape[1]*(SAMPLES_PER_FRAME - 1)
        im_data = np.delete(im_data,np.s_[:-keep_block],1)
        im_data = np.hstack((im_data,arr2D))
        im.set_array(im_data)

    # Get the image data array shape (Freq bins, Time Steps)
    shape = im_data.shape

    # Find the CW spectrum peak - look across all time steps
    f = int(np.argmax(im_data[:])/shape[1])

    # Create a 32x128 array centered to spectrum peak 
    if f > 16: 
        print(f"n:{n} f:{f}")
        img = cv2.resize(im_data[f-16:f+16][0:128], (128,32)) 
        if img.shape == (32,128):
            cv2.imwrite("dummy.png",img)
            img = cv2.transpose(img)
            img, recognized, probability = infer_image(model, img)
            if probability > 0.0000001:
                print(f"infer_image:{recognized} prob:{probability}")
    return im,

def main():
    
    global im
    ############### Initialize Plot ###############
    fig = plt.figure()
    """
    Launch the stream and the original spectrogram
    """
    stream,pa = mic_read.open_mic()
    data = get_sample(stream,pa)
    arr2D,freqs,bins = get_specgram(data,rate)
    """
    Setup the plot paramters
    """
    extent = (bins[0],bins[-1]*SAMPLES_PER_FRAME,freqs[-1],freqs[0])
    
    im = plt.imshow(arr2D,aspect='auto',extent = extent,interpolation="none",
                    cmap = 'Greys',norm = None) 

    plt.xlabel('Time (s)')
    plt.ylabel('Frequency (Hz)')
    plt.title('Real Time Spectrogram')
    plt.gca().invert_yaxis()
    #plt.colorbar() #enable if you want to display a color bar

    ############### Animate ###############
    anim = animation.FuncAnimation(fig,update_fig,blit = True,
                                interval=mic_read.CHUNK_SIZE/1000)

                                
    try:
        plt.show()
    except:
        print("Plot Closed")

    ############### Terminate ###############
    stream.stop_stream()
    stream.close()
    pa.terminate()
    print("Program Terminated")

if __name__ == "__main__":
    main()

I ran this experiment on a MacBook Pro (2.2 GHz Quad-Core Intel Core i7) with MacOS Catalina 10.15.3.  The Python version used was Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31)  [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin.

Conclusions

This experiment demonstrates that it is possible to build a working real-time Morse decoder based on a deep learning Tensorflow model, even using a slow interpreted language like Python.  The approach taken here is quite simplistic and lacks some key functionality, such as alignment of the decoded text to the audio timeline.

It also shows that there is still more work to do in order to build a fully functioning, open source, high performance Morse decoder.  A better event-driven software architecture would allow building a proper user interface with some controls, like audio filtering.  Such an architecture would also enable server-side decoders running on audio feeds from WebSDR receivers, etc.

Finally, the Tensorflow model in this experiment has a very small training set, only 27.8 hours of audio.  Commercial ASR (automatic speech recognition) engines are trained using over 1000x more labeled audio material.  To get better performance from deep learning models you need a lot of high quality labeled training material that matches the typical sound environment the model will be used in.


73
Mauri AG1LE


Saturday, July 13, 2019

DeepMorse - Web based tool for CW Puzzles and Training

Introduction 


I started working on a new project recently.  The idea behind this "DeepMorse" project is to create a web site that contains curated Morse code audio clips.  The website would allow subscribers to upload annotated CW audio clips (MP3, WAV, etc.) and associated metadata.

As a subscriber you would be able to provide the story behind the clip as well as some commentary or even photos. After uploading, the site would show a graphical view of the audio clip, much like modern Software Defined Radios (SDRs) do, and users would be able to play back the audio and see the metadata.

Since this site would contain "real world" recordings, including some really difficult to copy audio clips, it would also provide the ultimate test of your CW copying skills. The system would save a score of your copying accuracy before it gives you the "ground truth" of the annotated audio.  You could compete for the top scores with all the other CW aficionados.

The site could also be used to share historical records of curated Morse code audio materials with the ham radio community. For CW newbies the site would have a treasure trove of different kinds of training materials for when you get tired of listening to ARRL Morse practice MP3 files.  Experienced CW operators could share some of their best moments working their favorite operating mode, teaching newbies how to catch the "big fish".

User Interface

I wanted to experiment with combining the audio and a graphical waveform view of the audio, giving the user the ability to listen, scroll back and re-listen, as well as zoom into the waveform.

Part of the user interface is also a free text form where the user can enter the text they heard in the audio clip.  By pressing the "Check" button the system calculates the accuracy compared to the "ground truth" text.  The system uses a normalized Levenshtein method to calculate the accuracy as a percentage (0...100%), where 100% is a perfect copy.
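The scoring can be sketched in a few lines of Python (a minimal illustration of the method, not the actual DeepMorse code):

# Minimal sketch of normalized Levenshtein scoring: 100% is a perfect copy.
def levenshtein(a, b):
    # classic dynamic programming edit distance, two rows at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def accuracy(user_text, truth):
    dist = levenshtein(user_text.upper(), truth.upper())
    return max(0.0, 100.0 * (1 - dist / max(len(truth), 1)))

print(accuracy("HELLO WORLD", "HELLO WORLD"))   # 100.0
print(accuracy("HELLO WORD", "HELLO WORLD"))    # about 90.9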

Figure 1. below shows the main listening view.
Figure 1. DeepMorse User Interface


Architecture

I wrote this web application using the Python Django web framework, and it took only a few nights to get the basic structure together.  The website is running in AWS using serverless Lambda functions and a serverless Aurora RDS MySQL database.  The audio files are stored in an S3 bucket.

Using a serverless database backend sounds like an oxymoron, since there is a database server managed by AWS.  It also brings some challenges, such as a slow "cold start" that is visible to end users. When you click the "Puzzles" menu you normally get this view (see Figure 2. below).

Figure 2. Puzzles View 

However, if the serverless database server has timed out due to inactivity, it takes more than 30 seconds to come up.  By this time the front end web server has also timed out, and the user sees the message below instead (see Figure 3.).  A simple refresh of the browser fixes the situation, and both the front end and the backend are then available.

Figure 3.  Serverless "Time Out" error message

So what, then, is the benefit of using AWS serverless technology?  The benefit is that you get billed only for usage; if the application is not used 24x7, this means significant cost savings. For a hobby project like DeepMorse I am able to run the service very cost efficiently.

The other benefit of serverless technologies is automatic scaling - if the service suddenly becomes hugely popular, the system is able to scale up rapidly.

Next Steps

I am looking for feedback from early users, trying to figure out what features might be interesting for Morse code aficionados.

73 de Mauri 
AG1LE


Sunday, February 10, 2019

Performance characteristics of the ML Morse Decoder

In my previous blog post I described a new Tensorflow based machine learning model that learns Morse code from annotated audio .WAV files with an 8 kHz sample rate.

In order to evaluate the decoding accuracy on noisy audio source files, I created a set of training & validation materials with signal-to-noise ratios from -22 dB to +30 dB.  The target SNR_dB was applied using the following Python code:

        # morsecode is a numpy array containing the clean Morse code signal
        # Desired linear SNR
        SNR_linear = 10.0**(SNR_dB/10.0)

        # Measure power of signal - assume zero mean 
        power = morsecode.var()

        # Calculate required noise power for desired SNR
        noise_power = power/SNR_linear

        # Generate noise with calculated power (mu=0, sigma=1)
        noise = np.sqrt(noise_power)*np.random.normal(0,1,len(morsecode))

        # Add noise to signal
        morsecode = noise + morsecode

These audio .WAV files contain random words with a maximum of 5 characters - 5000 samples at each SNR level, with 95% used for training and 5% for validation. The Morse speed in each audio sample was randomly selected from 25 WPM or 30 WPM.

The training was performed until 5 consecutive epochs did not improve the character error rate. The duration of these training sessions varied from 15 to 45 minutes on a MacBook Pro with a 2.2 GHz Intel Core i7 CPU.

I captured and plotted the character error rate (CER) and signal-to-noise ratio (SNR) of each completed training and validation session.  The following graph shows that the Morse decoder performs quite well down to about the -12 dB SNR level; below that, the decoding accuracy drops fairly dramatically.

CER vs. SNR graph

To show how noisy these files are, here are some random samples - the first 4 seconds of each 8 kHz audio file are demodulated, filtered using a 25 Hz 3rd order Butterworth filter, and decimated by 125 to fit into a (128,32) vector. These vectors are shown as grayscale images below:

-6 dB SNR

-11 dB SNR

-13 dB SNR

-16 dB SNR

Conclusions

The Tensorflow model appears to perform quite well on decoding noisy audio files, at least when the training set and validation set have the same SNR level.  

The next experiments could include more variability, with a much bigger training dataset that combines different SNRs, Morse speeds and other variables.  The training duration depends on the amount of training data, so it can take a while to perform these larger scale experiments on a home computer.

73 de
Mauri AG1LE 



Saturday, February 2, 2019

Training a Computer to Listen and Decode Morse Code

Abstract

I trained a Tensorflow based CNN-LSTM-CTC model with a 5.2 hour Morse audio training set (5000 files) and achieved a character error rate of 0.1% and a word accuracy of 99.5%.  I tested the model with audio files containing various levels of noise and found that it decodes relatively accurately down to the -3 dB SNR level.

Introduction

Decoding Morse code from audio signals is not a novel idea. The author has written many different software decoder implementations that use simplistic models to convert a sequence of "dits" and "dahs" to the corresponding text.  When the audio signal is noise free and there is no interference, these simplistic methods work fairly well and produce nearly error free decoding.  Figure 1. below shows "Hello World" with a 35 dB signal-to-noise ratio, which most conventional decoders have no problems decoding.

"Hello World" with 30 dB SNR 

Figure 2 below shows the same "Hello World" but with a -12 dB signal-to-noise ratio, using exactly the same process as above to extract the demodulated envelope. Humans can still hear and even recognize the Morse code faintly in the noise, but computers equipped with these simplistic models have great difficulty decoding anything meaningful out of this signal.  In ham radio terms the difference of 47 dB corresponds roughly to eight S units - human ears & brain can still decode S2 level signals, whereas conventional software based Morse decoders produce mostly gibberish.

"Hello World" with -12 dB SNR 

New Approach - Machine Learning

I have been quite interested in machine learning (ML) technologies for a while.  From a software development perspective, ML is changing the paradigm of how we process data.

In traditional programming we look at the input data and try to write a program that uses some processing steps to come up with the output data. Depending on the complexity of the problem, a software developer may need to spend quite a long time coming up with the correct algorithms to produce the right output data.  From a Morse decoder perspective this is how most decoders work:  they take input audio data that contains the Morse signals, and after many complex operations the correct decoded text appears on the screen.

Machine learning changes this paradigm. As an ML engineer you curate a dataset that has a representative selection of input data with corresponding output data (also known as label data).  The computer then applies a training algorithm to this dataset that eventually discovers the correct "program" - the ML model providing the best matching function that can infer the correct output, given the input data.

See Figure 3. that tries to depict this difference between traditional programming and the new approach with Machine Learning.
Programming vs. Machine Learning

So what does this new approach mean in practice?  Instead of trying to figure out ever more complex software algorithms to improve your data processing and decoding accuracy, you can select from standard machine learning algorithms that are available in open source packages like Tensorflow, and focus on building a neural network model and curating a large dataset to train this model. The trained model can then be used to do the decoding from the input audio data. This is exactly what I did in the following experiment.

I took a Tensorflow implementation of handwritten text recognition created by Harald Scheidl [3], which he has posted on Github as an open source project.  He has provided excellent documentation on how the model works, as well as references to the IAM dataset that he uses for training the handwritten text recognition.

Why would a model created for  handwritten text recognition work for Morse code recognition?

It turns out that the Tensorflow standard learning algorithms used for handwriting recognition are very similar to ones used for speech recognition.

The figures below are from Hannun, "Sequence Modeling with CTC", Distill, 2017. In the article Hannun [2] shows that the (x,y) coordinates of a pen stroke or the pixels in an image can be recognized as text, just like the spectrogram of speech audio signals.  Morse code has similar properties to speech - the speed can vary a lot, and hand-keyed code can have unique rhythm patterns that make it difficult to align signals to decoded text. The common theme is that we have some variable length input data that needs to be aligned with variable length output data.  The algorithm that comes with Tensorflow is called Connectionist Temporal Classification (CTC) [1].


Morse Dataset

The Morse code audio file can be easily converted to a representation that is suitable as input data for these neural networks.  I am using single track (mono) WAV files with 8 kHz sampling frequency.

The following few lines of Python code take a 4-second sample from an existing WAV audio file, find the signal peak frequency, demodulate and decimate the data so that we get a (1,256) vector, which we resize to (128,32) and write into a PNG file.

import numpy as np
import cv2
from scipy.io import wavfile
from scipy.signal import periodogram, butter, filtfilt
# peakdet is a peak-detection helper (e.g. Eli Billauer's peakdet.py), assumed available locally

def find_peak(fname):
    # Find the signal frequency and maximum value
    Fs, x = wavfile.read(fname)
    f,s = periodogram(x, Fs,'blackman',8192,'linear', False, scaling='spectrum')
    threshold = max(s)*0.9  # only 0.4 ... 1.0 of max value freq peaks included
    maxtab, mintab = peakdet(abs(s[0:int(len(s)/2-1)]), threshold,f[0:int(len(f)/2-1)] )

    return maxtab[0,0]

def demodulate(x, Fs, freq):
    # demodulate audio signal with known CW frequency 
    t = np.arange(len(x))/ float(Fs)
    mixed =  x*((1 + np.sin(2*np.pi*freq*t))/2 )

    #calculate envelope and low pass filter this demodulated signal
    #filter bandwidth impacts decoding accuracy significantly 
    #for high SNR signals 40 Hz is better, for low SNR 20Hz is better
    # 25Hz is a compromise - could this be made an adaptive value?
    low_cutoff = 25. # 25 Hz cut-off for lowpass
    wn = low_cutoff/ (Fs/2.)    
    b, a = butter(3, wn)  # 3rd order butterworth filter
    z = filtfilt(b, a, abs(mixed))
    
    # decimate and normalize
    decimate = int(Fs/64) # 8000 Hz / 64 = 125 Hz => 8 msec / sample 
    o = z[0::decimate]/max(z)
    return o

def process_audio_file(fname, x, y, tone):
    Fs, signal = wavfile.read(fname)
    dur = len(signal)/Fs
    o = demodulate(signal[(Fs*(x)):Fs*(x+y)], Fs, tone)
    return o, dur

filename = "error.wav"
tone = find_peak(filename)
o,dur = process_audio_file(filename,0,4, tone)
im = o[0::1].reshape(1,256)
im = im*256.

img = cv2.resize(im, (128, 32), interpolation = cv2.INTER_AREA)
cv2.imwrite("error.png",img)

Here is the resulting PNG image - it contains  "ERROR M". The labels are kept in a file that contains also the corresponding audio file name.

4 second audio sample converted to a (128,32) PNG file

It is very easy to produce a lot of training and validation data with this method. The important part is that each audio file must have an accurate "label" - the textual representation of the Morse audio content.

I created a small Python script to produce this kind of Morse training and validation dataset. With a few parameters you can generate as much data as you want, with different speeds and noise levels; a simplified sketch of the idea is shown below.
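The sketch is an illustrative reconstruction, not the actual generator script - the abbreviated code table, helper names and default parameters are assumptions, with standard PARIS Morse timing:

# Sketch: synthesize a labeled Morse WAV file from a text string.
import numpy as np
from scipy.io import wavfile

# abbreviated code table for illustration; the real generator covers the full character set
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---", "W": ".--", "R": ".-.", "D": "-.."}

def morse_wav(text, fname, wpm=25, f_code=600, Fs=8000, SNR_dB=40):
    dit = 1.2 / wpm                                   # PARIS timing: dit duration in seconds
    tone = lambda d: np.sin(2 * np.pi * f_code * np.arange(int(d * Fs)) / Fs)
    gap = lambda d: np.zeros(int(d * Fs))
    parts = []
    for ch in text.upper():
        if ch == " ":
            parts.append(gap(4 * dit))                # extends the character gap to a 7-dit word gap
            continue
        for sym in MORSE[ch]:
            parts.append(tone(dit if sym == "." else 3 * dit))
            parts.append(gap(dit))                    # 1-dit gap between elements
        parts.append(gap(2 * dit))                    # completes a 3-dit gap between characters
    signal = np.concatenate(parts)
    # add white noise to reach the desired SNR
    noise_power = signal.var() / (10.0 ** (SNR_dB / 10.0))
    signal = signal + np.sqrt(noise_power) * np.random.normal(0, 1, len(signal))
    signal = signal / np.abs(signal).max()
    wavfile.write(fname, Fs, (signal * 32767).astype(np.int16))

morse_wav("HELLO WORLD", "helloworld.wav")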

Model

I used Harald's model to start the Morse decoding experiments. 

The model consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer. The illustration below gives an overview of the NN (green: operations, pink: data flowing through NN) and here follows a short description:
  • The input image is a gray-value image and has a size of 128x32
  • 5 CNN layers map the input image to a feature sequence of size 32x256
  • 2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32x80. Each matrix-element represents a score for one of the 80 characters at one of the 32 time-steps
  • The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
  • Batch size is set to 50
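As a rough sketch of this architecture, here is an illustrative Keras-style reconstruction (not Harald's actual code - the filter counts and pooling steps are assumptions chosen to produce the 32x256 feature sequence described above):

import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(128, 32, 1), name="image")       # gray-value input image, 128x32
x = inp
# 5 CNN layers map the image to a feature sequence of size 32x256
for filters, pool in [(32, (2, 2)), (64, (2, 2)), (128, (1, 2)), (128, (1, 2)), (256, (1, 2))]:
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool)(x)
x = layers.Reshape((32, 256))(x)                           # 32 time steps, 256 features each
# 2 (bidirectional) LSTM layers with 256 units propagate information through the sequence
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
out = layers.Dense(80, activation="softmax")(x)            # a score for each of 80 characters per time step
model = Model(inp, out)
# training minimizes the CTC loss between these per-step character scores and the
# ground-truth text; inference decodes them with best path or beam search decoding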

It is not hard to imagine making some changes to the model to allow longer audio clips to be decoded. Right now the limit is about 4 seconds of audio converted to a (128x32) input image.  Harald actually provides details of a model that can handle a larger input image (800x64) and output up to 100 character strings.

Experiment

Here are parameters I used for this experiment:

  • 5000 samples, split into training and validation set: 95% training - 5% validation
  • Each sample has 2 random words, max word length is 5 characters
  • Morse speed randomly selected from  [20, 25, 30] words-per-minute  
  • Morse audio SNR: 40 dB 
  • batchSize: 100  
  • imgSize: [128,32] 
  • maxTextLen: 32
  • earlyStopping: 20 

Training time was 1 hr 51 mins on a MacBook Pro with a 2.2 GHz Intel Core i7.
The training curves of character error rate, word accuracy and loss after 50 epochs were the following:


Training over 50 epochs

The best character error rate was 14.9% and word accuracy was 36.0%.  These are not great numbers. The reason was that my training data contained 2 words in each sample; in many cases this was too many characters to fit in the 4-second time window, so the training algorithm often never saw the second word in the training material.

I re-ran the experiment with 5000 samples, but with just one word in each sample.  It took 54 mins 7 seconds to do this training.  The new parameters are below:

model:
    # model constants
    batchSize: 100  
    imgSize: !!python/tuple [128,32] 
    maxTextLen: 32
    earlyStopping: 5

morse:
    fnTrain:    "morsewords.txt"
    fnAudio:    "audio/"
    count:      5000
    SNR_dB:     
      - 20
      - 30
      - 40
    f_code:     600
    Fs:         8000
    code_speed: 
      - 30
      - 25
      - 20
    length_N:   65000
    play_sound: False
    word_max_length: 5
    words_in_sample: 1

experiment:
    modelDir:   "model/"
    fnAccuracy: "model/accuracy.txt"
    fnTrain:    "model/morsewords.txt"
    fnInfer:    "model/test.png"
    fnCorpus:   "model/corpus.txt"
    fnCharList: "model/charList.txt"


Here is the outcome of that second training session:

Total training time was 0:54:07.857731
Character error rate:  0.1%. Word accuracy: 99.5%.

Training over 33 epochs

With a larger dataset the training will take longer. One possibility would be to use AWS cloud computing service to accelerate the training for a much larger dataset. 

Note that the model did not know anything about Morse code at the start. It learned the character set, the structure of the Morse code, and the words just by "listening" through the provided sample files. This is approximately 5.3 hours of Morse code audio material with random words (5000 files * 95% * 4 sec/file = 19,000 seconds).

It would be great to get some comparative data on how quickly humans will learn to produce similar character error rate. 

Results

I created a small "helloworld.wav" audio file with HELLO WORLD text at 25 WPM at different signal-to-noise ratios (-6, -3, +6, +50 dB) to test the first model.

Attempting to decode the content of the audio file, I got the following results.  Given that the training was done with +40 dB samples, I was quite surprised to see relatively good decoding accuracy. The model also provides a probability for how confident it is in the result; these values vary between 0.4% and 5.7%.


File: -6 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:40:51.970393: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.00420194] texts:['HELL Q PE'] 
Recognized: "HELL Q PE"
Probability: 0.00420194

['HELL Q PE']

-6 dB HELLO WORLD

File: -3 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:36:32.838156: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.05750186] texts:['HELLO WOE'] 
Recognized: "HELLO WOE"
Probability: 0.0575019

['HELLO WOE']
-3 dB HELLO WORLD

File: +6 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:38:57.549928: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.03523131] texts:['HELLO WOT'] 
Recognized: "HELLO WOT"
Probability: 0.0352313
['HELLO WOT']

+6 dB HELLO WORLD

File: +50 dB SNR 
python MorseDecoder.py -f audio/helloworld.wav 
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:42:55.403738: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
inferBatch: probs:[ 0.03296029] texts:['HELLO WOT'] 
Recognized: "HELLO WOT"
Probability: 0.0329603
['HELLO WOT']
+50 dB HELLO WORLD

For comparison, I took one file that was used in the training process. This file contains "HELLO HERO" text at +40 dB SNR. Here is what the decoder was able to decode - with a much higher probability of 51.8%.

File: +40 dB SNR 

python MorseDecoder.py -f audio/6e753ac57d4849ef87d5146e158610f0.wav
Validation character error rate of saved model: 15.4
Python: 2.7.10 (default, Aug 17 2018, 19:45:58) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)]
Tensorflow: 1.4.0
2019-02-02 22:53:27.029448: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Init with stored values from model/snapshot-22
inferBatch: probs:[ 0.51824665] texts:['HELLO HERO'] 
Recognized: "HELLO HERO"
Probability: 0.518247
['HELLO HERO']
+40 dB HELLO HERO

Conclusions

This is my first machine learning experiment where I used Morse audio files for both training and validation of the model.  The current model limitation is that only 4-second audio clips can be used.  However, it is very feasible to build a larger model that can decode longer audio clips with a single inference operation.  It would also be possible to feed a longer audio file in 4-second pieces to get decoding across the whole file.

This Morse decoder doesn't have a single line of code that explicitly spells out the Morse codebook.  The model literally learned from the training data what Morse code is and how to decode it.  It represents a new paradigm in building decoders, using similar technology to what companies like Google, Microsoft, Amazon and Apple use for their speech recognition products.

I hope this experiment demonstrates to the ham radio community how to build high quality, open source Morse decoders using a simple, standards based ML architecture.  With more computing capacity and larger training / validation datasets containing accurately annotated (labeled) audio files, it is now feasible to build a decoder that will surpass the accuracy of conventional decoders (like the one in the FLDIGI software).

73  de Mauri
AG1LE

Software and Instructions

The initial version of the software is available on Github - see here.

Using from the command line:

python MorseDecoder.py -h
usage: MorseDecoder.py [-h] [--train] [--validate] [--generate] [-f FILE]

optional arguments:
  -h, --help  show this help message and exit
  --train     train the NN
  --validate  validate the NN
  --generate  generate a Morse dataset of random words
  -f FILE     input audio file


To get started you need to generate audio training material. The count variable in the model.yaml config file tells how many samples will be generated. The default is 5000.

python MorseDecoder.py --generate


Next you need to perform the training. You need to have "audio/", "image/" and "model/" subdirectories in the folder where you are running the program.

python MorseDecoder.py --train


The last thing to do is to validate the model:

python MorseDecoder.py --validate

To have the model decode a file you should use:

python MorseDecoder.py -f audio/myfilename.wav 




Config file model.yaml  (first training session):
model:
    # model constants
    batchSize: 100  
    imgSize: !!python/tuple [128,32] 
    maxTextLen: 32
    earlyStopping: 20 

morse:
    fnTrain:    "morsewords.txt"
    fnAudio:    "audio/"
    count:      5000
    SNR_dB:     20
    f_code:     600
    Fs:         8000
    code_speed: 30
    length_N:   65000
    play_sound: False
    word_max_length: 5
    words_in_sample: 2

experiment:
    modelDir:   "model/"
    fnAccuracy: "model/accuracy.txt"
    fnTrain:    "model/morsewords.txt"
    fnInfer:    "model/test.png"
    fnCorpus:   "model/corpus.txt"
    fnCharList: "model/charList.txt"

Config file model.yaml  (second training session):
model:
    # model constants
    batchSize: 100  
    imgSize: !!python/tuple [128,32] 
    maxTextLen: 32
    earlyStopping: 5

morse:
    fnTrain:    "morsewords.txt"
    fnAudio:    "audio/"
    count:      5000
    SNR_dB:     
      - 20
      - 30
      - 40
    f_code:     600
    Fs:         8000
    code_speed: 
      - 30
      - 25
      - 20
    length_N:   65000
    play_sound: False
    word_max_length: 5
    words_in_sample: 1

experiment:
    modelDir:   "model/"
    fnAccuracy: "model/accuracy.txt"
    fnTrain:    "model/morsewords.txt"
    fnInfer:    "model/test.png"
    fnCorpus:   "model/corpus.txt"
    fnCharList: "model/charList.txt"

References

[1]  A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376. https://www.cs.toronto.edu/~graves/icml_2006.pdf
[2]  Hannun, "Sequence Modeling with CTC", Distill, 2017.  https://distill.pub/2017/ctc/
[3] Harald Scheidl "Handwritten Text Recognition with TensorFlow", https://github.com/githubharald/SimpleHTR
