Voice App Basics
Below, you can see a high-level illustration of how a voice app works:
The user first converses with a smart speaker (or another voice-assistant-enabled device) and prompts it to open your app with a certain speech input; this step is also called invocation. The voice platform recognizes this and sends a request to your app. It is now up to you to use this request to craft a response, which is returned to the platform's API. The response is then turned into speech output (what your user hears), which can consist of both text-to-speech and audio files.
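To make this tangible, here is a minimal sketch of such a request handler, assuming a Jovo (v3) Node.js project with the Alexa and Google Assistant platform integrations installed:

```javascript
// app.js — a minimal sketch of the request/response cycle with Jovo (v3).
// The platform sends a request, the handler below crafts a response,
// and Jovo translates it back into the platform-specific JSON.
const { App } = require('jovo-framework');
const { Alexa } = require('jovo-platform-alexa');
const { GoogleAssistant } = require('jovo-platform-googleassistant');

const app = new App();
app.use(new Alexa(), new GoogleAssistant());

app.setHandler({
    // Called when the user invokes the app ("Alexa, open ...")
    LAUNCH() {
        // tell() closes the session; the string becomes the speech output
        this.tell('Hello! Welcome to your first voice app.');
    },
});

module.exports = { app };
```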
In the next section, we will go a little deeper into the underlying principles of natural language processing and understanding:
In order to find out what a user wants when they're talking to your app, platforms like Alexa or Google Assistant do a lot of underlying work for you to interpret the natural language of a user's voice input. To build for these platforms, it's important to understand a few elements of natural language understanding. Simplified, a language model can be divided into "what the user wants" (intent) and "what the user says" (utterances and specific slots).
In the example above, possible sentences a user could say can be grouped into a 'FindRestaurantIntent', while there are potential slots like restaurant type. Please note that this is a very simple example; other slot types like price range could also be considered.
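As an illustration, such an intent could be defined in a Jovo Language Model file (e.g. models/en-US.json) roughly like this; the intent name, phrases, and slot types below are hypothetical examples:

```json
{
  "invocation": "restaurant finder",
  "intents": [
    {
      "name": "FindRestaurantIntent",
      "phrases": [
        "find a {type} place",
        "i am looking for {type}",
        "where can i get {type} nearby"
      ],
      "inputs": [
        {
          "name": "type",
          "type": {
            "alexa": "AMAZON.Food",
            "dialogflow": "@sys.any"
          }
        }
      ]
    }
  ]
}
```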
In natural language processing (NLP), an intent is something users want to achieve when they are conversing with technology. When developing your voice app, you can create intents to handle different user needs and interactions.
Find more detailed information here: App Logic > Routing.
Utterances are all the potential phrases someone could use to express what they want. Having a good set of utterances increases your chances of correctly reacting to a user's input.
Find more detailed information here: App Configuration > Jovo Language Model.
Slots (or entities, the wording differs across NLP platforms) describe a specific, variable element in a set of utterances. For example, your intent could be to find a restaurant, but you could search for a pizza place, sushi, or even something cheap or very close by.
Find more detailed information here: App Logic > Data.
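Tying these elements together, a handler for the hypothetical FindRestaurantIntent above could read the slot value from the request roughly like this (a Jovo v3 sketch extending the app.js example shown earlier, not a complete app):

```javascript
app.setHandler({
    FindRestaurantIntent() {
        // Jovo maps platform-specific slots/entities to "inputs"
        const type = this.$inputs.type ? this.$inputs.type.value : 'any';

        // ask() keeps the session open and waits for the next user input
        this.ask(`Looking for a ${type} restaurant. Should it be cheap or fancy?`);
    },
});
```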
The Jovo framework currently supports Amazon Alexa, Google Assistant, Samsung Bixby, Facebook Messenger, Twilio Autopilot, Dialogflow, and even your own web client. In this section, you can learn more about what you need to set up on the respective Developer Consoles in order to make your voice apps work.
See also: Beginner Tutorial: Build an Alexa Skill in Node.js with Jovo.
On devices of the Amazon Echo product suite, or other devices that support Amazon Alexa, users can access so-called Skills by asking the device something like this:
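"Alexa, open restaurant finder" (the invocation name "restaurant finder" is just a hypothetical example here).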
In order to build a skill for Amazon Alexa, you need to create an account on the Amazon Developer Portal. You can find an official step-by-step guide by Amazon here.
See also: Beginner Tutorial: Build a Google Action in Node.js with Jovo.
On Google Home, users converse with the Google Assistant. Apps for the Assistant are called Actions on Google (however, they're now in the process of switching to just calling them Apps).
They can be accessed like this:
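"Hey Google, talk to restaurant finder" (again using a hypothetical invocation name).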
Here is the official Google resource: Invocation and Discovery.
To build a voice app for Google Assistant and Google Home, you need to create a project on the Actions on Google Console. For interpreting the natural language of your users' speech input, you can use different kinds of integrations. Most developers use Dialogflow ES (formerly Dialogflow, API.AI) for the language model.
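For instance, in a Jovo (v3) project, the project.js configuration could point the Google Action at Dialogflow as its NLU, roughly like this (a sketch, assuming the default Jovo project structure):

```javascript
// project.js — sketch of the platform/NLU configuration
module.exports = {
    alexaSkill: {
        nlu: 'alexaNlu',
    },
    googleAction: {
        // Use Dialogflow to interpret the user's natural language input
        nlu: 'dialogflow',
    },
    // Placeholder resolved by the Jovo CLI to your webhook URL
    endpoint: '${JOVO_WEBHOOK_URL}',
};
```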
See also: Beginner Tutorial: How to use Samsung Bixby with the Jovo Framework.
Apps for Samsung's Bixby are called Capsules. While Bixby shares many basic development concepts with Alexa and Google Assistant, Bixby Capsules follow a declarative development process.
You can find out more in our Bixby documentation.