Technical Details

Sound AI

Sound AI technology

Sound contains a lot of information that cannot be expressed in text. We have expertise in detecting special moments or contexts using sound. Based on our team members' 10+ years of academic and industrial experience, we provide sound analysis and machine learning technology to our customers.

  • Sound event detection
  • Voice separation
  • Sentiment recognition


Our sound AI technology is built on a large real-world sound dataset that we gathered ourselves.

14,000+ GB

Sound data

For companies that want to apply sound AI to their products and services, the challenge starts with data gathering. We have large labeled sound datasets, recorded in various environments, for training production-level deep learning models. If you are interested in the dataset we gathered, please contact us.

  • Recording in real-home environments
  • Fully labeled
  • Various mic settings
How the dataset is gathered


Sound is heavily affected by its environment. When analyzing requirements and defining tasks for a production-level sound deep learning model, the space, the medium, the position and type of microphones, and other environmental details must be considered carefully.


Data Gathering

To create a sound AI model that works well in a real environment, factors such as microphone characteristics and the recording space must be considered from the data gathering stage onward. We also gather real-world noises to improve the performance of our sound deep learning models.
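One common way recorded noise is used in training (an assumption here, not a description of Deeply's pipeline) is to mix it into clean recordings at a chosen signal-to-noise ratio. The sketch below shows that idea on plain lists of samples; the function name `mix_at_snr` and its parameters are hypothetical.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Mix a noise clip into a clean clip at a target SNR in decibels.

    Hypothetical helper for illustration: both inputs are equal-length
    lists of float samples.
    """
    if not clean or len(clean) != len(noise):
        raise ValueError("clean and noise must be non-empty and equal in length")

    # Average power of each clip (mean of squared samples).
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0:
        return list(clean)  # silent noise clip: nothing to mix in

    # SNR_dB = 10 * log10(clean_power / scaled_noise_power), solved for the gain.
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)

    return [c + gain * n for c, n in zip(clean, noise)]


# Toy example: at 0 dB the scaled noise has the same power as the clean clip.
mixed = mix_at_snr([1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5], snr_db=0.0)
print(mixed)  # [2.0, 0.0, 0.0, -2.0]
```

Lower `snr_db` values make the noise louder relative to the target sound, so sweeping this value is a simple way to cover quiet and noisy conditions.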


Data Refining

The collected data goes through a data refining process before deep learning model training begins. While this process requires a lot of human effort in most cases, we have minimized that effort by automating the data refining process.



We train our sound deep learning model using the real-world dataset we gathered. Corporate customers have verified that our deep learning model works well in real-world environments.







Core Technology

Deeply’s sound AI technology is built on top of advanced machine learning and signal processing techniques.

Sound Event Detection

Almost unconsciously, people can hear and tell when a baby cries, when a glass breaks, or when it rains. It may seem a simple task, but few machines can pull it off. Advances in computer vision have influenced related fields, and computer audition in particular has made progress as striking as the computer vision that recognizes number plates or faces. Deeply dreams of going beyond simply owning such computer audition technologies. We dream of helping people live better lives through products and services that incorporate our technologies.

Currently on the market, we have a mobile application equipped with a model that instantly detects when a baby cries; it distinguishes babies’ crying sounds from other indoor sounds with an F1 score of 93%. This result was obtained through rigorous testing with ### hours of our first-hand sound data gathered from actual homes, which sets it apart from results obtained using a few hours of public datasets. Moreover, our intelligence team has developed a model that distinguishes two types of baby cries, and that tells whether a voice is the mother’s or the father’s, with an F1 score of 90%.

Instead of simply increasing the number of sound types we can differentiate, we strive to engineer a model that is more stable and dependable. We are also conducting R&D on core technology for spotting women’s screams in the street, detecting particular voices, and more.
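For readers unfamiliar with the F1 scores quoted above: F1 is the harmonic mean of precision and recall. The sketch below computes it from detection counts; the function name and the example counts are hypothetical and are not Deeply's evaluation data.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1 score, the harmonic mean of precision and recall."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # precision or recall is undefined; report 0 by convention

    # Precision: share of detections that were real events.
    precision = true_positives / predicted_positive
    # Recall: share of real events that were detected.
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0

    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 93 cries detected, 7 false alarms, 7 missed cries.
print(round(f1_score(93, 7, 7), 2))  # 0.93
```

Because F1 penalizes both false alarms and missed events, it is a stricter summary than plain accuracy when the target sound is rare compared with background sound.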

Context Recognition

Even for us humans, it’s not always easy to get the full picture from a sound alone. Can you always tell someone’s feelings just by hearing their voice? Or tell where a sound is coming from just by hearing it? This is why grasping the context of a sound is challenging, beyond simply detecting it. Some people with exceptional abilities can of course read others’ emotional states quite accurately, or tell where a snippet of sound comes from if the place is familiar. Can cutting-edge artificial intelligence achieve such sophistication?

Our first objective at Deeply was to develop a model that detects a baby’s status from their crying sounds. It is not an easy task, but we believe in the value it can offer first-time parents. Our current model, which classifies a baby’s status into five categories, achieves an F1 score of approximately 80% in a controlled environment. We are also putting effort into R&D on reading negative affect in people’s voices and on inferring social relationships among people, including the one between a baby and their caregiver.

Impulse Response-based

Applying these auditory models in the real world comes with several constraints. It requires relatively large amounts of data, a consequence of the nature of sound. A sound, which is the propagation of vibration through a medium, is influenced by the physical characteristics of the medium, the distance from the sound source, and the physical characteristics of the space. Converting that analog wave into a digital signal is in turn influenced by the characteristics of the microphone and the digital circuitry. In addition, real-world spaces come with unpredictable noise. Deeply’s learning model has been developed based on the impulse response (IR), the output of a space when it is presented with an impulse. Using the IR gives a better understanding of the characteristics of sounds coming from that space, which means a model can achieve better performance in that particular space.
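As a rough illustration of what an impulse response captures, a dry recording can be convolved with a space's IR to approximate how it would sound in that space. The sketch below shows this standard signal processing operation in plain Python; it is an assumption about how an IR might be applied, not a description of Deeply's model, and the name `apply_impulse_response` is hypothetical.

```python
def apply_impulse_response(dry_signal, impulse_response):
    """Convolve a dry signal with an impulse response (direct form).

    Both arguments are lists of float samples. The result has
    len(dry_signal) + len(impulse_response) - 1 samples.
    """
    if not dry_signal or not impulse_response:
        return []

    wet = [0.0] * (len(dry_signal) + len(impulse_response) - 1)
    for i, sample in enumerate(dry_signal):
        for j, tap in enumerate(impulse_response):
            # Each input sample excites a scaled, delayed copy of the IR.
            wet[i + j] += sample * tap
    return wet


# Toy IR: direct sound plus one echo at half amplitude, one sample later.
print(apply_impulse_response([1.0, 2.0], [1.0, 0.5]))  # [1.0, 2.5, 1.0]
```

The direct form above is quadratic in signal length; for real recordings an FFT-based convolution from a numerical library would normally be used instead.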

Keep in touch with us

Deeply has successfully developed a technology that detects specific sounds in special environments such as smart cities, factories, power plants, and the military.
Apply sound AI technology to various industries now.

Contact us