NLP A-to-Z: From data collection to a fully trained model
Data is everywhere. It is used in every aspect of our lives: health, insurance, finance, travel, science, you name it. Collecting quality data efficiently is a challenging task: we need to define the right target in advance, reach the right audience, and make sure the data quality is optimal. Good data can then be used to train ML models that power the applications above. In these talks you will learn how to gather quality data, from data scraping and data annotation through to building a full model, and how data quality affects the quality of results.
Data scraping
Itamar Abramovich, Director of Data Products, BrightData
Fact: the internet is the largest database ever created. It is where our markets, industries, and the public's reality unfold by the second. To remain competitive and relevant, every company, organization, and business must tap into web data. Even the once most reluctant organizations, such as banks and financial services firms, have now turned to it.
In this session, Bright Data's Director of Data Products, Itamar Abramovich, will use real-life examples to show why and how web data has made, and continues to make, a major difference in companies' growth strategies.
Bright Data is the industry-leading web data platform, with over 15,000 customers and partners across every industry. The company has made it its mission to deliver quality, reliable public web data with ease and simplicity.
Join this expert presentation for an up-close look at how web data can help you solve some of your most critical challenges today.
Off-the-shelf solutions will only get you so far
Shay Hummel, Director of Knowledge Mining, SparkBeyond
While knowledge is power, it is often fragmented, disorganized, and inaccessible. SparkBeyond's Knowledge Mining system parses the web's wealth of unstructured data to deliver contextual answers to high-stakes problems. Our knowledge graph generator, Knomi, strings together multiple search engines with state-of-the-art (SOTA) language models to produce structured responses. Off-the-shelf models produced impressive results, yet for many applications their accuracy was not sufficient.
We had to adapt the models to our use case. We will describe how we incorporate real-life labeled data to train a bespoke model on top of the off-the-shelf ones, discuss the importance of diverse datasets, and show how this work improved accuracy and customer satisfaction.
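As a minimal sketch of this general pattern (not SparkBeyond's actual code), one can train a lightweight supervised model on labeled examples, using features produced by an off-the-shelf model; the encoder name, example texts, and labels below are all illustrative assumptions.

```python
# Minimal sketch: a bespoke classifier on top of an off-the-shelf model.
# Encoder choice, example data, and labels are hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Off-the-shelf encoder provides general-purpose text features.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Real-life labeled data: candidate responses marked relevant (1) or noise (0).
texts = [
    "The plant's output rose 12% after the retrofit.",
    "Click here to subscribe to our newsletter.",
    "Regulators approved the merger in Q3.",
    "Best deals on shoes this weekend!",
]
labels = [1, 0, 1, 0]

# The bespoke layer: a small model trained on our own labels.
features = encoder.encode(texts)
clf = LogisticRegression().fit(features, labels)

# At serving time, filter or rerank off-the-shelf outputs with it.
candidates = ["Quarterly revenue grew 8% year over year."]
print(clf.predict_proba(encoder.encode(candidates)))
```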
End-to-end question answering on a handheld device for the benefit of people with reading difficulties
Tal Rosenwein, VP of R&D, AI and Algorithms, OrCam
Dyslexia affects 15-20% of the world's population. It is a language-based learning disability that causes difficulties with specific language skills, particularly reading, and often with related skills such as spelling, pronouncing words, and reading comprehension.
In this talk, I will present a question-answering feature that helps improve reading comprehension. Using a voice-based interface, the user can query physical documents (e.g., books, newspapers) captured by the OrCam device, and the answer is played back through the speakers. This feature incorporates models from multiple domains: computer vision (CV), optical character recognition (OCR), automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS).
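To picture how such components compose, here is a rough sketch using open-source stand-ins rather than OrCam's actual stack; the libraries, models, and file names are assumptions for illustration only.

```python
# Hypothetical camera-to-speech QA pipeline (not OrCam's implementation).
from PIL import Image
import pytesseract                  # OCR
from transformers import pipeline   # ASR + extractive QA
import pyttsx3                      # TTS

# OCR: turn the captured page image into text.
context = pytesseract.image_to_string(Image.open("captured_page.png"))

# ASR: transcribe the user's spoken question.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
question = asr("question.wav")["text"]

# NLP: extractive question answering over the recognized text.
qa = pipeline("question-answering")
answer = qa(question=question, context=context)["answer"]

# TTS: read the answer back through the speakers.
tts = pyttsx3.init()
tts.say(answer)
tts.runAndWait()
```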
Get in touch with us
Join our Slack community: https://toloka.ai/community
Check out more of our events: https://toloka.ai/events
Read our Medium: https://medium.com/toloka
Follow us on social media to make sure you don't miss any updates.
Twitter: https://twitter.com/TolokaAI
Facebook: https://facebook.com/globaltoloka
LinkedIn: https://linkedin.com/company/toloka/
00:00:00 - Beginning
00:03:15 - Data scraping
00:30:20 - Off-the-shelf solutions will only get you so far
00:52:45 - End-to-end question answering on a handheld device for the benefit of people with reading difficulties
01:19:55 - Toloka with adaptive models