Hands-on - Building Image Gen App using SDXL and Streamlit

In this session, I built a web application that generates images from a text prompt using SDXL. You can visit the web app here: https://endjourney.streamlit.app/

Hands-On - Evidently AI

Official Website
Evidently AI is an open-source platform for evaluating, testing, and monitoring machine learning models. It helps data scientists and machine learning practitioners understand, visualize, and communicate the behavior of their models through reports on model performance, data quality, and data drift. By surfacing performance problems and potential biases, Evidently AI contributes to building more trustworthy and understandable artificial intelligence applications.

Python Hands-on - Python Data Frame Benchmark

The code repo for this session's benchmark can be found here

Data Jam EP 6 - Rotary Position Embedding (RoPE)

Rotary Position Embedding, or RoPE, is a type of position embedding that encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation.
Notably, RoPE comes with valuable properties: the flexibility to extend to any sequence length, inter-token dependency that decays with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
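
To make the rotation idea concrete, here is a minimal pure-Python sketch. It assumes the common formulation in which consecutive pairs of embedding dimensions are rotated by an angle proportional to the token position, with per-pair frequencies derived from a base of 10000. The names `rope` and `dot` and the sample vectors are illustrative and not taken from the session material.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by position * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)  # frequency for the i-th pair (assumed convention)
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation of the pair
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative query and key vectors (made-up numbers).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

The two attention scores agree because rotating the query by position m and the key by position n leaves a dot product that depends only on m - n. This is how an absolute encoding ends up expressing relative position.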

Data Jam EP 5 - Positional Encoding in Transformer

In language, the order of words and their positions in a sentence really matter.
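
A small illustration of why order matters, using made-up sentences: a representation that ignores position cannot tell the two sentences below apart, while pairing each token with its index can. This is only a toy sketch of the motivation, not the encoding scheme discussed in the session.

```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()

# A bag of words discards order, so the two sentences look identical.
same_bag = Counter(a) == Counter(b)

# Pairing each token with its index keeps the order information.
same_with_positions = list(enumerate(a)) == list(enumerate(b))

print(same_bag, same_with_positions)
```

Self-attention on its own treats its input much like the bag above, which is why Transformers add some form of positional signal to the token embeddings.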

Data Jam EP 4 - Tokenizer

Overview

Data Jam EP 3 - Multimodal Deep Learning

Overview

Multimodal deep learning is an approach in machine learning that focuses on processing and understanding data from multiple modalities or sources, such as text, images, audio, and more. This approach aims to leverage the complementary information provided by these different data types to improve the accuracy and richness of machine learning models (by GPT).

Data Jam EP 2 - Anomaly Detection

Anomaly Types

Point or Global Anomalies → specific data points
A point anomaly is a single data point that stands out from the expected pattern, range, or norm. In other words, the data point is unexpected.
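
One simple way to flag point anomalies is a robust z-score based on the median and the median absolute deviation (MAD). The sketch below is only one possible approach, chosen here for illustration. The function name, the sample readings, and the conventional 3.5 cutoff are assumptions, not something prescribed by the session.

```python
import statistics

def point_anomalies(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure deviations against
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold  # modified z-score
    ]

# Illustrative sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]
print(point_anomalies(readings))
```

Using the median instead of the mean keeps the outlier itself from distorting the baseline it is compared against.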

Data Jam EP 1 - Explore Large Language Model

LLM

A Very Gentle Introduction to Large Language Models without the Hype - by Mark Riedl - Medium
Awesome-LLM: a curated list of Large Language Model
Llama-cpp - 🦜️🔗 Langchain
Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4
Introduction - CS324
COS 597G (Fall 2022): Understanding Large Language Models
Large Language Model Text Generation Inference

TorchSenti - Sentiment Analysis Framework for Researchers with PyTorch

TorchSenti is a natural language processing library focused on sentiment analysis. It aims to provide sentiment analysis datasets and pre-trained models. The library is built on top of PyTorch, and we want to support the research community in expanding knowledge and to welcome contributors who help solve current problems. These features and resources help NLP researchers benchmark and evaluate their proposed methods. The library can also serve as a starting point for anyone who wants to learn sentiment analysis in depth. Find the details in the repository.

Maleo - A Text Preprocessing Tool for Natural Language Processing

Data scientists spend more than half of their time on data cleansing, including for text. To address this problem, we at Jakarta Research are building a tool that makes it easier for data scientists to clean text data, for example by removing hyperlinks, punctuation, and typos. You can find more on GitHub.
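
As a rough idea of what this kind of cleaning involves, here is a standard-library sketch that strips hyperlinks and punctuation and collapses whitespace. It does not use Maleo's API. The function name `clean_text`, the URL pattern, and the sample sentence are all hypothetical, and typo handling is left out.

```python
import re
import string

# Simple URL pattern (an assumption; real-world URLs can be messier).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Remove hyperlinks and ASCII punctuation, then normalize whitespace."""
    text = URL_RE.sub(" ", text)        # drop hyperlinks first, while "://" is intact
    text = text.translate(PUNCT_TABLE)  # drop punctuation characters
    return " ".join(text.split())       # collapse repeated whitespace

sample = "Check this out!!! https://example.com/page?id=1 ... so cool, right?"
print(clean_text(sample))
```

Removing URLs before punctuation matters: stripping punctuation first would break the URL apart and leave fragments behind.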

Experiments on Paraphrase Identification using Quora Question Pairs

We modeled the Quora Question Pairs dataset to identify whether two questions are similar. The dataset is provided by Quora, and the task is binary classification. We tried several methods and algorithms, taking a different approach from previous work. For feature extraction, we used Bag of Words representations, including Count Vectorizer and unigram Term Frequency-Inverse Document Frequency (TF-IDF), with XGBoost and CatBoost. We also experimented with a WordPiece tokenizer, which improved model performance significantly. We achieved up to 97 percent accuracy. Code and Dataset
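
For readers unfamiliar with the unigram TF-IDF features mentioned above, the following is a minimal pure-Python sketch. It is not the experiment's code. It uses plain `tf * log(N / df)` weighting without the smoothing that libraries typically apply, and the function name and sample questions are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using unigram TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Illustrative questions (not from the Quora dataset).
questions = [
    "how do i learn python",
    "how can i learn python quickly",
    "what is the capital of france",
]
vectors = tfidf(questions)
print(sorted(vectors[0], key=vectors[0].get, reverse=True)[:2])
```

Terms shared by many questions receive low weights, while terms that are distinctive for one question receive high weights. Vectors like these can then be fed to a classifier.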