<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[TextLab]]></title><description><![CDATA[TextLab]]></description><link>https://textlab.razzi.my</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 20:11:02 GMT</lastBuildDate><atom:link href="https://textlab.razzi.my/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Byte-Pair Encoding for Text Tokenization in Natural Language Processing]]></title><description><![CDATA[1. Introduction
Modern natural language processing (NLP) systems do not operate directly on raw text. Instead, text must first be transformed into a structured representation that models can process efficiently. One of the most important steps in thi...]]></description><link>https://textlab.razzi.my/byte-pair-encoding-for-text-tokenization-in-natural-language-processing</link><guid isPermaLink="true">https://textlab.razzi.my/byte-pair-encoding-for-text-tokenization-in-natural-language-processing</guid><category><![CDATA[natural language processing]]></category><category><![CDATA[Byte Pair Encoding]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 02 Feb 2026 02:59:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770001305515/e8265d88-cc91-4c74-b739-ea3dd8ef4a36.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-1-introduction">1. Introduction</h2>
<p>Modern natural language processing (NLP) systems do not operate directly on raw text. Instead, text must first be transformed into a structured representation that models can process efficiently. One of the most important steps in this transformation is <strong>tokenization</strong>, the process of breaking text into smaller units called <em>tokens</em>.</p>
<p>Early NLP systems often relied on word-level tokenization, where each distinct word is treated as an atomic unit. While simple, this approach suffers from several limitations, including large vocabularies, poor handling of rare or unseen words, and difficulties with morphologically rich languages. To address these issues, subword-based tokenization methods were introduced. Among them, <strong>Byte-Pair Encoding (BPE)</strong> has become one of the most influential and widely adopted techniques.</p>
<h2 id="heading-2-motivation-for-subword-tokenization">2. Motivation for Subword Tokenization</h2>
<p>Word-level tokenization assumes that words are the smallest meaningful units. In practice, this assumption fails for several reasons:</p>
<ol>
<li><p><strong>Vocabulary Explosion</strong><br /> Real-world corpora contain millions of distinct word forms due to inflection, compounding, spelling variations, and domain-specific terms.</p>
</li>
<li><p><strong>Out-of-Vocabulary (OOV) Words</strong><br /> New or rare words cannot be represented if they are not present in the predefined vocabulary.</p>
</li>
<li><p><strong>Morphological Complexity</strong><br /> Many languages encode meaning through prefixes, suffixes, and stems that are shared across words.</p>
</li>
</ol>
<p>Subword tokenization solves these problems by representing text as sequences of smaller units that can be recombined to form words. BPE is one of the earliest and most effective algorithms to implement this idea.</p>
<h2 id="heading-3-what-is-byte-pair-encoding">3. What Is Byte-Pair Encoding?</h2>
<p>Byte-Pair Encoding is a <strong>data-driven, frequency-based tokenization algorithm</strong> that constructs a subword vocabulary by repeatedly merging the most frequent adjacent symbols in a corpus.</p>
<p>Originally developed as a data compression technique, BPE was later adapted for NLP to learn meaningful subword units automatically, without requiring linguistic rules or annotations.</p>
<p>In modern NLP, BPE typically operates at the <strong>byte or character level</strong>, allowing it to represent any text, including numbers, punctuation, and Unicode symbols.</p>
<h2 id="heading-4-the-bpe-training-process">4. The BPE Training Process</h2>
<p>The BPE algorithm consists of a training phase followed by an encoding phase.</p>
<h3 id="heading-41-initial-representation">4.1 Initial Representation</h3>
<p>Text is first broken into basic symbols. In byte-level BPE, each symbol corresponds to a UTF-8 byte. This guarantees that every possible character can be represented, regardless of language or script.</p>
<p>For example, a word is initially represented as a sequence of individual bytes.</p>
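<p>A minimal sketch of this step (the sample word is our own illustrative choice, not taken from any particular tokenizer):</p>

```python
# Sketch: the initial byte-level representation of a word (UTF-8).
word = "café"
symbols = [bytes([b]) for b in word.encode("utf-8")]
print(symbols)  # 5 one-byte symbols; the accented "é" occupies two bytes
```

<p>Note that a single visible character may span several byte symbols, which is why byte-level BPE never needs an unknown token.</p>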
<h3 id="heading-42-counting-adjacent-pairs">4.2 Counting Adjacent Pairs</h3>
<p>The algorithm scans the entire training corpus and counts how often each <strong>adjacent pair of symbols</strong> appears.</p>
<p>Example (conceptual):</p>
<ul>
<li><p><code>t + h</code></p>
</li>
<li><p><code>h + e</code></p>
</li>
<li><p><code>e + r</code></p>
</li>
</ul>
<p>Each pair is associated with a frequency count.</p>
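<p>The counting step can be sketched in a few lines; the toy corpus of character-split words below is our own example:</p>

```python
from collections import Counter

# Sketch: counting adjacent symbol pairs over a toy corpus of
# character-split words (the words here are illustrative only).
corpus = [list("the"), list("there"), list("then")]
pair_counts = Counter()
for symbols in corpus:
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))  # ('t','h') and ('h','e') appear in all three words
```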
<h3 id="heading-43-merging-the-most-frequent-pair">4.3 Merging the Most Frequent Pair</h3>
<p>The most frequent adjacent pair is merged into a new symbol. This new symbol becomes part of the growing vocabulary.</p>
<p>For instance, if <code>t + h</code> occurs most frequently, it may be merged into a single token <code>th</code>.</p>
<h3 id="heading-44-iterative-merging">4.4 Iterative Merging</h3>
<p>The merge process is repeated until a target number of merges is reached or no pair occurs more than once. Each iteration adds one new symbol to the vocabulary and creates larger subword units.</p>
<p>Over time, common words, prefixes, and suffixes naturally emerge as stable tokens.</p>
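<p>Steps 4.2–4.4 can be combined into a small training loop. This is a character-level sketch for readability, with a toy corpus of our own choosing, not a production implementation:</p>

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words."""
    counts = Counter()
    for symbols in corpus:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts

def merge_pair(symbols, pair, merged):
    """Replace every occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# Toy training: two merge iterations over a tiny character-level corpus.
corpus = [list("the"), list("there"), list("then"), list("the")]
merges = []
for _ in range(2):
    counts = get_pair_counts(corpus)
    if not counts:
        break
    best = counts.most_common(1)[0][0]          # most frequent adjacent pair
    merges.append(best)
    corpus = [merge_pair(s, best, best[0] + best[1]) for s in corpus]

print(merges)  # [('t', 'h'), ('th', 'e')] — "th" then "the" emerge as tokens
```

<p>The ordered list <code>merges</code> is the learned vocabulary of merge rules; it is all that is needed to encode new text later.</p>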
<h2 id="heading-5-encoding-new-text">5. Encoding New Text</h2>
<p>Once the merge rules have been learned, they are applied deterministically to new input text.</p>
<p>The encoding process:</p>
<ol>
<li><p>Converts input text into initial symbols (bytes or characters).</p>
</li>
<li><p>Applies the learned merge rules in the same order used during training.</p>
</li>
<li><p>Produces a sequence of subword tokens.</p>
</li>
</ol>
<p>This ensures that the same text is always tokenized consistently, which is essential for reproducibility and model training.</p>
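<p>The steps above can be sketched as follows; the three merge rules are a hypothetical table standing in for rules learned during training:</p>

```python
def merge_pair(symbols, pair, merged):
    """Replace every occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def encode(text, merges):
    """Apply the learned merge rules, in training order, to new text."""
    symbols = list(text)  # step 1: initial symbols (characters here)
    for a, b in merges:   # step 2: replay merges deterministically
        symbols = merge_pair(symbols, (a, b), a + b)
    return symbols        # step 3: final subword tokens

# Hypothetical merge table, in the order it was learned.
merges = [("t", "h"), ("th", "e"), ("e", "r")]
print(encode("there", merges))  # ['the', 'r', 'e']
print(encode("then", merges))   # ['the', 'n']
```

<p>Because the rules are replayed in a fixed order, the same input always yields the same token sequence.</p>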
<h2 id="heading-6-properties-of-bpe-tokenization">6. Properties of BPE Tokenization</h2>
<h3 id="heading-61-open-vocabulary">6.1 Open Vocabulary</h3>
<p>Because BPE operates on bytes or characters, it can represent any input string without generating unknown tokens.</p>
<h3 id="heading-62-frequency-driven-units">6.2 Frequency-Driven Units</h3>
<p>Tokens correspond to patterns that occur frequently in the training corpus. Common words and morphemes become single tokens, while rare words are decomposed into smaller units.</p>
<h3 id="heading-63-language-agnostic">6.3 Language-Agnostic</h3>
<p>BPE does not rely on linguistic rules. It works equally well for different languages, scripts, and mixed-language text.</p>
<h3 id="heading-64-deterministic-and-efficient">6.4 Deterministic and Efficient</h3>
<p>Once trained, BPE tokenization is fast and deterministic, making it suitable for large-scale NLP systems.</p>
<h2 id="heading-7-demo">7. Demo</h2>
<h3 id="heading-71-corpus-example">7.1. Corpus Example:</h3>
<pre><code class="lang-plaintext">Byte Pair Encoding is a tokenization method used in natural language processing.
Byte Pair Encoding learns frequent byte pairs and merges them repeatedly.

In cloud security research, tokenization helps encode malicious payloads.
Cloud-native malware exploits cloud infrastructure and cloud services.
Cloud security analysts study cloud logs, cloud traffic, and cloud threats.

Tokenization transforms text into tokens.
Tokens may represent words, subwords, or byte sequences.
Encoding text into tokens allows efficient storage and modeling.

Cyber threat analysis often involves malware detection and payload decoding.
Malware payloads may contain encoded strings, repeated patterns, and obfuscated data.
Repeated patterns help Byte Pair Encoding learn stable subword units.

Unicode examples: café naïve résumé señor 😀 😀 😀
Numbers and versions: version1 version2 version10 v1.0 v1.1 v1.2
</code></pre>
<h3 id="heading-72-test-example-1">7.2. Test Example 1:</h3>
<pre><code class="lang-plaintext">Tokenization and encoding enable efficient cloud security analysis.
</code></pre>
<h3 id="heading-73-test-example-2-subword-heavy-example">7.3. Test Example 2 (Subword-heavy example):</h3>
<pre><code class="lang-plaintext">Tokenization and encoding enable efficient cloud security analysis.
</code></pre>
<h3 id="heading-74-test-example-3-unicode-focused">7.4. Test Example 3 (Unicode-focused):</h3>
<pre><code class="lang-plaintext">Malware café payload naïve résumé 😀
</code></pre>
<h3 id="heading-75-test-example-4-numeric-versioning">7.5. Test Example 4 (Numeric / versioning):</h3>
<pre><code class="lang-plaintext">cloud-security v1.2 tokenization version10 encoding
</code></pre>
<h3 id="heading-76-test">7.6. Test</h3>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://codepen.io/mohamad-razzi/full/xbOWvJo">https://codepen.io/mohamad-razzi/full/xbOWvJo</a></div>
<h2 id="heading-8-why-bpe-is-widely-used-in-modern-nlp">8. Why BPE Is Widely Used in Modern NLP</h2>
<p>BPE has become a foundational component in many transformer-based language models and NLP pipelines because it strikes a balance between:</p>
<ul>
<li><p>Vocabulary size</p>
</li>
<li><p>Expressiveness</p>
</li>
<li><p>Robust handling of rare and unseen words</p>
</li>
</ul>
<p>Its simplicity, efficiency, and compatibility with neural models have made it a standard choice for text representation.</p>
<h2 id="heading-9-interpreting-bpe-through-visualization">9. Interpreting BPE Through Visualization</h2>
<p>When visualized, BPE tokenization reveals how text is gradually segmented into meaningful subword units. Common sequences appear as larger tokens, while less frequent patterns remain split.</p>
<p>Such visualizations help illustrate:</p>
<ul>
<li><p>How repetition influences learned tokens</p>
</li>
<li><p>Why certain substrings are merged</p>
</li>
<li><p>How Unicode characters are handled at the byte level</p>
</li>
</ul>
<p>These insights are especially valuable for teaching and debugging NLP systems.</p>
<h2 id="heading-10-conclusion">10. Conclusion</h2>
<p>Byte-Pair Encoding is a powerful and conceptually simple approach to subword tokenization. By learning frequent patterns directly from data, it avoids the limitations of word-level tokenization while remaining efficient and language-independent.</p>
<p>Understanding BPE is essential for anyone working with modern NLP models, as it forms the bridge between raw text and numerical representations used by machine learning systems. When paired with interactive visualizations, BPE becomes not only a practical tool but also an intuitive framework for understanding how machines process language.</p>
<h2 id="heading-kirwn6stkio"><strong>🤓</strong></h2>
<h2 id="heading-appendix">Appendix</h2>
<p><a target="_blank" href="https://web.stanford.edu/~jurafsky/slp3/2.pdf">https://web.stanford.edu/~jurafsky/slp3/2.pdf</a></p>
]]></content:encoded></item><item><title><![CDATA[Text Representation Basics for Natural Language Processing [Interactive Simulation]]]></title><description><![CDATA[[1] Introduction
Natural Language Processing (NLP) aims to enable computational systems to understand, analyze, and generate human language. A fundamental challenge in NLP lies in the representation of language in a form that machines can process eff...]]></description><link>https://textlab.razzi.my/text-representation-basics-for-natural-language-processing-interactive-simulation</link><guid isPermaLink="true">https://textlab.razzi.my/text-representation-basics-for-natural-language-processing-interactive-simulation</guid><category><![CDATA[natural language processing]]></category><category><![CDATA[Tokenization]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Mon, 02 Feb 2026 01:43:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769996097089/be2ca449-7145-4a87-ae0b-c1ccd57960f2.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-1-introduction"><strong>[1] Introduction</strong></h2>
<p>Natural Language Processing (NLP) aims to enable computational systems to understand, analyze, and generate human language. A fundamental challenge in NLP lies in the representation of language in a form that machines can process effectively. The success of any NLP system depends heavily on how textual data is represented before modeling and analysis.</p>
<p>Text is not a flat or uniform object. Instead, it is naturally organized into multiple hierarchical levels, ranging from large collections of texts to smaller linguistic units. These levels include corpora, documents, sentences, words, and finally tokens, which serve as the computational units consumed by NLP algorithms.</p>
<p>This article presents a foundational overview of text representation in NLP, structured from the highest level of abstraction to the most granular.</p>
<h2 id="heading-2-demo"><strong>[2] Demo</strong></h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://codepen.io/mohamad-razzi/full/raLdobb">https://codepen.io/mohamad-razzi/full/raLdobb</a></div>
<h2 id="heading-3-corpus"><strong>[3] Corpus</strong></h2>
<p>A <strong>corpus</strong> is a structured collection of texts assembled for linguistic analysis or computational modeling. Corpora serve as the primary data source for NLP systems and are used to train, evaluate, and benchmark language models.</p>
<p>Corpora may consist of written text, spoken language transcripts, social media posts, or multimodal content. They can be monolingual or multilingual and may contain language variation such as dialects, informal usage, or code-switching. Because linguistic patterns vary significantly across contexts, corpus composition has a direct impact on the generalizability and fairness of NLP systems.</p>
<h2 id="heading-4-document"><strong>[4] Document</strong></h2>
<p>Within a corpus, a <strong>document</strong> represents an individual unit of text, such as an article, report, transcript, message, or post. Documents are typically treated as coherent units produced by a single author or speaker, often associated with metadata such as time, source, or author identity.</p>
<p>In NLP, documents often define the scope of analysis for tasks such as classification, information retrieval, topic modeling, and sentiment analysis.</p>
<h2 id="heading-5-sentence"><strong>[5] Sentence</strong></h2>
<p>A <strong>sentence</strong> is a smaller linguistic unit within a document, commonly associated with a complete thought or proposition. In written text, sentence boundaries are typically indicated by punctuation, while in spoken language they correspond more closely to utterances rather than strictly grammatical sentences.</p>
<p>Many NLP systems operate at the sentence level before aggregating results at the document or corpus level.</p>
<h2 id="heading-6-words"><strong>[6] Words</strong></h2>
<p>Traditionally, <strong>words</strong> are viewed as the fundamental building blocks of sentences. In corpus analysis, it is important to distinguish between <strong>word types</strong> and <strong>word instances</strong>. Word types are unique vocabulary items, while word instances are the individual occurrences of those items in the text. Vocabulary size grows continuously as corpus size increases, a phenomenon that poses challenges for computational models that rely on fixed vocabularies.</p>
<p>Because of these issues, word-based representations often suffer from sparsity and the presence of unseen or rare words. These limitations have motivated the shift toward subword-based approaches in modern NLP.</p>
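<p>The type/instance distinction can be illustrated with a toy corpus of our own choosing:</p>

```python
# Sketch: word types vs. word instances in a toy corpus.
text = "the cat sat on the mat and the cat slept"
instances = text.split()   # every occurrence counts
types = set(instances)     # unique vocabulary items only

print(len(instances), len(types))  # 10 instances, 7 types
```

<p>Here "the" contributes three instances but only one type, which is exactly the gap that makes fixed vocabularies grow so quickly on real corpora.</p>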
<h2 id="heading-7-tokenization"><strong>[7] Tokenization</strong></h2>
<p><strong>Tokenization</strong> is the process of converting raw text into a sequence of tokens that serve as the input to NLP models. Tokens are not necessarily equivalent to words; they may correspond to words, subwords, characters, or even bytes, depending on the tokenization strategy employed.</p>
<p>Modern NLP systems may also rely on <strong>subword tokenization</strong> methods. By decomposing rare or unseen words into smaller, reusable units, subword tokenization addresses the unknown-word problem and enables models to generalize beyond their training data.</p>
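<p>A minimal sketch of the granularities mentioned above, applied to one sentence of our own choosing:</p>

```python
# Sketch: one sentence tokenized at three granularities.
text = "unseen words"

word_tokens = text.split()                # word level
char_tokens = list(text)                  # character level
byte_tokens = list(text.encode("utf-8"))  # byte level (integers 0-255)

print(word_tokens)                         # ['unseen', 'words']
print(len(char_tokens), len(byte_tokens))  # 12 12 (pure ASCII: one byte per character)
```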
<h2 id="heading-8-discussion"><strong>[8] Discussion</strong></h2>
<p>The progression from corpus to tokenization reflects a hierarchy of abstraction in text representation. Each level imposes structure on language data, enabling computational systems to manage linguistic complexity.</p>
<p>In modern NLP, this hierarchy is tightly integrated into end-to-end pipelines.</p>
<h2 id="heading-9-conclusion"><strong>[9] Conclusion</strong></h2>
<p>Text representation forms the foundation of natural language processing. By organizing language data into a hierarchy of corpora, documents, sentences, words, and tokens, NLP systems transform raw text into structured input suitable for computational modeling. Each level introduces distinct challenges and design choices that directly influence system performance, robustness, and generalizability.</p>
<h2 id="heading-kirwn6stkio"><strong>🤓</strong></h2>
<h2 id="heading-appendix"><strong>Appendix</strong></h2>
<p>[1] Speech and Language Processing. Daniel Jurafsky &amp; James H. Martin</p>
<p><a target="_blank" href="https://ontheline.trincoll.edu/images/bookdown/sample-local-pdf.pdf">https://ontheline.trincoll.edu/images/bookdown/sample-local-pdf.pdf</a></p>
]]></content:encoded></item><item><title><![CDATA[WordNet: Search Synsets]]></title><description><![CDATA[[1] Find word association, return associated synsets, part-of-speech and sense number
import nltk
from nltk.corpus import wordnet as wn

# Ensure you have the WordNet data
nltk.download('wordnet')

def find_associations(word):
    synsets = wn.synsets(word...]]></description><link>https://textlab.razzi.my/wordnet-search-synsets</link><guid isPermaLink="true">https://textlab.razzi.my/wordnet-search-synsets</guid><category><![CDATA[Wordnet]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Wed, 02 Oct 2024 23:36:01 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-1-find-word-association-return-associated-synsets-part-of-speech-and-sense-number">[1] Find word association, return associated synsets, part-of-speech and sense number</h1>
<pre><code class="lang-plaintext">import nltk
from nltk.corpus import wordnet as wn

# Ensure you have the WordNet data
nltk.download('wordnet')

def find_associations(word):
    synsets = wn.synsets(word)
    associations = {
        'synonyms': set(),
        'hypernyms': set(),
        'hyponyms': set(),
        'similar_words': set()
    }

    for synset in synsets:
        # Get the formatted synset name (e.g., 'trout.n.02')
        synset_name = synset.name()

        # Get synonyms
        synonyms = synset.lemma_names()
        for synonym in synonyms:
            associations['synonyms'].add(f"{synonym} ({synset_name})")

        # Get hypernyms
        for hypernym in synset.hypernyms():
            hypernym_name = hypernym.name()
            associations['hypernyms'].add(f"{hypernym_name} ({hypernym_name})")

        # Get hyponyms
        for hyponym in synset.hyponyms():
            hyponym_name = hyponym.name()
            associations['hyponyms'].add(f"{hyponym_name} ({hyponym_name})")

        # Get similar words
        for similar in synset.similar_tos():
            similar_name = similar.name()
            associations['similar_words'].add(f"{similar_name} ({similar_name})")

    return associations

# Find associations for "positive"
positive_associations = find_associations('positive')
print("Associations for 'positive':")
for relation_type, words in positive_associations.items():
    print(f"{relation_type.capitalize()}: {list(words)}")

# Find associations for "trust"
trust_associations = find_associations('confident')
print("\nAssociations for 'confident':")
for relation_type, words in trust_associations.items():
    print(f"{relation_type.capitalize()}: {list(words)}")
</code></pre>
<p>output:</p>
<pre><code class="lang-plaintext">Associations for 'positive':
Synonyms: ['positive (positive.a.08)', 'irrefutable (incontrovertible.s.01)', 'prescribed (positive.s.05)', 'positively_charged (positive.s.10)', 'electropositive (positive.s.10)', 'confirming (positive.a.04)', 'positive (cocksure.s.01)', 'positive (positive.s.05)', 'positive (positive.n.01)', 'overconfident (cocksure.s.01)', 'plus (plus.s.02)', 'confident (convinced.s.01)', 'positive (positive.s.10)', 'positivistic (positivist.a.01)', 'positivist (positivist.a.01)', 'positive (incontrovertible.s.01)', 'cocksure (cocksure.s.01)', 'positive (positivist.a.01)', 'positive_degree (positive.n.01)', 'positive (plus.s.02)', 'positive (positive.a.01)', 'positive (convinced.s.01)', 'positive (positive.a.04)', 'convinced (convinced.s.01)', 'positive (positive.n.02)', 'incontrovertible (incontrovertible.s.01)', 'positive (positive.s.09)']
Hypernyms: ['film.n.03 (film.n.03)', 'adverb.n.02 (adverb.n.02)', 'adjective.n.01 (adjective.n.01)']
Hyponyms: []
Similar_words: ['advantageous.a.01 (advantageous.a.01)', 'formal.a.01 (formal.a.01)', 'certain.a.02 (certain.a.02)', 'affirmative.s.02 (affirmative.s.02)', 'plus.a.01 (plus.a.01)', 'gram-positive.s.01 (gram-positive.s.01)', 'constructive.s.02 (constructive.s.02)', 'confident.a.01 (confident.a.01)', 'charged.a.01 (charged.a.01)', 'undeniable.a.01 (undeniable.a.01)']

Associations for 'confident':
Synonyms: ['surefooted (confident.s.03)', 'confident (confident.a.01)', 'positive (convinced.s.01)', 'convinced (convinced.s.01)', 'confident (convinced.s.01)', 'confident (confident.s.03)', 'sure-footed (confident.s.03)']
Hypernyms: []
Hyponyms: []
Similar_words: ['reassured.s.01 (reassured.s.01)', 'certain.a.02 (certain.a.02)', 'capable.a.01 (capable.a.01)', 'self-assured.s.01 (self-assured.s.01)', 'cocksure.s.01 (cocksure.s.01)', 'assured.s.01 (assured.s.01)']
</code></pre>
<h3 id="heading-breakdown-of-the-output">Breakdown of the Output</h3>
<p>e.g. <code>Synonyms: ['positive (positive.a.08)', 'irrefutable (incontrovertible.s.01)', 'prescribed (positive.s.05)', 'positively_charged (positive.s.10)', 'electropositive (positive.s.10)', 'confirming (positive.a.04)', 'positive (cocksure.s.01)', 'positive (positive.s.05)', 'positive (positive.n.01)', 'overconfident (cocksure.s.01)', 'plus (plus.s.02)', 'confident (convinced.s.01)', 'positive (positive.s.10)', 'positivistic (positivist.a.01)', 'positivist (positivist.a.01)', 'positive (incontrovertible.s.01)', 'cocksure (cocksure.s.01)', 'positive (positivist.a.01)', 'positive_degree (positive.n.01)', 'positive (plus.s.02)', 'positive (positive.a.01)', 'positive (convinced.s.01)', 'positive (positive.a.04)', 'convinced (convinced.s.01)', 'positive (positive.n.02)', 'incontrovertible (incontrovertible.s.01)', 'positive (positive.s.09)']</code></p>
<ol>
<li><p><strong>General Format</strong>:</p>
<ul>
<li><p>Each entry in the list has the format <code>word (synset_name)</code>, where:</p>
<ul>
<li><p><code>word</code> is the synonym.</p>
</li>
<li><p><code>synset_name</code> indicates the synset to which the word belongs, with:</p>
<ul>
<li><p>The first part (e.g., <code>positive</code>) being the lemma (the base form of the word).</p>
</li>
<li><p>The second part (e.g., <code>a</code> for adjective, <code>n</code> for noun) indicating the part of speech.</p>
</li>
<li><p>The last part (e.g., <code>01</code>, <code>02</code>) indicating the sense number, which distinguishes between different meanings of the same word.</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Example Entries</strong>:</p>
<ul>
<li><p><code>'positive (positive.a.08)'</code>:</p>
<ul>
<li>This means that "positive" is an adjective (indicated by <code>a</code>) and belongs to the eighth sense of the word "positive" in WordNet.</li>
</ul>
</li>
<li><p><code>'irrefutable (incontrovertible.s.01)'</code>:</p>
<ul>
<li>"Irrefutable" is a synonym for "positive" and is associated with the first sense of the adjective "incontrovertible" (<code>s</code> indicating an adjective satellite).</li>
</ul>
</li>
<li><p><code>'prescribed (positive.s.05)'</code>:</p>
<ul>
<li>"Prescribed" is a synonym that is associated with the fifth sense of "positive."</li>
</ul>
</li>
<li><p><code>'confident (convinced.s.01)'</code>:</p>
<ul>
<li>"Confident" is synonymous with "positive" and is linked to the first sense of "convinced."</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Multiple Entries</strong>:</p>
<ul>
<li>Some synonyms appear multiple times with different senses. For instance, "positive" itself appears several times with different sense numbers (<code>positive.a.01</code>, <code>positive.s.05</code>, etc.), indicating that it has multiple meanings or usages in different contexts.</li>
</ul>
</li>
<li><p><strong>Diverse Synonyms</strong>:</p>
<ul>
<li>The list includes a mix of direct synonyms (words that can often replace "positive" in a sentence) and related terms that capture similar meanings or connotations but may not be interchangeable in all contexts.</li>
</ul>
</li>
</ol>
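<p>The <code>word (synset_name)</code> format above is easy to pull apart programmatically. The helper below is our own illustration, not an NLTK function:</p>

```python
# Sketch: splitting a synset name like 'positive.a.08' into its parts.
# (parse_synset_name is our own helper, not part of NLTK.)
def parse_synset_name(name):
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)

print(parse_synset_name("positive.a.08"))          # ('positive', 'a', 8)
print(parse_synset_name("incontrovertible.s.01"))  # ('incontrovertible', 's', 1)
```

<p>Using <code>rsplit</code> with a limit of 2 keeps any dots inside the lemma intact.</p>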
<h1 id="heading-2-find-word-association-return-associated-synsets">[2] Find word association, return associated synsets</h1>
<pre><code class="lang-plaintext">import nltk
from nltk.corpus import wordnet as wn

# Ensure you have the WordNet data
nltk.download('wordnet')

def find_associations(word):
    synsets = wn.synsets(word)
    associations = {
        'synonyms': set(),
        'hypernyms': set(),
        'hyponyms': set(),
        'similar_words': set()
    }

    for synset in synsets:
        # Get synonyms
        synonyms = synset.lemma_names()
        associations['synonyms'].update(synonyms)

        # Get hypernyms
        for hypernym in synset.hypernyms():
            associations['hypernyms'].update(hypernym.lemma_names())

        # Get hyponyms
        for hyponym in synset.hyponyms():
            associations['hyponyms'].update(hyponym.lemma_names())

        # Get similar words
        for similar in synset.similar_tos():
            associations['similar_words'].update(similar.lemma_names())

    return associations

# Find associations for "positive"
positive_associations = find_associations('positive')
print("Associations for 'positive':")
for relation_type, words in positive_associations.items():
    print(f"{relation_type.capitalize()}: {list(words)}")

# Find associations for "trust"
trust_associations = find_associations('confident')
print("\nAssociations for 'confident':")
for relation_type, words in trust_associations.items():
    print(f"{relation_type.capitalize()}: {list(words)}")
</code></pre>
<p>Output:</p>
<pre><code class="lang-plaintext">Associations for 'positive':
Synonyms: ['cocksure', 'irrefutable', 'prescribed', 'confident', 'positivistic', 'positive_degree', 'positivist', 'electropositive', 'incontrovertible', 'overconfident', 'plus', 'convinced', 'positive', 'positively_charged', 'confirming']
Hypernyms: ['adverb', 'photographic_film', 'adjective', 'film']
Hyponyms: []
Similar_words: ['plus', 'undeniable', 'affirmative', 'advantageous', 'sure', 'charged', 'optimistic', 'confident', 'certain', 'Gram-positive', 'constructive', 'formal']

Associations for 'confident':
Synonyms: ['sure-footed', 'confident', 'surefooted', 'convinced', 'positive']
Hypernyms: []
Hyponyms: []
Similar_words: ['cocksure', 'capable', 'assured', 'self-confident', 'sure', 'reassured', 'overconfident', 'self-assured', 'certain', 'positive']
</code></pre>
<h1 id="heading-3-part-of-speech-codes-in-wordnet">[3] Part of Speech Codes in WordNet</h1>
<ol>
<li><p><strong>n = Noun</strong></p>
<ul>
<li><p>Represents a person, place, thing, or idea.</p>
</li>
<li><p><strong>Example</strong>: <code>dog.n.01</code> refers to the first sense of "dog" as a noun.</p>
</li>
</ul>
</li>
<li><p><strong>v = Verb</strong></p>
<ul>
<li><p>Represents an action, occurrence, or state of being.</p>
</li>
<li><p><strong>Example</strong>: <code>run.v.01</code> refers to the first sense of "run" as a verb.</p>
</li>
</ul>
</li>
<li><p><strong>a = Adjective</strong></p>
<ul>
<li><p>Describes or modifies a noun, providing more information about it.</p>
</li>
<li><p><strong>Example</strong>: <code>happy.a.01</code> refers to the first sense of "happy" as an adjective.</p>
</li>
</ul>
</li>
<li><p><strong>s = Adjective Satellite</strong></p>
<ul>
<li><p>An adjective that belongs to a satellite synset: it expresses a shade of meaning of a "head" adjective and is linked to it through similarity relations rather than direct antonymy.</p>
</li>
<li><p><strong>Example</strong>: <code>cocksure.s.01</code> refers to the first sense of "cocksure," a satellite sense similar to the head adjective "confident."</p>
</li>
</ul>
</li>
<li><p><strong>r = Adverb</strong></p>
<ul>
<li><p>Modifies verbs, adjectives, or other adverbs, often indicating manner, place, time, frequency, degree, etc.</p>
</li>
<li><p><strong>Example</strong>: <code>quickly.r.01</code> refers to the first sense of "quickly" as an adverb.</p>
</li>
</ul>
</li>
</ol>
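<p>The five codes above can be mapped to readable names with a small helper; <code>POS_NAMES</code> and <code>describe</code> are our own illustrative functions, not part of NLTK:</p>

```python
# Sketch: mapping WordNet POS codes to readable names.
# (POS_NAMES and describe are our own helpers, not part of NLTK.)
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective",
             "s": "adjective satellite", "r": "adverb"}

def describe(synset_name):
    lemma, pos, sense = synset_name.rsplit(".", 2)
    return f"'{lemma}' ({POS_NAMES[pos]}, sense {int(sense)})"

print(describe("dog.n.01"))       # 'dog' (noun, sense 1)
print(describe("cocksure.s.01"))  # 'cocksure' (adjective satellite, sense 1)
```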
<h3 id="heading-explanation-of-each-part-of-speech">Explanation of Each Part of Speech</h3>
<ul>
<li><p><strong>Noun (n)</strong>: Nouns are fundamental to language, representing entities and concepts. They can be further categorized into concrete nouns (physical objects) and abstract nouns (ideas, qualities).</p>
</li>
<li><p><strong>Verb (v)</strong>: Verbs are crucial for expressing actions and states. They can indicate physical actions (like "run"), mental actions (like "think"), or states of being (like "exist").</p>
</li>
<li><p><strong>Adjective (a)</strong>: Adjectives enrich language by providing descriptive details about nouns. They help to convey qualities, quantities, or characteristics, enhancing the specificity of communication.</p>
</li>
<li><p><strong>Adjective Satellite (s)</strong>: Satellite adjectives cluster around a head adjective and refine its meaning; they are grouped with the head through similar-to relations rather than organized into direct antonym pairs. In synset names they carry the code <code>s</code> instead of <code>a</code>.</p>
</li>
<li><p><strong>Adverb (r)</strong>: Adverbs add depth to sentences by modifying verbs, adjectives, or other adverbs. They can describe how, when, where, or to what extent an action occurs, thereby adding nuance to the action or description.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[How does data scientist differ from statistician?]]></title><description><![CDATA[Data scientists and statisticians share many commonalities in their work, but there are some key differences in their focus, skill sets, and perspectives. Here are a few ways in which data scientists and statisticians differ:

Focus: Data scientists ...]]></description><link>https://textlab.razzi.my/how-does-data-scientist-differ-from-statistician</link><guid isPermaLink="true">https://textlab.razzi.my/how-does-data-scientist-differ-from-statistician</guid><category><![CDATA[statistician]]></category><category><![CDATA[data scientist]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Wed, 06 Mar 2024 13:41:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/PAykYb-8Er8/upload/0f757a30cd540639529193eabcf10dd3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data scientists and statisticians share many commonalities in their work, but there are some key differences in their focus, skill sets, and perspectives. Here are a few ways in which data scientists and statisticians differ:</p>
<ol>
<li><p><strong>Focus</strong>: Data scientists primarily focus on extracting insights and knowledge from data to drive practical decision-making and solve real-world problems. They often work with large, complex datasets and employ techniques from various disciplines, including statistics, machine learning, computer science, and domain expertise. Statisticians, on the other hand, primarily focus on designing experiments, collecting data, and analyzing it to understand the underlying patterns, relationships, and uncertainties. Their work is often centered around statistical theory and inference.</p>
</li>
<li><p><strong>Tools and Techniques</strong>: Data scientists employ a wide range of tools and techniques such as machine learning algorithms, data visualization, big data processing frameworks, and programming languages like Python or R. They leverage these tools to handle large-scale data, build predictive models, and extract insights from complex datasets. Statisticians, by contrast, typically use statistical software packages like R or SAS, applying techniques such as hypothesis testing, regression analysis, time series analysis, or experimental design.</p>
</li>
<li><p><strong>Problem-solving Approach</strong>: Data scientists are typically focused on solving practical problems and delivering actionable insights. They often work in interdisciplinary teams and collaborate with domain experts to understand the business context and formulate data-driven solutions. They are skilled in problem formulation, data preprocessing, feature engineering, model selection, and evaluation. Statisticians concentrate more on developing statistical models, designing experiments, analyzing data, and drawing valid inferences, emphasizing the interpretation and communication of statistical results as well as the underlying assumptions and limitations.</p>
</li>
<li><p><strong>Data Scale and Complexity</strong>: Data scientists often work with massive datasets, including structured, unstructured, and streaming data. They are skilled in handling big data challenges, data engineering, and distributed computing. Statisticians, while they may also work with large datasets, often deal with smaller, carefully curated datasets and place more emphasis on statistical inference and model assumptions.</p>
</li>
<li><p><strong>Business Understanding</strong>: Data scientists typically have a strong understanding of business problems and domain knowledge. They work closely with stakeholders to define the problem, identify relevant data sources, and develop solutions that align with the business goals. Statisticians, while they may also work on business problems, often have a stronger focus on statistical theory, methodology, and the mathematical foundations of statistical techniques.</p>
</li>
</ol>
<p>It's worth noting that these distinctions are not absolute, and there is significant overlap between the roles of data scientists and statisticians. Many professionals in these fields possess a combination of skills and expertise, and the specific roles and responsibilities can vary depending on the organization, industry, and project requirements.</p>
]]></content:encoded></item><item><title><![CDATA[Sentiment analysis using NLTK SentimentIntensityAnalyzer and NRC Lexicon]]></title><description><![CDATA[Download lexicon:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/upshot-trump-emolex/data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt -P data

Define processing task:...]]></description><link>https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-nrc-lexicon</link><guid isPermaLink="true">https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-nrc-lexicon</guid><category><![CDATA[nrc-lexicon]]></category><category><![CDATA[nltk]]></category><category><![CDATA[sentiment-intensity-analyzer]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Wed, 06 Mar 2024 03:10:19 GMT</pubDate><content:encoded><![CDATA[<p>Download lexicon:</p>
<pre><code class="lang-plaintext"># Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/upshot-trump-emolex/data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt -P data
</code></pre>
<p>Define processing task:</p>
<pre><code class="lang-plaintext">import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
nltk.download('punkt')

def load_emolex_lexicon():
    emolex_lexicon = {}
    with open('/content/data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt', 'r') as file:
        for line in file:
            line = line.strip()
            if line:
                try:
                    word, emotion, value = line.split('\t')
                    if emotion == 'positive' and float(value) &gt; 0.0:
                        emolex_lexicon[word] = 1
                    elif emotion == 'negative' and float(value) &gt; 0.0:
                        emolex_lexicon[word] = -1
                except ValueError:
                    pass
    return emolex_lexicon

def analyze_sentiment(text):
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = sia.polarity_scores(text)
    emolex_lexicon = load_emolex_lexicon()
    # Look up the words of the text itself in the EmoLex lexicon
    # (polarity_scores() returns score categories, not the text's words)
    words = [w.lower() for w in nltk.word_tokenize(text)]
    emolex_scores = {
        'pos': sum(1 for w in words if emolex_lexicon.get(w) == 1),
        'neg': sum(1 for w in words if emolex_lexicon.get(w) == -1),
        'neu': sum(1 for w in words if w not in emolex_lexicon),
        'compound': sentiment_scores['compound']
    }
    return emolex_scores

text = "This is a great day!"

scores = analyze_sentiment(text)
print(scores)
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Sentiment analysis using NLTK SentimentIntensityAnalyzer and SentiWordNet Lexicon]]></title><description><![CDATA[Get dependencies:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

Analyze text:
import nltk
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment import SentimentIntensityAnalyzer

nl...]]></description><link>https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-sentiwordnet-lexicon</link><guid isPermaLink="true">https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-sentiwordnet-lexicon</guid><category><![CDATA[sentiwordnet]]></category><category><![CDATA[sentiment-intensity-analyzer]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Tue, 05 Mar 2024 18:59:48 GMT</pubDate><content:encoded><![CDATA[<p>Get dependencies:</p>
<pre><code class="lang-plaintext">import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
</code></pre>
<p>Analyze text:</p>
<pre><code class="lang-plaintext">import nltk
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('sentiwordnet')

def penn_to_wordnet(tag):
    # nltk.pos_tag returns Penn Treebank tags (e.g. 'NN', 'VBZ', 'JJ'),
    # while senti_synsets expects WordNet POS codes ('n', 'v', 'a', 'r')
    mapping = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return mapping.get(tag[:1])

def map_sentiment(word, pos):
    wn_pos = penn_to_wordnet(pos)
    if wn_pos is None:
        return 'neu'
    try:
        senti_synsets = list(swn.senti_synsets(word, wn_pos))
        if len(senti_synsets) &gt; 0:
            sentiment = sum(
                [senti_synset.pos_score() - senti_synset.neg_score() for senti_synset in senti_synsets]
            ) / len(senti_synsets)
            if sentiment &gt; 0:
                return 'pos'
            elif sentiment &lt; 0:
                return 'neg'
    except Exception:  # unseen words or lexicon lookups may raise
        pass
    return 'neu'

def analyze_sentiment(text):
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = sia.polarity_scores(text)
    tokenized_text = nltk.word_tokenize(text)
    tagged_text = nltk.pos_tag(tokenized_text)
    sentiment_scores_swn = [
        map_sentiment(token, pos) for token, pos in tagged_text
    ]
    sentiment_scores_combined = {
        'pos': sentiment_scores['pos'] + sentiment_scores_swn.count('pos'),
        'neg': sentiment_scores['neg'] + sentiment_scores_swn.count('neg'),
        'neu': sentiment_scores['neu'] + sentiment_scores_swn.count('neu'),
        'compound': sentiment_scores['compound']
    }
    return sentiment_scores_combined

text = "This is a great day!"
scores = analyze_sentiment(text)
print(scores)
</code></pre>
<p>Output (from one run; exact values may vary with the installed NLTK and lexicon versions):</p>
<pre><code class="lang-plaintext">{'pos': 0.594, 'neg': 0.0, 'neu': 6.406, 'compound': 0.6588}
</code></pre>
<p>Conclusion:</p>
<p>The above code scores the text "This is a great day!" using both the SentimentIntensityAnalyzer and SentiWordNet. Reading the combined output:</p>
<ul>
<li><p><strong>pos / neg</strong>: VADER's fractional proportions plus the count of tokens that SentiWordNet classifies as positive or negative. Here the negative component is 0.0, while the positive component is driven by the word "great".</p>
</li>
<li><p><strong>neu</strong>: VADER's neutral proportion plus the count of neutral tokens. This value is comparatively large because most tokens in the sentence (articles, auxiliaries, punctuation) carry no sentiment.</p>
</li>
<li><p><strong>compound</strong>: taken from VADER alone. It ranges from -1 to 1, and the value of 0.6588 indicates a clearly positive sentiment overall.</p>
</li>
</ul>
<p>Because these combined scores mix proportions with raw counts, they are not normalized probabilities; treat them as a rough, interpretable signal rather than calibrated values.</p>
<p>Please note that sentiment analysis is not always perfect and can vary depending on the context and the specific algorithms or lexicons used.</p>
]]></content:encoded></item><item><title><![CDATA[Sentiment analysis using NLTK SentimentIntensityAnalyzer and AFINN Lexicon]]></title><description><![CDATA[from nltk.sentiment import SentimentIntensityAnalyzer

# Create an instance of the SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Load the AFINN-111.txt file
afinn_file = "/content/AFINN-en-165.txt"  # Replace with the actual path t...]]></description><link>https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-afinn-lexicon</link><guid isPermaLink="true">https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-afinn-lexicon</guid><category><![CDATA[sentiment-intensity-analyzer]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Tue, 05 Mar 2024 18:55:33 GMT</pubDate><content:encoded><![CDATA[<pre><code class="lang-plaintext">from nltk.sentiment import SentimentIntensityAnalyzer

# Create an instance of the SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Load the AFINN lexicon
afinn_file = "/content/AFINN-en-165.txt"  # Replace with the actual path to your AFINN lexicon file

# Read the AFINN file and create a dictionary of word-score pairs
afinn_scores = {}
with open(afinn_file, encoding="utf8") as file:
    for line in file:
        word, score = line.strip().split("\t")
        afinn_scores[word] = int(score)

# Update the analyzer's lexicon with AFINN scores
sia.lexicon.update(afinn_scores)

# Example text
text = "I love using NLTK for natural language processing!"

# Perform sentiment analysis using the AFINN lexicon
sentiment = sia.polarity_scores(text)

# Print the sentiment scores
print(sentiment)
</code></pre>
<p>Output:</p>
<pre><code class="lang-plaintext">{'neg': 0.0, 'neu': 0.443, 'pos': 0.557, 'compound': 0.7424}
</code></pre>
<p>The sentiment scores provided for the text <code>"I love using NLTK for natural language processing!"</code> using the AFINN sentiment analyzer with NLTK are as follows:</p>
<ul>
<li><p>Negative sentiment score (neg): 0.0</p>
</li>
<li><p>Neutral sentiment score (neu): 0.443</p>
</li>
<li><p>Positive sentiment score (pos): 0.557</p>
</li>
<li><p>Compound sentiment score (compound): 0.7424</p>
</li>
</ul>
<p>The compound score is a normalized value that combines the positive and negative scores to give an overall sentiment score. In this case, the compound score of 0.7424 indicates a highly positive sentiment for the given text.</p>
]]></content:encoded></item><item><title><![CDATA[Sentiment analysis using NLTK SentimentIntensityAnalyzer and VADER Lexicon]]></title><description><![CDATA[The NLTK (Natural Language Toolkit) is a popular Python library for natural language processing tasks. While NLTK provides various functionalities for text processing and analysis, it does not directly include a sentiment analysis module.
However, NL...]]></description><link>https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-vader-lexicon</link><guid isPermaLink="true">https://textlab.razzi.my/sentiment-analysis-using-nltk-sentimentintensityanalyzer-and-vader-lexicon</guid><category><![CDATA[sentiment-intensity-analyzer]]></category><category><![CDATA[nltk]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Tue, 05 Mar 2024 18:38:18 GMT</pubDate><content:encoded><![CDATA[<p>The NLTK (Natural Language Toolkit) is a popular Python library for natural language processing tasks. While NLTK provides various functionalities for text processing and analysis, it does not ship a pre-trained machine-learning sentiment classifier out of the box.</p>
<p>However, NLTK can be used in conjunction with other libraries or models to perform sentiment analysis. One common approach is to use NLTK in combination with a pre-trained sentiment analysis model, such as the VADER (Valence Aware Dictionary and sEntiment Reasoner) model.</p>
<p>VADER is a rule-based sentiment analysis model that is specifically designed for sentiment analysis on social media text, where conventional techniques may not perform well. NLTK provides an interface to utilize the VADER model for sentiment analysis.</p>
<p>Here is an example of how you can perform sentiment analysis using NLTK with the VADER model:</p>
<pre><code class="lang-plaintext">import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon (if not already downloaded)
nltk.download('vader_lexicon')

# Create an instance of the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Example text
text = "I love using NLTK for natural language processing!"

# Perform sentiment analysis
sentiment = sia.polarity_scores(text)

# Print the sentiment scores
print(sentiment)
</code></pre>
<p>Output:</p>
<pre><code class="lang-plaintext">{'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.7901}
</code></pre>
<p>The <code>polarity_scores()</code> method of the <code>SentimentIntensityAnalyzer</code> class returns a dictionary with sentiment scores for the input text. The scores include positive, negative, neutral, and compound values.</p>
<p>The sentiment analysis results are in the form of a dictionary with four key-value pairs:</p>
<ul>
<li><p><strong>'neg': 0.0</strong>: This indicates the negative sentiment score for the input text. In this case, the score is 0.0, which means there is no negative sentiment detected in the text.</p>
</li>
<li><p><strong>'neu': 0.417</strong>: This represents the neutral sentiment score. The score of 0.417 suggests that a significant portion of the text is considered neutral in terms of sentiment.</p>
</li>
<li><p><strong>'pos': 0.583</strong>: This corresponds to the positive sentiment score. With a value of 0.583, it suggests that a considerable portion of the text is classified as positive sentiment.</p>
</li>
<li><p><strong>'compound': 0.7901</strong>: The compound score is a normalized, aggregated sentiment score that combines the positive, negative, and neutral scores. The compound score ranges from -1 to 1, where values closer to 1 indicate highly positive sentiment, and values closer to -1 indicate highly negative sentiment. In this case, the compound score of 0.7901 suggests a strongly positive sentiment overall.</p>
</li>
</ul>
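<p>For intuition about the compound value: VADER sums the valence scores of the recognized lexicon words and squashes that unbounded sum into the range -1 to 1. NLTK's VADER implementation normalizes a raw score x as x / sqrt(x*x + alpha) with alpha = 15. A minimal sketch of that squashing step:</p>
<pre><code class="lang-plaintext">import math

def vader_normalize(score, alpha=15):
    # Squash an unbounded sum of word valences into the range (-1, 1).
    # This mirrors the normalize() helper in NLTK's vader module.
    return score / math.sqrt(score * score + alpha)

print(vader_normalize(0.0))   # a neutral text stays at 0.0
print(vader_normalize(3.2))   # a positive sum maps to a positive value below 1
</code></pre>
<p>Larger raw sums push the result asymptotically toward the extremes, which is why even short, strongly worded sentences can produce compound scores close to 1 or -1.</p>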
<p>Based on the sentiment analysis results, the text appears to have a predominantly positive sentiment.</p>
<p>Please note that NLTK's sentiment analysis using VADER is a rule-based approach and may not always be suitable for all scenarios. Depending on your specific requirements, you might explore other sentiment analysis models or libraries such as TextBlob, spaCy, or custom-trained machine learning models.</p>
<p>The SentimentIntensityAnalyzer class in NLTK's nltk.sentiment module uses the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon by default. However, NLTK also allows you to use other lexicons with the SentimentIntensityAnalyzer class. Here are a few lexicons that you can use:</p>
<ul>
<li><p><strong>AFINN</strong>: The AFINN lexicon is a list of pre-computed sentiment scores for English words. Each word in the lexicon is assigned a score between -5 and +5, indicating its sentiment polarity. Positive scores indicate positive sentiment, while negative scores indicate negative sentiment.</p>
</li>
<li><p><strong>SentiWordNet</strong>: SentiWordNet is a lexical resource that assigns sentiment scores to synsets (sets of synonymous words) in WordNet, a large lexical database for English. It provides sentiment scores for individual words based on their senses and part-of-speech tags.</p>
</li>
<li><p><strong>EmoLex</strong>: The EmoLex lexicon is a collection of English words associated with eight basic emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. Each word in the lexicon is labeled with one or more emotion categories.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Word Count vs Word Freq]]></title><description><![CDATA[In the context of text analysis, "count" and "frequency" are terms used to describe different ways of representing the occurrence of words or terms within a given text or corpus.

Count: Count refers to the actual number of occurrences of a specific ...]]></description><link>https://textlab.razzi.my/word-count-vs-word-freq</link><guid isPermaLink="true">https://textlab.razzi.my/word-count-vs-word-freq</guid><category><![CDATA[counts-vs-frequencies]]></category><dc:creator><![CDATA[Mohamad Mahmood]]></dc:creator><pubDate>Thu, 15 Feb 2024 03:13:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/vVSLLxc5pY8/upload/303ecc9324b6a87ad3ee6126edd0438d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the context of text analysis, "count" and "frequency" are terms used to describe different ways of representing the occurrence of words or terms within a given text or corpus.</p>
<ol>
<li><p>Count: <strong><mark>Count refers to the actual number of occurrences of a specific word or term in a text or corpus</mark></strong>. For example, if the word "apple" appears three times in a document, the count of "apple" would be three. Counts provide a raw measure of how many times a word occurs, without considering the relative importance or significance of the word.</p>
</li>
<li><p>Frequency: <mark>Frequency, on the other hand, is a normalized measure that indicates the proportion or percentage of times a specific word or term appears in a text or corpus.</mark> It is calculated by dividing the count of a word by the total number of words in the text or corpus. Frequencies provide a relative measure that allows for comparisons between different words or terms.</p>
</li>
</ol>
<p>For instance, if a document contains 100 words and the word "apple" appears 5 times, the frequency of "apple" would be 5/100 = 0.05 or 5%. This frequency value represents the relative importance or prevalence of the word "apple" in the document.</p>
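<p>The distinction is easy to see in code. The sketch below (plain Python, no external libraries) computes both measures for a tiny document:</p>
<pre><code class="lang-plaintext">from collections import Counter

tokens = "the apple is red and the apple is sweet".split()

counts = Counter(tokens)          # raw number of occurrences per word
total = sum(counts.values())      # total number of tokens (here, 9)
freqs = {word: count / total for word, count in counts.items()}

print(counts["apple"])            # count: 2
print(round(freqs["apple"], 3))   # frequency: 2/9, about 0.222
</code></pre>
<p>Counts grow with document length, while frequencies stay comparable across documents of different sizes, which is why frequencies are preferred for cross-document comparison.</p>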
<p>Frequencies are often used in text analysis tasks such as document classification, topic modeling, and information retrieval. They help identify significant terms, keywords, or topics in a corpus based on their relative presence. Additionally, frequencies can be used to identify stopwords (commonly used and less informative words) or to calculate term weighting measures like Term Frequency-Inverse Document Frequency (TF-IDF), which further emphasize the importance of terms in a collection of documents.</p>
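<p>To illustrate how frequencies feed into term weighting, here is a minimal TF-IDF sketch (plain Python; documents are represented as lists of tokens, and the term is assumed to occur in at least one document of the corpus):</p>
<pre><code class="lang-plaintext">import math

def tf_idf(term, doc, corpus):
    # term frequency: proportion of this document's tokens equal to the term
    tf = doc.count(term) / len(doc)
    # inverse document frequency: terms that are rare across the corpus weigh more
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [["the", "apple", "is", "red"],
          ["the", "apple", "is", "sweet"],
          ["the", "banana", "is", "yellow"]]

print(tf_idf("the", corpus[0], corpus))     # 0.0: "the" appears in every document
print(tf_idf("banana", corpus[2], corpus))  # positive: "banana" is distinctive
</code></pre>
<p>"the" occurs in all three documents, so its IDF is log(3/3) = 0 and its weight vanishes, while "banana" is specific to one document and receives a positive weight.</p>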
<p>In summary, "count" refers to the actual number of occurrences of a word, while "frequency" represents the proportion or percentage of times a word appears in relation to the total number of words in a text or corpus.</p>
]]></content:encoded></item></channel></rss>