This notebook explores the capability of machine learning algorithms to distinguish between essays written by humans and those generated by Large Language models.
Hypothesis: Certain linguistic and structural patterns unique to AI-generated text can be identified and used for classification. We anticipate that our analysis will reveal distinct characteristics in AI-generated essays, enabling us to develop an effective classifier for this purpose.
Content: We look at the Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, SMOG Index, Automated Readability Index (ARI), and Dale-Chall Readability Score of LLM-generated and human-written essays. The focus is on identifying distinct readability patterns characteristic of AI.
The all-MiniLM-L6-v2 model from Sentence Transformers is used. Known for its efficiency in transforming sentences and paragraphs into 384-dimensional vectors, it is particularly well suited to clustering and semantic search applications.
Flesch-Kincaid Grade Level
This test gives a U.S. school grade level; for example, a score of 8 means that an eighth grader can understand the document. The lower the score, the easier it is to read the document. The formula for the Flesch-Kincaid Grade Level (FKGL) is:
$ FKGL = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59 $
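As a rough, self-contained illustration (not the notebook's actual implementation, which may rely on a readability library), the formula can be computed with a naive vowel-group syllable counter:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of consecutive vowels (minimum 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

The syllable heuristic over- and under-counts on many English words, so scores will differ slightly from dictionary-based implementations.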
Gunning Fog Index
The Gunning Fog Index is a readability test designed to estimate the years of formal education a person needs to understand a text on the first reading. The index uses the average sentence length (i.e., the number of words divided by the number of sentences) and the percentage of complex words (words with three or more syllables) to calculate the score. The higher the score, the more difficult the text is to understand.
$ GunningFog = 0.4 \left( \frac{\text{words}}{\text{sentences}} + 100 \left( \frac{\text{complex words}}{\text{words}} \right) \right) $
The Gunning Fog Index is particularly useful for ensuring that texts such as technical reports, business communications, and journalistic works are clear and understandable for the intended audience.
Source: Wikipedia
Coleman-Liau Index
The Coleman-Liau Index is a readability metric that estimates the U.S. grade level needed to comprehend a text. Unlike other readability formulas, it relies on characters instead of syllables per word, which can be advantageous for processing efficiency. The index is calculated using the average number of letters per 100 words and the average number of sentences per 100 words.
$ CLI = 0.0588 \times L - 0.296 \times S - 15.8 $
Where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.
Source: Wikipedia
SMOG Index
The SMOG (Simple Measure of Gobbledygook) Index is a measure of readability that estimates the years of education needed to understand a piece of writing. It is calculated using the number of polysyllable words and the number of sentences. The SMOG Index is considered accurate for texts intended for consumers.
$ SMOG = 1.043 \times \sqrt{M \times \frac{30}{S}} + 3.1291 $
Where M is the number of polysyllabic words (three or more syllables) and S is the number of sentences.
Source: Wikipedia
Automated Readability Index (ARI)
The Automated Readability Index is a readability test designed to gauge the understandability of a text. The formula outputs a number that approximates the grade level needed to comprehend the text. The ARI uses character counts, which makes it suitable for texts with a standard character-per-word ratio.
$ ARI = 4.71 \times \left( \frac{\text{characters}}{\text{words}} \right) + 0.5 \times \left( \frac{\text{words}}{\text{sentences}} \right) - 21.43 $
Here, characters counts letters and digits, excluding spaces and punctuation.
Source: Wikipedia
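Because the ARI relies only on character, word, and sentence counts, it is straightforward to sketch. The following is an illustrative implementation, not the notebook's code:

```python
import re

def automated_readability_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\S+", text)
    # Characters = letters and digits only, per the usual ARI definition.
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return (4.71 * (characters / len(words))
            + 0.5 * (len(words) / len(sentences))
            - 21.43)
```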
Dale-Chall Readability Score
The Dale-Chall Readability Score is unique in that it uses a list of words that are familiar to fourth-grade American students. The score indicates how many years of schooling someone would need to understand the text. If the text contains more than 5% difficult words (words not on the Dale-Chall familiar words list), a penalty is added to the score.
$ DaleChall = 0.1579 \times \left( \frac{\text{difficult words}}{\text{total words}} \times 100 \right) + 0.0496 \times \left( \frac{\text{total words}}{\text{sentences}} \right) $
$ \text{If difficult words} > 5\%: DaleChall = DaleChall + 3.6365 $
“Difficult words” are those not on the Dale-Chall list of familiar words.
Source: Wikipedia
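The two-part formula above can be sketched as follows. Note that FAMILIAR_WORDS here is a tiny illustrative stand-in: the real Dale-Chall list contains roughly 3,000 familiar words.

```python
import re

# Illustrative stand-in for the real ~3,000-word Dale-Chall familiar-word list.
FAMILIAR_WORDS = {"the", "cat", "sat", "on", "a", "mat", "dog", "ran"}

def dale_chall_score(text: str) -> float:
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    difficult_pct = 100 * sum(w not in FAMILIAR_WORDS for w in words) / len(words)
    score = 0.1579 * difficult_pct + 0.0496 * (len(words) / len(sentences))
    if difficult_pct > 5:
        score += 3.6365  # penalty for texts with many unfamiliar words
    return score
```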
Semantic Density
Semantic Density refers to the concentration of meaning-bearing words within a text, a potential factor in differentiating between human-written and AI-generated essays. The process involves calculating the semantic density of essays by focusing on specific, meaning-rich parts of speech.
Calculating Semantic Density: The function calculate_semantic_density computes this metric as the ratio of meaning-bearing words (identified by the tags in mb_tags) to the total word count. A higher semantic density indicates a text that efficiently uses words with substantial meaning.
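A minimal sketch of such a function, assuming tokens have already been POS-tagged (e.g. with NLTK) into (word, tag) pairs. The tag set below is an assumption; the notebook's actual mb_tags may differ:

```python
# Assumed set of meaning-bearing Penn Treebank tags; the notebook's actual
# mb_tags may differ.
mb_tags = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN",
           "JJ", "JJR", "JJS", "RB", "RBR", "RBS"}

def calculate_semantic_density(tagged_tokens):
    """Ratio of meaning-bearing tokens to all tokens, given (word, tag) pairs."""
    if not tagged_tokens:
        return 0.0
    meaningful = sum(1 for _, tag in tagged_tokens if tag in mb_tags)
    return meaningful / len(tagged_tokens)
```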
Semantic Flow Variability
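One plausible reading of semantic flow variability, offered here as an assumption rather than the notebook's exact definition, is the variation in topical continuity from sentence to sentence: embed each sentence (e.g. with all-MiniLM-L6-v2) and take the standard deviation of cosine similarities between consecutive embeddings. A dependency-free sketch over precomputed embeddings:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_flow_variability(sentence_embeddings):
    # Standard deviation of similarities between consecutive sentence embeddings.
    sims = [cosine_similarity(u, v)
            for u, v in zip(sentence_embeddings, sentence_embeddings[1:])]
    mean = sum(sims) / len(sims)
    return math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
```

A low value would suggest a steady, even flow of ideas; a high value, abrupt topical shifts.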
Psycholinguistic Features
Psycholinguistic Features encompass the linguistic and psychological characteristics evident in speech and writing. These features provide insights into the writer’s or speaker’s psychological state, cognitive processes, and social dynamics. Analysis in this domain often involves scrutinizing word choice, sentence structure, and language patterns to deduce emotions, attitudes, and personality traits.
The Linguistic Inquiry and Word Count (LIWC) [3] is a renowned computerized text analysis tool that categorizes words into psychologically meaningful groups. It assesses various aspects of a text, including emotional tone, cognitive processes, and structural elements, covering categories like positive and negative emotions, cognitive mechanisms, and more.
While LIWC is typically accessible through purchase or licensing, this project employs Empath, an open-source alternative to LIWC, to conduct similar analyses. The sentence-embedding model's approach, based on contrastive learning, is key to its effectiveness: it excels at distinguishing sentence pairs from random samples, aligning closely with the study's objective of analyzing semantic flow.
Textual Entropy
The standard method for calculating entropy, outlined below, evaluates the unpredictability of each character or word based on its frequency. This approach is captured by the formula for Shannon entropy:
$\begin{aligned}
H(T) &= -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) &&\quad\text{(Shannon Entropy)}
\end{aligned}$
Shannon Entropy quantifies the level of information disorder or randomness, providing a mathematical framework to assess text complexity.
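The formula maps directly to a few lines of Python. This sketch computes character-level entropy; a word-level variant would tokenize first:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    # H(T) = -sum p(x_i) * log2 p(x_i) over character frequencies.
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```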
Syntactic Tree Patterns
Syntactic Tree Pattern Analysis: essays are parsed into syntactic trees to observe the frequency of recurring patterns, focusing on differences between AI-generated and human-written text. This process employs the Berkeley Neural Parser, part of the Self-Attentive Parser [5][6] suite. The code parses natural language texts, specifically our essay data, using Natural Language Processing (NLP) techniques.
These features collectively provide a comprehensive linguistic and structural analysis of the text, offering valuable insights into the syntactic and semantic characteristics of the processed essays.
num_sentences: Counts the total number of sentences in the text, providing an overview of text segmentation.
num_tokens: Tallies the total number of tokens (words and punctuation) in the text, reflecting the overall length.
num_unique_lemmas: Counts distinct base forms of words (lemmas), indicating the diversity of vocabulary used.
average_token_length: Calculates the average length of tokens, shedding light on word complexity and usage.
average_sentence_length: Determines the average number of tokens per sentence, indicating sentence complexity.
num_entities: Counts named entities (like people, places, organizations) recognized in the text, useful for understanding the focus and context.
num_noun_chunks: Tallies noun phrases, providing insights into the structure and complexity of nominal groups.
num_pos_tags: Counts the variety of part-of-speech tags, reflecting grammatical diversity.
num_distinct_entities: Determines the number of unique named entities, indicative of the text’s contextual richness.
average_entity_length: Calculates the average length of recognized entities, contributing to understanding the detail level of named references.
average_noun_chunk_length: Measures the average length of noun chunks, indicating the complexity and composition of noun phrases.
max_depth: Determines the maximum depth of syntactic trees in the text, a measure of syntactic complexity.
avg_branching_factor: Calculates the average branching factor of syntactic trees, reflecting the structural complexity and diversity.
total_nodes: Counts the total number of nodes in all syntactic trees, indicating the overall structural richness of the text.
total_leaves: Tallies the leaves in syntactic trees, correlated with sentence simplicity or complexity.
unique_rules: Counts the unique syntactic production rules found across all trees, indicative of syntactic variety.
tree_complexity: Measures the complexity of the syntactic trees by comparing the number of nodes to leaves.
depth_variability: Calculates the standard deviation of tree depths, indicating the variability in syntactic complexity across sentences.
The BERT-BiLSTM Classifier model combines the BERT architecture with a Bidirectional Long Short-Term Memory (BiLSTM) network, enhancing the model’s ability to understand context and sequence in text. This model integrates BERT’s transformer layers with a BiLSTM network, a dropout layer for regularization, and a fully connected linear layer with ReLU activation, culminating in a linear classification layer.
BERTBiLSTMClassifier(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0-11): 12 x BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(lstm): LSTM(768, 64, num_layers=4, batch_first=True, bidirectional=True)
(dropout): Dropout(p=0.06796649993811302, inplace=False)
(fc): Linear(in_features=128, out_features=2, bias=True)
(relu): ReLU()
)
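The printed architecture can be approximated with a short PyTorch sketch. The BERT encoder is injected as a constructor argument, and the pooling strategy and the forward ordering of dropout and ReLU around the final linear layer are assumptions, as the printout does not reveal them:

```python
import torch
import torch.nn as nn

class BERTBiLSTMClassifier(nn.Module):
    """Sketch matching the printed module structure; the BERT encoder is
    injected, and the dropout/ReLU ordering is an assumption."""

    def __init__(self, bert, hidden_size=64, num_layers=4,
                 dropout_p=0.068, num_classes=2):
        super().__init__()
        self.bert = bert
        self.lstm = nn.LSTM(768, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout_p)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # 128 -> 2, as printed
        self.relu = nn.ReLU()

    def forward(self, input_ids, attention_mask=None):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden)  # (batch, seq_len, 128)
        pooled = lstm_out[:, -1, :]      # final time step of the BiLSTM
        return self.fc(self.relu(self.dropout(pooled)))
```

In practice the encoder would be loaded with `BertModel.from_pretrained("bert-base-uncased")` and its output fed through the BiLSTM head as shown.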
A Balance of Predictive Power and Interpretability
EBMs function like a choir 🎶, where each data feature represents a unique voice. These features individually contribute to the overall prediction, akin to each voice adding to the choir’s harmony. This additive model approach ensures that the impact of each feature is distinct and quantifiable.
EBMs are an advanced form of Generalized Additive Models (GAMs). They enhance predictive power while maintaining high interpretability by combining traditional machine learning techniques with the additive structure of GAMs. This design allows for a clear understanding of the influence of individual features and their combinations on the predicted outcome.
☃ EBMs present a unique combination of high interpretability and predictive accuracy. This makes them ideal for scenarios where understanding the reasoning behind model decisions is as critical as the decisions themselves.
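To make the additive structure concrete, here is a toy additive scorer (not the interpret library's API): each feature contributes through its own shape function, so every contribution can be inspected in isolation, which is the source of an EBM's interpretability:

```python
def ebm_style_predict(x, shape_functions, intercept=0.0):
    # Additive model: score = intercept + sum of per-feature shape functions,
    # so each feature's contribution is individually quantifiable.
    contributions = {name: f(x[name]) for name, f in shape_functions.items()}
    return intercept + sum(contributions.values()), contributions
```

A real EBM learns each shape function from data via bagging and gradient boosting; the additivity shown here is what it shares with GAMs.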
The API analysis indicates a bert_predictions value of 1, suggesting that the essay was likely generated by a language model rather than written by a human. Here’s a detailed breakdown of some key features from the analysis:
The analysis also highlights the presence of various psycholinguistic features, such as emotional (joy: 0.015, sadness: 0.025, affection: 0.025), social dynamics (help: 0.006, family: 0.009), and lifestyle (exercise: 0.009, pet: 0.034), which contribute to the essay’s thematic richness.
The analysis suggests the text is more likely to have been written by a human, with a bert_prediction score of 0. This score indicates a strong likelihood of human authorship.
The essay on remotes demonstrates a structured and coherent narrative typical of human writing, characterized by a clear introduction, body, and conclusion within a concise word limit. The Flesch-Kincaid Grade of 14.2 and Gunning Fog Index of 16.02 suggest the text uses relatively complex language, which could be indicative of a human writer aiming for a specific audience or purpose. The Coleman-Liau Index of 16.01 aligns with this, indicating a higher level of education required to comprehend the text. The usage of 43 unique lemmas out of 43 total tokens, along with an average token length of 5.186, showcases a diverse vocabulary and complex word choice, further supporting the human authorship hypothesis.
The essay’s semantic density of 0.604 and the presence of 13 noun chunks with an average length of 11.538 indicate a rich use of nouns and modifiers, creating detailed descriptions within a limited word count. Additionally, the text’s maximum tree depth of 8 and an average branching factor of 4.5 in its syntactic structure suggest complex sentence constructions, typical of human writing that seeks to convey information efficiently and engagingly.
In conclusion, the combination of advanced language metrics, complex syntactic structures, and a coherent narrative structure strongly supports the likelihood of human authorship for this essay on remotes. The analysis reveals an effective use of language to convey information concisely and clearly, which is a hallmark of skilled human writing.
Relating Natural Language Aptitude to Individual Differences in Learning Programming Languages [1]
“InterpretML: A Unified Framework for Machine Learning Interpretability” (H. Nori, S. Jenkins, P. Koch, and R. Caruana, 2019) [2]
The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods [3]
“Attention Is All You Need” (A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, 2017) [4]
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics [5]
Constituency Parsing with a Self-Attentive Encoder [6]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks