Style Classification of Rabbinic Literature for Detection of Lost Midrash Tanhuma Material

Midrash collections are complex rabbinic works that consist of text in multiple languages and that evolved through long processes of unstable oral and written transmission. Determining the origin of a given passage in such a compilation is not always straightforward and is often disputed by scholars, yet it is essential for understanding the passage and its relationship to other texts in the rabbinic corpus. To help solve this problem, we propose a system for classification of rabbinic literature based on its style, leveraging recently released pretrained Transformer models for Hebrew. Additionally, we demonstrate how our method can be applied to uncover lost material from the Midrash Tanhuma.

We propose a system for the classification of rabbinic literature. This system detects unique stylistic patterns in the language of the text and can help uncover lost midrashic material quoted in later works. As a test case, we use our method to detect lost sections of the Midrash Tanḥuma that are quoted in the Yalkut Shimoni. 1

II RELATED WORK
In recent years, advancements in natural language processing (NLP) and machine learning (ML) have greatly expanded the toolkit available for tasks such as authorship attribution, plagiarism detection, and style classification. These tools have been successfully employed in a variety of contexts, from the analysis of contemporary texts to the examination of historical documents.
In the broad context of textual analysis, Juola [2008] provides a thorough review of computational methods for authorship attribution and style classification and of their applicability to different types of texts.
The application of these techniques to biblical texts has seen particular innovation in recent years. Dershowitz et al. [2015] introduced a method for automatic biblical source criticism, examining preferences among synonyms and other stylistic attributes; technical details may be found in Koppel et al. [2011]. This approach laid a foundation for using stylistic analysis in the context of classical Hebrew texts. Building on that work, Akiva and Koppel [2013] developed an unsupervised algorithm for decomposing multi-author documents, further reinforcing the applicability of NLP and ML models in the field of authorship attribution. Siegal and Shmidman [2018] applied computational tools to reconstruct Mekhilta Deuteronomy, a lost midrash halakha from the school of Rabbi Akiva. Although their research shares a common goal with ours, their approach begins with a list of candidate texts and primarily focuses on eliminating quotes or near-quotes of existing material from other sources. In contrast, our work addresses the problem of generating an initial candidate list for a specific genre.
From a methodological perspective, our research also bridges the typically separate approaches of text reuse detection and stylometry (style detection). While text reuse detection predominantly focuses on content words and semantics, stylometry often concentrates on function words and other habitual linguistic choices that an author may make subconsciously. By combining these two methods, our study offers a unique perspective on text analysis. Forstall and Scheirer [2019] provide an insightful discussion of this topic while surveying the various computational tools used for the study of intertextuality.
Finally, the synergy between technology and humanities research is increasingly appreciated. A significant example is the Ithaca project [Assael et al., 2022], a joint venture between DeepMind and the University of Oxford, which focuses on the restoration and classification of ancient Greek epigraphs. This project demonstrates the potential of such interdisciplinary collaboration and offers a model for similar initiatives. This collaboration model resonated with our approach and influenced how we conducted our research.
Figure 1: The text-reuse engine, RWFS, shows how a medieval midrash paragraph reuses early material from various sources, including Midrash Tanḥuma.

Dataset
Our training dataset was extracted from Sefaria's resources. 2 We use the raw text files and divide them into the following categories of rabbinic texts:

Mishnah.
In this category we include all tractates of the Mishnah and the Tosefta. Both collections are generally dated to the second century CE and consist of rabbinic rulings and debates, organized by topic.

Midrash Halakhah.
These collections are dated to around the same time as the Mishnah, but they are organized according to the Pentateuch and focus more on the exegesis of biblical verses. In this class we include: Mekhilta d'Rabbi Yishmael, Mekhilta d'Rashbi, Sifra, Sifre Numbers, and Sifre Deuteronomy.

Jerusalem Talmud.
We include all tractates of the Jerusalem Talmud, omitting the Mishnah passages that provide the basis for discussion. These texts are for the most part written in a mixture of Hebrew and Palestinian Aramaic and are roughly dated to the 4th c. CE.

Babylonian Talmud.
We include all tractates of the Babylonian Talmud, omitting the Mishnah passages that provide the basis for discussion. These texts are for the most part written in a mixture of Hebrew and Babylonian Aramaic and are roughly dated to the 5th c. CE.

Midrash Aggadah.
In this category we include early midrash works assumed to have been composed during the amoraic period (up to the 5th c.) or slightly later. The works included in training are: Genesis Rabbah, Leviticus Rabbah, and Pesikta de-Rav Kahana. Like midrash halakhah, these works follow the order of verses in the Bible, but in contrast they focus less on deriving rulings (halakhah) and more on expounding on the biblical narrative. Other works that we did not use during training but which we partially associate with this category include: Ruth Rabbah, Lamentations Rabbah, and Canticles Rabbah.

Midrash Tanḥuma.
In this category we include later midrashic works that make up what is referred to as Tanḥuma-Yelammedenu Literature. The works included in training are: Midrash Tanḥuma, Tanḥuma Buber, and Deuteronomy Rabbah. Other works that we did not use during training but which we partially associate with this category include Exodus Rabbah starting from Section 15 3 and Numbers Rabbah starting from Section 15. 4

We divide these works into continuous blocks of 50 words. We then clean the text by removing vowel signs, punctuation, and metadata. In order to neutralize the effect of orthographic differences, we also expand common acronyms and standardize spelling for common words and names.
After cleaning and normalizing the data, we split our dataset into training (80%) and validation (20%) sets. Finally, we downsample all majority classes in the validation set to get a balanced dataset.
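The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the purely random (non-stratified) split, and the choice to discard every non-Hebrew character are assumptions, and acronym expansion and spelling normalization are omitted.

```python
import random
import re
from collections import defaultdict

# Hebrew cantillation marks and vowel points occupy U+0591-U+05C7;
# the letters themselves (U+05D0-U+05EA) are left untouched.
DIACRITICS = re.compile(r"[\u0591-\u05C7]")
# Assumption: anything that is not a Hebrew letter or whitespace is noise.
NON_LETTERS = re.compile(r"[^\u05D0-\u05EA\s]")


def clean(text):
    """Remove vowel signs, punctuation, and other non-letter characters."""
    text = text.replace("\u05BE", " ")   # maqaf joins two words: treat as a space
    text = DIACRITICS.sub("", text)      # delete marks that sit inside words
    text = NON_LETTERS.sub(" ", text)    # replace punctuation, digits, markup
    return " ".join(text.split())        # collapse runs of whitespace


def chunk(text, size=50):
    """Split a text into consecutive blocks of `size` words (last may be shorter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def split_and_balance(samples, val_fraction=0.2, seed=0):
    """Split (text, label) pairs 80/20, then downsample validation classes.

    Every class in the validation set is cut down to the size of the
    smallest class present there; the training set is left unbalanced.
    """
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    val, train = shuffled[:n_val], shuffled[n_val:]
    by_label = defaultdict(list)
    for text, label in val:
        by_label[label].append((text, label))
    smallest = min(len(items) for items in by_label.values())
    balanced = [s for items in by_label.values() for s in items[:smallest]]
    return train, balanced
```

Leaving the training set unbalanced matches the text, which balances only the validation set and handles training imbalance through class weighting in the model.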

Models
Baseline. For our baseline model we use a logistic regression model over a bag-of-n-grams encoding. We include unigrams, bigrams, and trigrams. We use the default parameters from scikit-learn [Pedregosa et al., 2011] but set fit_intercept=False to reduce the impact of varying text length and set class_weight="balanced" to deal with class imbalance in the training data. This type of model is highly interpretable, enabling us to see the features associated with each class. We choose this model as our baseline because it generally achieves reasonable results without the need to tune hyperparameters.
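To make the bag-of-n-grams encoding concrete, here is a small pure-Python sketch of the feature extraction. The function name and the use of raw counts (rather than binary or weighted features) are assumptions; the text does not say which vectorizer was used. In scikit-learn, CountVectorizer with ngram_range=(1, 3) builds a comparable encoding, although its default tokenizer differs from the plain whitespace split used here.

```python
from collections import Counter


def ngram_counts(text, n_min=1, n_max=3):
    """Count all word n-grams of length n_min..n_max in a whitespace-tokenized text."""
    words = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


features = ngram_counts("a b a b c")
# A 5-word text yields 5 unigrams, 4 bigrams, and 3 trigrams.
print(sum(features.values()), features["a b"])
```

Each distinct n-gram becomes one column of the feature matrix, and the logistic regression learns one weight per column and class.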
AlephBERT. The next model we evaluate is AlephBERT [Seker et al., 2022], a Transformer model trained with the masked-token prediction objective on modern Hebrew texts. While this model obtains state-of-the-art results for various tasks on modern Hebrew, its performance might not be ideal on rabbinic Hebrew, which differs significantly from modern Hebrew. We fine-tune the pretrained model on the downstream task using the Huggingface Transformers framework [Wolf et al., 2020] for sequence classification, using the default parameters for three epochs.
BEREL. The third model we evaluate is BEREL [Shmidman et al., 2022], a Transformer model with an architecture similar to that of BERT-base [Devlin et al., 2019], trained on rabbinic Hebrew texts. In addition to the potential benefit of using a model that was pretrained on texts similar to those of the target domain, BEREL also uses a modified tokenizer that does not split up acronyms that would otherwise be interpreted as multiple tokens with punctuation marks in between. (Acronyms marked by double apostrophes [or the like] are very common in rabbinic Hebrew.) We fine-tune the pretrained model on our downstream task in the same way as the AlephBERT model.
Morphological. Finally, we also train a model that focuses only on morphological features in the text, in an attempt to neutralize the impact of content words. We expect this type of model to detect more "pure" stylistic features that help discriminate between the different textual sources. To extract features from the text, we use a morphological engine for rabbinic Hebrew created by DICTA. 5 We then train a logistic regression model over an aggregation of all morphological features that appear in a given paragraph.

Text Reuse Detection
To achieve our end goal of detecting lost Tanḥuma material, we combine our style classification model with a filtering algorithm based on text reuse detection.
For reuse detection, we utilize RWFS (Rolling Window Fuzzy Search) by Schor et al. [2021]. 6 RWFS uses fuzzy full-text search on windows of n-grams. The system is built on top of a Lucene index, 7 and its web-based interface makes querying easy for technical and non-technical users alike.
As the corpus for this engine, we use all biblical and early rabbinic works available on Sefaria. We use 3-gram matching and permit a Levenshtein distance of up to 2 for each individual word in the n-gram. The match score for each retrieved document is given by the number of n-gram matches divided by the length of the query, and the results are sorted accordingly (Figure 1).
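The scoring rule described above can be illustrated with a brute-force sketch. This is not RWFS itself, which relies on a Lucene index; the function names are hypothetical, and normalizing by the number of query n-grams (rather than the number of query words) is an assumption about what "length of the query" means.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def ngrams(words, n=3):
    """All consecutive word n-grams, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def fuzzy_equal(g1, g2, max_dist=2):
    """Two n-grams match if every aligned word pair is within max_dist edits."""
    return all(levenshtein(w1, w2) <= max_dist for w1, w2 in zip(g1, g2))


def match_score(query, document, n=3, max_dist=2):
    """Fraction of the query's n-grams that fuzzily occur in the document."""
    q_grams = ngrams(query.split(), n)
    d_grams = ngrams(document.split(), n)
    if not q_grams:
        return 0.0
    hits = sum(any(fuzzy_equal(q, d, max_dist) for d in d_grams) for q in q_grams)
    return hits / len(q_grams)


print(match_score("alpha bravo charlie delta", "alpha brave charlie zzzzzzzzzz"))
```

Note that a per-word tolerance of 2 edits is generous for very short words, so requiring all three words of an n-gram to match is what keeps spurious hits down.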

Detecting Lost Tanḥuma Candidates
Tanḥuma-Yelammedenu Literature is a name given to a genre of late midrash works, some of which are lost and only scarcely preserved in anthologies or Genizah fragments [Bregman, 2003, Nikolsky and Atzmon, 2021]. One of the lost works was called "Yelammedenu," and we know about it because it is cited in various medieval rabbinic works such as Yalkut Shimoni and the Arukh. 8 While lost Tanḥuma material is explicitly cited in some works, it is often quoted without citation in other midrash anthologies.
To find candidates for "lost" Tanḥuma passages, we apply the following process:
1. Extract all passages from the given midrash collection, in our case Yalkut Shimoni.
2. Split long passages into segments of up to 50 words.
3. Run these segments through the style detection model.
4. Collect segments for which our model gives the highest score to the Tanḥuma class.
5. Run these segments through a text-reuse engine.
6. Keep only segments that do not have a well-established source. (Our threshold was #n-gram matches ≤ 0.2 × #n-grams in query. 9 )
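The six steps can be sketched as a single filtering function. This is an illustrative outline only: `classify` and `reuse_score` are hypothetical callables standing in for the style classifier and the text-reuse engine, and the class label "Tanhuma" is a placeholder name.

```python
def find_lost_candidates(passages, classify, reuse_score,
                         target="Tanhuma", max_words=50, max_reuse=0.2):
    """Return segments that look like `target` but lack a well-established source.

    classify(segment)    -> dict mapping class name to score (hypothetical)
    reuse_score(segment) -> fraction of the segment's n-grams matched in
                            known sources, between 0 and 1 (hypothetical)
    """
    candidates = []
    for passage in passages:                              # step 1
        words = passage.split()
        for i in range(0, len(words), max_words):         # step 2
            segment = " ".join(words[i:i + max_words])
            scores = classify(segment)                    # step 3
            if max(scores, key=scores.get) != target:     # step 4
                continue
            if reuse_score(segment) <= max_reuse:         # steps 5 and 6
                candidates.append(segment)
    return candidates
```

Running the classifier before the reuse engine means the slower fuzzy search is only invoked on segments that already pass the style filter.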

IV RESULTS
As can be seen in Table 1, the BEREL-based model leads by a significant margin. However, we encountered multiple challenges when using this model for inference on paragraphs from Yalkut Shimoni:
1. The model's scores were not calibrated: most predictions were very close to 1.0 or 0.0, making it hard to experiment with different thresholds.
2. BEREL's accuracy on a small sample of paragraphs from Yalkut Shimoni was significantly lower than the corresponding validation accuracy. It seems that BEREL might have relied on some orthographic features that appeared in the training and validation sets but not in the new out-of-distribution text.
3. Transformer-based models are generally less interpretable and have higher inference costs than classical ML models such as logistic regression.
For these reasons, we decided to use our baseline model for inference on Yalkut Shimoni.
Figure 2: [...] (2) predicted class frequencies for passages with high text reuse score; (3) predicted frequencies for passages with low reuse score.
In Figure 3, we can see that the most common error is confusing 'Tanḥuma' with 'Midrash Aggadah', which are indeed relatively similar genres. On the other hand, 'Babylonian Talmud' and 'Jerusalem Talmud' seem to be the most distinct classes, perhaps due to their extensive use of Aramaic in addition to Hebrew, each in its own unique dialect.
After running the whole of Yalkut Shimoni on the Pentateuch through the process described in Section 3.4, we can analyze the prevalence of each class in the collection. As can be seen in Figure 2, the Babylonian Talmud is the most quoted class, while the Jerusalem Talmud is rarely, if ever, quoted. Our classifier gives a similar distribution to that of the text-reuse engine. However, when looking only at passages with a low reuse score, we see that the Babylonian Talmud rarely appears, while 'Tanḥuma' becomes the most frequent predicted class by far, followed by 'Midrash Halakha.' This aligns with the fact that we know of lost works that belong to these categories, while the Babylonian Talmud was well preserved throughout the generations as the core text of the rabbinic tradition.
To evaluate our classifier on the target task, we sampled a random set of 50 items classified as Tanḥuma for manual labeling. A midrash expert analyzed these passages and looked them up in the early print edition of Yalkut Shimoni, which tends to include citations in the margins. Sections that were ascribed to Yelammedenu and sections that were recognized as being typical Tanḥuma material were labeled as "positive," while all other passages were labeled "negative." Out of these items, 22 were cited as Yelammedenu, while an additional 8 were recognized as typical Tanḥuma material from lost sources, 10 yielding an approximate precision of 60%.
From Figure 4, we see that the precision grows monotonically with the decision threshold, indicating that the model is useful in recovering lost Tanḥuma material. Furthermore, we see that we can achieve a precision of approximately 80% by setting an appropriate decision threshold without a high cost to recall.
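The trade-off between precision and recall across decision thresholds can be computed as in the sketch below. The scores in the example are made up for illustration and are not the paper's data; the function name is hypothetical, and recall here is measured only relative to the manually labeled sample.

```python
def precision_recall_at(threshold, scored_labels):
    """Precision and recall when keeping only items scored at or above threshold.

    scored_labels: list of (score, is_positive) pairs from manual labeling.
    Returns None for a value whose denominator is zero.
    """
    selected = [is_pos for score, is_pos in scored_labels if score >= threshold]
    total_pos = sum(is_pos for _, is_pos in scored_labels)
    true_pos = sum(selected)
    precision = true_pos / len(selected) if selected else None
    recall = true_pos / total_pos if total_pos else None
    return precision, recall


# Hypothetical labeled sample: (classifier score, expert says "positive").
sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
for t in (0.5, 0.75):
    print(t, precision_recall_at(t, sample))

# The overall precision reported in the text: 22 + 8 positives out of 50 items.
print((22 + 8) / 50)
```

Raising the threshold discards low-scoring items, which typically trades some recall for higher precision, as in the curve the text describes.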

Findings
We used the methodology described above to investigate thoroughly the makeup of Yalkut Shimoni on Deuteronomy. This investigation produced some interesting findings, as well as some open questions.
A systematic expert examination of all results for the first half of Deuteronomy (approximately 10% of the Pentateuch) revealed that all known citations of Yelammedenu, ranging from 100 to 600 words, had at least one sub-paragraph of 50 words recognized as part of the genre. In most cases, the majority of the citation was also identified as part of the genre. In practical terms, this means that every known citation was detected, giving a recall of 100% at the level of citations.
Interestingly, one paragraph that was detected as "lost Tanḥuma" material was actually cited as Deuteronomy Rabbah in the early print version of Yalkut Shimoni. However, our version of Deuteronomy Rabbah had a very low text reuse match for this paragraph. This result raises the question of whether the author of Yalkut Shimoni had a different version of the text from the one we have. 11
Another notable finding is that some of the lost midrash collections known only from Ashkenaz (e.g., אבכיר, אספה, דברים זוטא) got a very high score for Tanḥuma style. This might hint that there is a stronger connection between these works and the Tanḥuma literature than previously thought, and perhaps they should be considered as part of the same genre as Tanḥuma in some contexts. 13
Finally, there were a number of paragraphs from Sifre Deuteronomy, a midrash halakha of the tannaitic period, that were detected by our classifier as Tanḥuma. One such paragraph (Sifre Deuteronomy 26) contained some notable phrases associated with Tanḥuma and other later midrashic works, including זהו שׁאמר הכתוב ("As it is said in Scripture") and הקדושׁ ברוך הוא ("The holy one, blessed be He"). 14 As it turns out, in one of the manuscripts (Vatican manuscript 32) some of these terms do not appear. This phenomenon might suggest that over the course of time some terms from later periods, such as those of the Tanḥuma literature, made their way into our current versions of earlier texts.

V USER TOOLS
In order to provide access to our model's predictions and corresponding explanations, and to turn our research into a tool that can assist midrash scholars, we built an interactive application based on the open-source Streamlit platform to wrap our model's inference process. Given an input paragraph, the app displays the score for each class along with the contribution of each feature (unigram, bigram, or trigram) to that score (Figure 5).
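For a linear model without an intercept, such per-feature contributions have a simple form: each n-gram contributes its count multiplied by the class's learned weight, and the class score is the sum of these terms. The sketch below illustrates this idea; it is an assumption about how such explanations can be derived rather than a description of the app's code, and the weights shown are hypothetical.

```python
from collections import Counter


def feature_contributions(text, weights, n_max=3):
    """Per-n-gram contribution (count * weight) to one class's linear score.

    weights: dict mapping an n-gram string to that class's coefficient
             (hypothetical values; a trained model would supply them).
    Returns (n-gram, contribution) pairs, largest absolute value first.
    """
    words = text.split()
    counts = Counter(" ".join(words[i:i + n])
                     for n in range(1, n_max + 1)
                     for i in range(len(words) - n + 1))
    contributions = {g: c * weights[g] for g, c in counts.items() if g in weights}
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical coefficients for a single class.
example_weights = {"a b": 1.5, "c": -0.5, "z": 9.0}
print(feature_contributions("a b c a b", example_weights))
```

N-grams that are absent from the input contribute nothing, regardless of how large their weight is.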
Additionally, as can be seen in Figure 6, we display the contribution of the various parts of the text to the prediction in a more convenient way by highlighting the important features in the text.
[...] tend to rework early material more extensively than does Yalkut Shimoni on the Pentateuch.
13 The strong correlation of these texts with the Tanḥuma genre has been validated by the only comprehensive study of these texts, as documented by Geula [2006]. The fact that our method has highlighted a largely unrecognized phenomenon within the Humanities field underscores its significant practical value for scholars.
14 As opposed to the prevalent use of המקום (lit. "The Place") in the tannaitic period, for example, as a metonym for God.

Conclusion
In conclusion, our method for detecting Tanḥuma sections in Yalkut Shimoni showcases its potential as a valuable tool for scholars involved in the recovery of lost rabbinic material. This is particularly relevant in the ongoing initiative to develop a digital library of Tanḥuma-Yelammedenu Literature, where our work has significant implications. The tools and classifiers we have developed in this study will prove useful for midrash researchers engaged in compiling various Tanḥuma sources and discovering related and potentially lost material of this genre. These resources are publicly accessible, 15 fostering their widespread future use in related projects.
Beyond its immediate application, our method offers potential for wider use in Jewish studies. One exciting future direction is to examine the baraitot 16 that appear in the Babylonian Talmud and in the Jerusalem Talmud. Investigating their interrelationships and connections with other tannaitic sources could provide fresh insights into these traditions.
Additionally, the suggested method could be applied to the many unorganized and unstudied manuscripts discovered in collections such as the Cairo Geniza, 17 paving the way for their automatic classification. Despite the challenges posed by the noisy text created by current engines for handwritten text recognition, the potential benefits to the academic community in terms of improved access and understanding of these vital documents are considerable.
15 Online at https://github.com/shlomota/Midrash-Style-Classification
16 A tannaitic tradition not incorporated in the Mishnah, see: "Baraita," The Jewish Encyclopedia.
17 Online at https://fgp.genizah.org.