Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques, morphological image operations and horizontal histogram projections. The method was developed to be applied on a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or presence of noise. For that reason, the segmenter in question could be of particular interest for cultural institutions that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results, joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the used open source OCR software.

The motivation for this initiative was threefold:

• Due to the large volume of around 6 million newspaper articles, building a computationally efficient segmenter was the main priority and source of motivation for the project. Next to running faster than BENCH, the aim was for the method to take up only a fraction of the time needed for the OCR pipeline's character recognition functionality.

• Given the historic data target, another goal was to be able to deal with the common properties of the data, such as degraded paper quality, blurred text, or presence of noise. Problems such as those seen in the example text block image depicted in Figure 1 appear repeatedly in the dataset.

• Furthermore, BENCH is advertised as a page segmenter, including layout analysis. The algorithm can however be passed a parameter indicating that a single column of text (i.e. zero column separators) is to be expected. Naturally, this was the chosen configuration for comparison purposes. Since the single-column property always holds for the presegmented text blocks in the BnL dataset, a solution designed purely for line detection within a single block was favored. The overall aim was to develop a simple algorithm that only returns a set of line bounding boxes and does not incorporate additional features, such as detection of the general text direction (left-to-right, top-to-bottom, etc.).
Having laid out the context of this work, the remainder of this paper is structured in two sections, followed by concluding statements. The first one formally defines the algorithm, while reusing the same Figure 1 example input to illustrate intermediate results. The subsequent section evaluates the method and draws the comparison to BENCH.

II ALGORITHM
The literature has shown that text line segmentation methods can essentially be classified into two categories (Mechi et al. [2019]). First, well documented by Likforman-Sulem et al. [2007], there are image processing methods, such as connected components, projections and smearing analysis. Second, there are learning-based methods, which are better suited for more complex layouts and heterogeneous documents (Kurar Barakat et al. [2020]).
Given the single-column BnL data, the interplay of morphology and histogram projections, which can lead to encouraging results (Dos Santos et al. [2009]), was retained for this work. The nature of combined techniques is also the reason why the proposed segmentation method is referred to as COMBISEG. In the following, different subsections will cover the algorithm's input requirements, the functioning of the segmentation itself and the resulting output.

Input Assumption
To define the scope of the approach and consequently clarify the input requirements, it is sufficient to state that the algorithm expects an input image fulfilling two properties:

1. The input image, in the following denoted as I_b, must be presented in a binarized form, composed of ones (text/white) and zeros (background/black) only. This paper represents I_b as a 2D array, with the height h(I_b) seen as the first, and the width w(I_b) as the second dimension. It follows that the highest indices are h(I_b)−1 and w(I_b)−1 respectively. To comply with this, BnL relies on Otsu's binarization method (Otsu [1979]) as implemented in the OpenCV library (Bradski [2000]).
2. COMBISEG is not designed to perform layout analysis (e.g. identify text blocks within a full newspaper page). That is why the second requirement is that I_b is a block of text, solely containing horizontal and typed text lines (not tested for handwriting). In the example of the BnL use case, the employed dataset was already pre-segmented into blocks of text, mostly representing paragraphs or small articles.
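As an illustration of the first input requirement, the following sketch produces the expected ones/zeros form of I_b. The fixed threshold is only a stand-in for the Otsu-derived value; the OpenCV call mentioned in the comment reflects common usage of the library, not the exact BnL code:

```python
def binarize(gray, threshold=128):
    """Turn a grayscale image (list of pixel rows, 0=black, 255=white)
    into the I_b form the algorithm expects: ones for text, zeros for
    background.

    The fixed `threshold` stands in for the Otsu-derived one; with
    OpenCV this would typically be
    cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU),
    mapping dark ink to 1 and light paper to 0.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]
```

Note the inversion: since printed text is dark on light paper, pixels below the threshold become ones (text) and the rest become zeros (background), matching the convention stated above.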

Segmentation
Given the correct input, the actual algorithm can be summed up as a sequence of three consecutive steps.
1. Morphology: First, a set of morphological image operations is applied on I_b, in order to create a new processed version, denoted as I_p.

2. Components: Next, I_p is used to search for connected components, with each component potentially already representing a single text line.

3. Histogram: Finally, a horizontal histogram projection is in turn created for each component, being the basis for possible additional text line splits.
The algorithm is also subject to an ordered set of parameters, allowing slight tuning and referred to as params = {p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8}.
Having laid out the high level plan for COMBISEG, this article will proceed with explanations regarding the precise working of the three steps involved.

Morphology
To begin with, a set of sequential image operations is applied to obtain a processed image version, with the intention that every connected pixel component represents a text line. For explanation purposes, this paper will, in the following, make use of the common image operations opening (•), dilation (⊕), addition (+), subtraction (−) and inversion (^−1). To provide an example, I_p = I_b • E_{w×h} states that image I_p is obtained by opening image I_b using structuring element E_{w×h}, which is filled with ones only, for width w and height h. Image operations are also considered to return a modified copy, rather than changing the image being operated on.
Algorithm II.1 Transforms binary I_b into processed I_p, featuring connected components as potential text lines.
Given the pseudocode seen in Algorithm II.1, the line statements can be explained as follows:

• Line 2: Considered to be a preprocessing operation. The first statement of Algorithm II.1 tries to remove long vertical and horizontal elements (borders, lines, frames, etc.) from I_b through opening, followed by subtraction. Hence, every element has to have a minimum height/width of p_1 to be considered. The main motivation here is to deal with the potentially cropped nature of the input images. For instance, in the concrete example of I_b having been cropped out of a newspaper page, imprecise croppings and the inclusion of neighbouring article separator lines can cause segmentation algorithms to falsely combine multiple successive text lines.

• Line 3: It follows that all text areas in the preprocessed version I_p are horizontally dilated using structuring element E_{p_2×1}, in an attempt to concatenate consecutive characters of the same text line.

• Line 4: The next statement starts by removing vertical elements from I_p using an opening operation (protective step). This helps to prepare the background of the image (as seen in Figure 4), in a way that it should prevent the next step (horizontal dilation) from causing damage to the text areas. This next step consists of searching for horizontal background separators. Once found, they are dilated horizontally to increase their width. This should break unwanted connections between successive text lines (e.g. the second and third text line in Figure 4) that might have been created by the very first dilation. The protective first step thus attempts to avoid that background separators are detected in front of or after text lines that don't extend over the entire image width, and consequently cut into the text areas. The result of the entire statement is stored in S, representing the separators between text line areas.

• Line 5: Adding S to the inverted image version is equivalent to the subtraction from the regular one.
This explains the process of the last algorithm statement. As mentioned in the previous paragraph, the separators attempt to break unwanted text area connections. The final I_p of the example input image, as returned by Algorithm II.1, can be seen in Figure 6. Comparing with Figure 4, the addition of S ends up being the catalyst for the separation of text lines two and three.
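To illustrate the morphological building blocks used above, the following sketch implements opening and dilation for single image rows in plain Python. The real pipeline operates on 2-D images with 2-D structuring elements (e.g. via OpenCV's morphologyEx and dilate), so these 1-D versions are simplified stand-ins, not the published implementation:

```python
def dilate_h(row, k):
    """1-D horizontal dilation with a width-k structuring element of ones:
    a pixel becomes 1 if any pixel within distance k//2 is 1."""
    r = k // 2
    n = len(row)
    return [1 if any(row[max(0, i - r):i + r + 1]) else 0 for i in range(n)]

def erode_h(row, k):
    """1-D horizontal erosion: a pixel stays 1 only if the whole
    width-k window around it is 1."""
    r = k // 2
    n = len(row)
    return [1 if (i - r >= 0 and i + r < n and all(row[i - r:i + r + 1]))
            else 0 for i in range(n)]

def open_h(row, k):
    """Opening = erosion followed by dilation; removes runs shorter than k,
    which is exactly the property Line 2 exploits to isolate long
    separator elements before subtracting them."""
    return dilate_h(erode_h(row, k), k)
```

For example, opening a row with a long run of ones (a separator line) and a short run (part of a character) keeps only the long run; subtracting the opened row from the original then removes the separator while preserving the character, mirroring the opening-plus-subtraction idea of Line 2. The dilation of Line 3 corresponds to dilate_h with k = p_2.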

Components
Using I_p, all connected white pixel components are identified. For this purpose, 4-way connectivity is used (i.e. diagonal pixels are not considered connections). Next, all component bounding boxes, using coordinates within I_p, are determined and stored in a set of no particular order.
Let n denote the number of found components. To determine bounding box_i of the ith component, also let X_i and Y_i represent the set of all x-coordinates and y-coordinates in I_p of all pixels belonging to the ith component. It is defined that box_i spans from (min(X_i), min(Y_i)) to (max(X_i), max(Y_i)). Finally, the set boxes is used to store a first version of the bounding boxes that the algorithm will ultimately return. At the same time, bounding boxes that do not have minimum height p_6 are discarded. The edge case, where boxes is bound to be an empty set (|boxes| = 0), is covered through a fallback assignment. Using the process described in this subsection, a comp function is defined, mapping I_p to the set boxes.

Figure 7: Visualization of n = 3 connected components with minimum height p_6.
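The component step can be sketched as follows. The (x, y, w, h) box layout and the whole-image fallback for the empty edge case are assumptions of this sketch:

```python
def comp(I_p, p6):
    """Sketch of the comp step: find 4-connected components of the binary
    image I_p (list of 0/1 rows), turn each into a bounding box and filter
    by minimum height p6."""
    h, w = len(I_p), len(I_p[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if I_p[y0][x0] == 0 or seen[y0][x0]:
                continue
            # Flood fill using 4-way connectivity (no diagonal links).
            stack, xs, ys = [(y0, x0)], [], []
            seen[y0][x0] = True
            while stack:
                y, x = stack.pop()
                xs.append(x)
                ys.append(y)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w \
                            and I_p[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            box = (min(xs), min(ys),
                   max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
            if box[3] >= p6:  # discard boxes below minimum height p6
                boxes.append(box)
    if not boxes:  # edge case |boxes| = 0: assumed whole-image fallback
        boxes = [(0, 0, w, h)]
    return boxes
```

An OpenCV-based implementation would typically use connectedComponentsWithStats with connectivity=4 instead of the explicit flood fill shown here.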

Histogram
The final but essential step of COMBISEG is the analysis of horizontal histogram projections, one created for every bounding box stored in boxes. The global projection of I_b is denoted as H*, an array storing the number of white pixels in every row of the image, s.t. H*[y] = Σ_{x=0}^{w(I_b)−1} I_b[y][x], for 0 ≤ y ≤ h(I_b)−1.
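The global projection above amounts to a per-row pixel count and can be sketched as:

```python
def projection(I_b):
    """Global horizontal projection H*: the number of white (one) pixels
    in every row y of the binarized image I_b."""
    return [sum(row) for row in I_b]
```

A sub-projection for a bounding box is then simply the same count restricted to the rows (and columns) covered by that box.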
The sub-projection is then passed to the analysis procedure (Algorithm II.2), a method proposed by Ptak et al. [2017], that analyzes the locations of peaks in the histogram data. Next, based on the identified peaks, the procedure suggests division y-coordinates to split the underlying connected component bounding box.
This is done by identifying peaks using a relative threshold α_t, whose value is proportional to the height of the potential peak, i.e. the projection value h at coord. The set C keeps track of already analyzed coordinates. Should the new range of peak coordinates R be disjoint from C, a new peak has been identified and its delimiters p_start and p_end are added to the peak set P.
Algorithm II.2 Analyzes sub-projection and returns possible y divisor coordinates.
The analysis procedure in turn calls two different helper functions:

• The first helper accepts an ordered dictionary sub-projection H_i, a y-coordinate coord and the height threshold α_t as input. The function computes and returns the delimiting coordinates p_start and p_end of the histogram peak located at coord. This is done by iteratively increasing/decreasing coord by one, depending on the bound that should be discovered. A delimiting coordinate has been found once the value of H_i, for the given new coordinate, falls below the relative threshold α_t.

• valleys(H_i, P): The same sub-projection, together with all the computed peak delimiting coordinates P, is passed here. Once the set P is ordered increasingly, the procedure computes splitting y-coordinates, representing valley locations in the histogram data. This is done by first discarding the first (lowest) and last (highest) coordinates, before determining the valley locations between the remaining consecutive peak delimiters.

Bringing everything together, an updated set boxes_u is computed as indicated in Algorithm II.4.
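A sketch of the peak analysis described above follows. The function names are illustrative (not the paper's), and the valley selection between consecutive peaks by minimum projection value is an interpretation of the prose:

```python
def peak_delimiters(H, coord, alpha_t):
    """Walk down both sides of the peak at coord until the projection
    value falls below the relative threshold alpha_t."""
    start = coord
    while start - 1 in H and H[start - 1] >= alpha_t:
        start -= 1
    end = coord
    while end + 1 in H and H[end + 1] >= alpha_t:
        end += 1
    return start, end

def analysis(H, ratio=0.3):
    """Sketch of the peak/valley analysis (after Ptak et al.): visit rows
    from highest projection value downwards, grow a peak around each with
    a threshold relative to its height, keep peaks whose coordinate range
    is disjoint from everything analyzed so far, then split at the valley
    (minimum) between consecutive peaks."""
    covered, peaks = set(), []
    for coord in sorted(H, key=H.get, reverse=True):
        if H[coord] == 0 or coord in covered:
            continue
        alpha_t = ratio * H[coord]  # threshold relative to this peak height
        start, end = peak_delimiters(H, coord, alpha_t)
        if covered.isdisjoint(range(start, end + 1)):
            peaks.append((start, end))
        covered.update(range(start, end + 1))
    peaks.sort()
    # One splitting y-coordinate per gap between consecutive peaks.
    return [min(range(e1, s2 + 1), key=lambda y: H[y])
            for (_, e1), (s2, _) in zip(peaks, peaks[1:])]
```

On a sub-projection with two clear text line peaks separated by an empty row, this yields exactly one split coordinate at the empty (valley) row.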

Output
Algorithm II.5 ultimately provides the interactions between all three major steps involved in COMBISEG.
Algorithm II.5 The COMBISEG high level algorithm. 1 What remains to be clarified is the call to adjust in Line 5. As a postprocessing procedure, the bounding boxes are modified one last time, in a way that: • They are sorted by increasing box i [1].
• Their height is increased using parameter p_8, extending the updated box_i vertically. This increase is justified by the fact that the background dilation in morph can cause a slight height reduction, which should be compensated in adjust. In addition, it may be that bounding boxes with a greater amount of padding are preferred.
• Lastly, adjust merges all pairs of bounding boxes where box_i completely contains box_j. Optionally, adjust provides the occasion to add custom rules defining the merging of vertically overlapping bounding boxes (see Subsection 3.2).
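The three adjust rules can be sketched as follows, under the assumption of an (x, y, w, h) box layout and symmetric vertical padding by p_8:

```python
def adjust(boxes, p8, img_h):
    """Sketch of the adjust postprocessing: sort by starting y-coordinate,
    pad each box vertically by p8 (clipped to the image height), and drop
    any box that is fully contained in another box."""
    padded = []
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        top = max(0, y - p8)             # extend upwards by p8
        bottom = min(img_h, y + h + p8)  # extend downwards by p8
        padded.append((x, top, w, bottom - top))

    def contains(a, b):
        """Does box a fully contain box b?"""
        return (a[0] <= b[0] and a[1] <= b[1]
                and a[0] + a[2] >= b[0] + b[2]
                and a[1] + a[3] >= b[1] + b[3])

    # Keeping only boxes not contained in any other box is equivalent to
    # merging each contained pair into the larger box.
    return [b for i, b in enumerate(padded)
            if not any(i != j and other != b and contains(other, b)
                       for j, other in enumerate(padded))]
```

The symmetric top/bottom padding is one plausible reading of "increased using parameter p_8"; the published implementation may distribute the extra height differently.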
The result of the application of COMBISEG can be seen in Figure 10. In the case of the example input, the blue component in Figure 7 was the basis for a successful detection (by analysis) of two different peaks, which were split and turned into individual bounding boxes (overcoming the potential background stamp issue).

Parameters
In order to adapt to a specific target dataset and its proprietary characteristics, the set of params can be tuned accordingly. The respective parameters, together with their usage descriptions, are listed below.

Value Determination
The parameter values, experimentally leading to the best results on a BnL test set (more details will follow in Section III), are given by {…, 90, 25, 35, 330, 14, 0.3, 5}.
To provide some guidance on the approach of identifying (near) optimum values, the following three-step course of action is proposed: 1. Largely independent of every other parameter in the set are p_1, p_6 and p_8. Their values are generally determined by "preference choices". While p_1 should be considerably greater (depending on the aggressiveness of the desired preprocessing) than a regular text line height, p_6 should obviously be smaller, to solely filter out noise.
2. Next, there are the morphology related parameters p_2, p_3, p_4 and p_5. A similar order of magnitude, with respect to the ratio compared to the average text line height (42.9 pixels for the BnL test set), should be chosen as a starting point in every case. Afterwards, a small amount of discrete values should be brute forced (e.g. deltas of −10, 0 and +10, leading to 3^4 = 81 individual runs) by applying all combinations on a relatively small test set. It is of course essential to evaluate COMBISEG on the output of morph and comp only, discarding the subsequent hist component.
3. Lastly, p 7 should be determined independently for a given set of fixed morphology values. Hence, this final parameter can be learned in a supervised way through the overall performance of COMBISEG.
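The brute-force step for the morphology parameters can be sketched with itertools; the base values here are the best-performing p_2 to p_5 reported above:

```python
from itertools import product

def grid_candidates(base, deltas=(-10, 0, 10)):
    """Enumerate all delta combinations around a starting guess for the
    morphology parameters p2..p5, yielding 3^4 = 81 candidate tuples to
    evaluate on a small test set."""
    return [tuple(b + d for b, d in zip(base, combo))
            for combo in product(deltas, repeat=len(base))]
```

Each candidate tuple would then be scored on the morph and comp output only, as described in step 2.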

III EXPERIMENTAL RESULTS
For the purpose of evaluating the performance of COMBISEG, a BnL ground truth dataset is used and referred to as a set of tuples (I_b, gt) ∈ GT. It contains 5,958 text block images cropped from historic newspapers, with ground truth bounding boxes originating from human annotators. The number of text lines per sample ranges from 2 to 143, for a total of 114,625 lines. A Raw Ground Truth Pack, representing the copyright-free subset of GT, is available on the Open Data Platform of BnL (OCR Datasets section).

Evaluation Method
In order to run COMBISEG against GT , an evaluation method is required. Common comparison methods typically rely on the measures of recall/precision/F (Barakat et al. [2018]) or on the intersection over union (IoU) metric (Renton et al. [2017]).
Since COMBISEG represents a rather simple case, with perfectly horizontal, axis-aligned bounding boxes that generally extend over the entire width, the chosen loss function focuses on the number of unmatched text lines. Next to deducting the number of correctly matched lines from cardinality |gt| (recall), the function also punishes the detection of superfluous text lines (precision), while enforcing an overall upper bound of |gt|, in a way that

loss(boxes_u, gt) = min(|gt|, |gt| − |matched| + max(0, |boxes_u| − |gt|)),

where

matched = {a ∈ gt | ∃b ∈ boxes_u s.t. abs(mid(a) − mid(b)) < θ},

with mid denoting the middle y-coordinate of a bounding box. Commenting on this definition, a trivial matching that compares the middle y-coordinates of bounding boxes is established. This way the loss function represents the correct identification of individual text lines (the main challenge in the historic dataset), rather than focusing on perfect bounding box dimensions (e.g. IoU).
The threshold value θ is set to 14.3, representing a third of the average bounding box height in GT of 42.9 pixels. Hence, a matched bounding box still has considerable vertical shifting possibilities, reinforcing the main objective of text line identification.
Using this loss, the accuracy of COMBISEG can be compared to BENCH by computing

acc(algo) = 1 − Σ_{(I_b,gt)∈GT} loss(algo(I_b), gt) / Σ_{(I_b,gt)∈GT} |gt|.

Naturally, a shortcoming of loss is that it does not account for partial text line croppings resulting from inaccurate bounding box dimensions. However, no significant implication for the OCR accuracy is asserted for the BnL dataset. The conjecture is that the connected components part of COMBISEG generally helps to render rather precise bounding box dimensions.
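Under the assumption of an (x, y, w, h) box layout, the loss can be sketched as:

```python
def loss(boxes_u, gt, theta=14.3):
    """Sketch of the evaluation loss: unmatched ground-truth lines plus
    superfluous detections, capped at |gt|. A ground-truth line counts as
    matched when some output box has its middle y-coordinate within theta
    of the line's middle y-coordinate."""
    def mid(box):  # middle y-coordinate of an (x, y, w, h) box
        return box[1] + box[3] / 2

    matched = [a for a in gt
               if any(abs(mid(a) - mid(b)) < theta for b in boxes_u)]
    return min(len(gt),
               len(gt) - len(matched) + max(0, len(boxes_u) - len(gt)))
```

A perfect output yields loss 0; a missed line or a superfluous detection each add 1, and the min(...) cap keeps the per-sample loss bounded by |gt|.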

Postprocessing
The conducted experiments required postprocessing to make both algorithms more comparable. This is backed by the fact that BENCH tends to pursue a higher degree of fragmentation. Multiple bounding boxes are generally generated for text that is horizontally aligned correctly, but features particularly large spaces between words. Since BnL considers those cases to represent one single text line, custom rules had to be introduced to merge bounding boxes based on their vertical positions. Those rules are equally applied to the output of both BENCH and COMBISEG (in the context of the call to adjust).
Let box_i[1] and box_j[1], with i ≠ j, denote the starting y-coordinates of two successive bounding boxes in the list of all output boxes, ordered based on this value. A vertical overlap o between box_i and box_j is defined as the number of pixel rows covered by both boxes. A merger to a combined single bounding box is performed in case o represents more than 75% of the height of one of the single boxes, or more than 50% of the combined height of both boxes.
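A sketch of the merge rule follows, assuming (x, y, w, h) boxes. The "combined height" clause is read here as the overlap exceeding half the height of each box, which is an interpretation of the prose:

```python
def vertical_overlap(a, b):
    """Number of pixel rows covered by both (x, y, w, h) boxes."""
    return max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))

def should_merge(a, b):
    """Merge rule as described in the text: the overlap exceeds 75% of the
    height of one of the single boxes, or (interpreted here) 50% of the
    height of each of the two boxes."""
    o = vertical_overlap(a, b)
    return (o > 0.75 * min(a[3], b[3])
            or (o > 0.5 * a[3] and o > 0.5 * b[3]))
```

Merging itself would then replace the pair by the smallest box enclosing both.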

Results
By applying the constructed evaluation method of Section 3.1, together with the postprocessing rules of Section 3.2, both algorithms were evaluated on the GT dataset.
The implementation of COMBISEG is based on methods from the OpenCV library (Bradski [2000]) and can be accessed as part of the OCR project, which is publicly available. On the other hand, the implementation of BENCH was run as published with version 2.0.8 of the software.

IV CONCLUSION
In summary, the proposed segmentation method COMBISEG is a perfect fit for the BnL use case of segmenting newspaper articles into individual text lines. The algorithm performs well in terms of both accuracy and speed, thus having a positive effect on the overall OCR initiative. On top of that, it has the advantage of not requiring a separate training phase, as is the case for model-based machine learning approaches. Its overall robustness originates from the combination of two techniques commonly used in the literature. Morphological operations and horizontal histogram projections are combined in a way that they complement each other and cancel out their weaknesses.
• Generating connected components through dilation has the advantage of obtaining rather precise coordinate information. However, this technique is prone to be fooled by background noise or closely adjacent lines, leading to unwanted connections and potentially major segmentation mistakes.
• Histogram projections, on the other hand, are less sensitive to anomalies and mostly overcome them by considering the integrity of image rows. Nonetheless, evaluating the histogram data and finding the correct bounding box through peak/valley detection is not a trivial job. This is where the preliminary morphology work is able to help out.
Analyzing the limitations of COMBISEG, it should be stated that input images with a strong degree of skew or rotation can be the cause for some inaccuracies, such as the partial inclusion of a successive second text line in the same bounding box. Given that the algorithm is designed to strictly return horizontal boxes, a future BnL road-map could involve the expansion of the preceding binarization step, to include more advanced image processing, such as deskewing, rotation and image cleaning.
Setting aside this reflection on future work, COMBISEG marks a promising segmentation algorithm candidate, especially for OCR purposes in relation with historic datasets that are in a binarized, single-column format.