Apache Lucene architecture

3/16/2023

The following table shows an example of an inverted index substructure built from a set of documents:

![Example of an inverted index built from a set of documents]()

Scoring is the process that compares the user input against the stored documents and assigns a relevance value to each result. Similar to how the Google search engine works, the outcome of the scoring process is a sorted array containing all the search matches, ordered by score.

Three aspects should be considered for the scoring process.

The first aspect is the definition of relevance itself. Even though different users perform the same search, the relevant results could differ for each of them; therefore, relevance should be aligned with the users' expectations.

The second aspect concerns knowledge about a user's satisfaction with the relevance scores provided by the search. In fact, feeding the user's perception back to the algorithm is a fascinating field related to reinforcement learning techniques, which by itself would warrant a separate article.

The third aspect encompasses the algorithms that calculate the scores. There are two main scoring algorithms: Term Frequency-Inverse Document Frequency (TF-IDF) and Best Match 25 (BM25). Both algorithms are rooted in the concept of tokenisation.

Tokenisation is a fundamental concept in the field of Natural Language Processing (NLP) that is also applied to search engines. It consists of splitting a text string or document into tokens, or smaller chunks. Although tokenisation libraries provide a set of methods for splitting text, users can implement their own rules using regular expressions.
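To make the inverted index described above concrete, here is a minimal Python sketch of the idea. The `build_inverted_index` helper and the sample documents are illustrative assumptions, not Lucene's actual implementation; the point is only that the structure maps each token to the documents containing it.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it.
    Illustrative sketch only; Lucene stores far richer per-term data."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = [
    "Lucene is a search library",
    "Search engines use an inverted index",
]
index = build_inverted_index(docs)
# "search" occurs in both documents, "lucene" only in the first
print(sorted(index["search"]))  # -> [0, 1]
```

Looking up a term is then a dictionary access rather than a scan over every document, which is what makes the structure attractive for search.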
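The scoring process described above, producing a sorted array of matches, can be sketched with a simple TF-IDF variant. This is a hypothetical toy (the `tf_idf` and `search` helpers, the smoothed IDF formula, and the sample corpus are all assumptions for illustration), not Lucene's actual similarity implementation.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Score one term in one tokenised document (smoothed toy variant)."""
    tf = Counter(doc)[term] / len(doc)          # term frequency in this document
    df = sum(1 for d in corpus if term in d)    # number of documents containing the term
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

def search(query, corpus):
    """Return (doc_id, score) pairs sorted by descending relevance score."""
    scores = [
        (i, sum(tf_idf(t, doc, corpus) for t in query))
        for i, doc in enumerate(corpus)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

corpus = [
    ["lucene", "search", "library"],
    ["search", "engine"],
    ["cooking", "recipes"],
]
results = search(["lucene", "search"], corpus)
```

Note how the rarer term ("lucene") contributes a higher IDF than the common term ("search"), so documents matching it rank higher; the returned list is the sorted array of scored matches.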
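The custom regex-based tokenisation mentioned above might look like the following sketch. The `tokenize` helper and its default pattern are assumptions chosen for illustration; real analyzers add steps such as stemming and stop-word removal.

```python
import re

def tokenize(text, pattern=r"[a-z0-9]+"):
    """Lowercase the text and extract tokens with a user-supplied regex.
    The default pattern keeps runs of letters/digits and drops punctuation."""
    return re.findall(pattern, text.lower())

tokens = tokenize("Lucene 9.5: fast, full-text search!")
# -> ['lucene', '9', '5', 'fast', 'full', 'text', 'search']
```

Swapping in a different pattern (for example one that keeps `9.5` as a single token) is how a user would encode their own splitting rules.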