3 minute read

Information Retrieval Programming Assignment - BM25 Search Engine

https://github.com/echo-cool/COMP3009J-Information-Retrival-BM25-Based-Search-Engine

Features

  • Implemented BM25 model and Precision, Recall, Precision@10, R-Precision, MAP, b_pref, NDCG for evaluation.
  • Automatically evaluate the program and print the evaluation metrics.
  • Automatically build and dump the index to a file, no need to read the whole dataset a second time.
  • Allowing compression to reduce the index file(from 211 MB -> 155 MB for large corpus, but this will affect the speed of saving and loading, so by default this feature is not enabled. If you want to have a try, please change the enableCompression to true in search-large-corpus.py(This flag is in the build_index function).
  • Pre-calculated BM25 score, thus achieves a higher query performance.
  • The index file is saved in index.json
  • The standard query result is saved in output.txt
  • The stemming and stopwords removal is cached in memory(using dict which is the HashMap in Python), thus improves the performance.

Instructions

I have attempted both of the small and large corpus, so there is a separated program for each of them.

  • The program for small corpus: search-small-corpus.py
  • The program for large corpus: search-large-corpus.py

Please copy the search-small-corpus.py to the small corpus folder and search-large-corpus.py to the large corpus folder. The structure of the corpus should as follows:

COMP3009J-corpus-small/
├──documents/
│  ├── ...
├──files/
│  ├── porter.py
│  ├── qrels.txt``
│  ├── queries.txt
│  ├── stopwords.txt
├──search-small-corpus.py

after running this program the index.json and output.txt should be generated.

COMP3009J-corpus-small/
├──documents/
│  ├── ...
├──files/
│  ├── porter.py
│  ├── qrels.txt``
│  ├── queries.txt
│  ├── stopwords.txt
├──index.json
├──output.txt
├──search-small-corpus.py

Same for the large corpus.

Small corpus

For small corpus, please switch to the director using this commends:

cd COMP3009J-corpus-small/

and run the program:

Automatic evaluation

python search-small-corpus.py -m evaluation

the output will be like:

> python search-small-corpus.py -m evaluation     
Loading index from file...
Build or loading index cost: 0.05389000000000001
Written to file: output.txt
Query cost: 0.20161800000000002
Precision    (0.41186858796527687)
Recall       (0.44108936779729624)
Precision@10 (0.26044444444444453)
R-Precision  (0.35333145124990645)
MAP          (0.33289896356986537)
b_pref       (0.3315311255637636)
NDCG_score   (0.3703139179229204)
Evaluate cost: 0.008191000000000004

Input query manually

input queries manually:

python search-small-corpus.py -m manual

the output will be like:

> python search-small-corpus.py -m manual
Loading index from file...
Build or loading index cost: 0.047557999999999996

Please input query.
>>> 

once you finished input, you can press enter and the program should display the result immediately:

Please input query.
>>> what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft
Time <Query>: 0.0016260000000000024
Query 1: what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft
  Query ID         Q0                         Doc ID    Ranking                Score           Student ID
         1         Q0                             51          1            27.834510             19206226
         1         Q0                            486          2            26.818613             19206226
         1         Q0                             12          3            23.858651             19206226
         1         Q0                            184          4            22.988724             19206226
         1         Q0                            878          5            21.514560             19206226
         1         Q0                            665          6            18.871190             19206226
         1         Q0                            573          7            18.665324             19206226
         1         Q0                            944          8            17.101055             19206226
         1         Q0                            141          9            16.551875             19206226
         1         Q0                            746         10            16.203965             19206226
         1         Q0                             78         11            16.123421             19206226
         1         Q0                             14         12            15.558147             19206226
         1         Q0                            453         13            14.795982             19206226
         1         Q0                            329         14            14.478334             19206226
         1         Q0                             13         15            14.468891             19206226

Large corpus

For large corpus, please switch to the director using this commends:

cd COMP3009J-corpus-large/

and run the program:

Automatic evaluation

python search-large-corpus.py -m evaluation

the output will be like:

> python search-large-corpus.py -m evaluation
Loading index from file...
Build or loading index cost: 3.209099
Written to file: output.txt
Query cost: 0.8714900000000001
Precision    (0.4235177378905466)
Recall       (0.896058451391523)
Precision@10 (0.5827160493827157)
R-Precision  (0.5296578389467479)
MAP          (0.5551753930814257)
b_pref       (0.40483167942246195)
NDCG_score   (0.5564884484005923)
Evaluate cost: 0.030048999999999992

Input query manually

input queries manually:

python search-large-corpus.py -m manual

the output will be like:

> python search-large-corpus.py -m manual
Loading index from file...
Build or loading index cost: 3.28638

Please input query.
>>> 

once you finished input, you can press enter and the program should display the result immediately:

Please input query.
>>> describe history oil industry
Time <Query>: 0.00796000000000019
Query 1: describe history oil industry
  Query ID         Q0                         Doc ID    Ranking                Score           Student ID
         1         Q0               GX232-43-0102505          1             9.044560             19206226
         1         Q0              GX255-56-12408598          2             8.675164             19206226
         1         Q0               GX229-87-1373283          3             8.621311             19206226
         1         Q0               GX253-41-3663663          4             8.543551             19206226
         1         Q0               GX064-43-9736582          5             8.511779             19206226
         1         Q0              GX268-35-11839875          6             8.482392             19206226
         1         Q0              GX231-53-10990040          7             8.411094             19206226
         1         Q0              GX262-28-10252024          8             8.351232             19206226
         1         Q0               GX063-18-3591274          9             8.346404             19206226
         1         Q0              GX263-63-13628209         10             8.226278             19206226
         1         Q0               GX253-57-7230055         11             8.225637             19206226
         1         Q0              GX262-86-10646381         12             8.145205             19206226
         1         Q0              GX000-48-10208090         13             8.140996             19206226
         1         Q0              GX128-96-12152039         14             8.137248             19206226
         1         Q0              GX255-59-12399984         15             8.127072             19206226