Working With Multistream BZ2 Compression - The Wikipedia XML Dump

For this project we will use Python to quickly extract Wikipedia articles from a 90 GB XML file.


Download The Wikipedia XML Dump

You will need a compressed XML dump file, for example enwiki-20211020-pages-articles-multistream.xml.bz2, and the corresponding index file, enwiki-20211020-pages-articles-multistream-index.txt.bz2.

Both files can be downloaded using the torrents listed at https://meta.wikimedia.org/wiki/Data_dump_torrents.


Multistream BZ2 Compression

The Wikipedia XML dump can be downloaded as a compressed multistream BZ2 file. When decompressed, the file contains the text of every Wikipedia article in XML format. Compression reduces the file size from approximately 90 GB of XML to roughly 20 GB. However, compression also makes the data more difficult to work with, because the text is no longer stored in a form that can be read or searched directly.

Multistream BZ2 compression addresses this problem by splitting the uncompressed file into chunks, compressing each chunk independently, and concatenating the compressed chunks (streams) into a single multistream BZ2 file. Like an ordinary BZ2 file, the multistream file can be decompressed in full to recover the original file.

The advantage of multistream is that any chunk of the file can be decompressed independently, without decompressing the entire file. This allows us to decompress chunks on demand, and in parallel.
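
As a minimal illustration of the idea (separate from the Wikipedia workflow, with a throwaway file name), two chunks of text can be compressed independently and concatenated into one valid multistream BZ2 file, and either stream can later be decompressed on its own:

    
    import bz2

    # Compress two chunks independently; each call produces a complete BZ2 stream.
    stream_a = bz2.compress(b'first chunk of text\n')
    stream_b = bz2.compress(b'second chunk of text\n')

    # Concatenating the streams produces a valid multistream BZ2 file.
    with open('multi.bz2', 'wb') as f:
        f.write(stream_a + stream_b)

    # Decompressing the whole file yields the original data...
    assert bz2.decompress(stream_a + stream_b) == b'first chunk of text\nsecond chunk of text\n'

    # ...and either stream can be decompressed on its own, without touching the other.
    assert bz2.decompress(stream_b) == b'second chunk of text\n'
    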


Multistream BZ2 Indexing

We will search the multistream index file to find the starting byte and number of bytes for a specific chunk of the BZ2 file. Here is some of the data in the index file:

    
    $ head -105 'enwiki-latest-pages-articles-multistream-index.txt' | tail -10

    616:589:Ashmore And Cartier Islands
    616:590:Austin (disambiguation)
    616:593:Animation
    616:594:Apollo
    616:595:Andre Agassi
    631676:596:Artificial languages
    631676:597:Austroasiatic languages
    631676:598:Afro-asiatic languages
    631676:599:Afroasiatic languages
    631676:600:Andorra
    

The index file has three columns, delimited by a colon (:). The first column is the starting byte of the chunk containing the article, the second column is the article id number, and the third column is the article title.

We will calculate a chunk's size by finding the difference between its starting byte and the starting byte of the next chunk. For example, the Apollo article in the list above is in a chunk that starts at byte 616 and has a size of 631676 - 616 = 631060 bytes.
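
The arithmetic is just a subtraction of two start bytes; here is a small sketch using two of the index lines shown above:

    
    # Index line for the 'Apollo' article and the first line of the next chunk (from the listing above).
    apollo_line = '616:594:Apollo'
    next_chunk_line = '631676:596:Artificial languages'

    # Split on the first two colons only, so article titles containing colons stay intact.
    start_byte = int(apollo_line.split(':', 2)[0])
    next_start_byte = int(next_chunk_line.split(':', 2)[0])

    data_length = next_start_byte - start_byte
    print(start_byte, data_length)    # 616 631060
    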


Searching The Index

Once decompressed, the index file is almost 1 GB, so we will need to search it without loading the entire file into memory. One way to do this is to use csv.reader() to iterate through the index file one line at a time.

The following function will search through the index file for the search_term, and then return the starting byte and length of the chunk that contains the article.

    
    import csv

    def search_index(search_term, index_filename):
        byte_flag = False
        data_length = start_byte = 0
        # Stream the index one line at a time; line[0] is the chunk's start byte,
        # line[1] the article id, and line[2] the article title.
        with open(index_filename, 'r') as index_file:
            csv_reader = csv.reader(index_file, delimiter=':')
            for line in csv_reader:
                if not byte_flag and search_term == line[2]:
                    # Found the article: remember the start byte of its chunk.
                    start_byte = int(line[0])
                    byte_flag = True
                elif byte_flag and int(line[0]) != start_byte:
                    # First line of the next chunk: the difference is the chunk's length.
                    data_length = int(line[0]) - start_byte
                    break
        return start_byte, data_length
    

If we call search_index() with the search_term "Apollo", the function will return (616, 631060) as (start_byte, data_length):

    
    >>> search_term = 'Apollo'
    >>> index_filename = 'enwiki-latest-pages-articles-multistream-index.txt'
    >>> search_index(search_term, index_filename)
    (616, 631060)
    


Decompressing Multistream Chunks

Now that we know the start byte and length of the chunk we want to decompress, we will read that specific 631 KB chunk from the 20 GB compressed file and write it to a temporary BZ2 file. Then we will decompress the temp file into a 2.4 MB XML file that contains the article we searched for.

    
    import bz2
    import shutil

    def decompress_chunk(wiki_filename, start_byte, data_length):
        temp_filename = 'chunk.bz2'
        decomp_filename = 'chunk.xml'

        # Read only the bytes belonging to the chunk we want.
        with open(wiki_filename, 'rb') as wiki_file:
            wiki_file.seek(start_byte)
            data = wiki_file.read(data_length)

        # Write the chunk to a temporary file; it is a complete BZ2 stream on its own.
        with open(temp_filename, 'wb') as temp_file:
            temp_file.write(data)

        # Decompress the temporary file into an XML file containing the chunk's articles.
        with bz2.BZ2File(temp_filename) as fr, open(decomp_filename, 'wb') as fw:
            shutil.copyfileobj(fr, fw)

        return decomp_filename
    

    
    >>> search_term = 'Apollo'
    >>> index_filename = 'enwiki-latest-pages-articles-multistream-index.txt'
    >>> wiki_filename = 'enwiki-latest-pages-articles-multistream.xml.bz2'
    >>> start_byte, data_length = search_index(search_term, index_filename)
    >>> decompress_chunk(wiki_filename, start_byte, data_length)
    'chunk.xml'
    
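
As an aside, the temporary file is not strictly required. Because each chunk is a complete, self-contained BZ2 stream, it can also be decompressed entirely in memory with bz2.decompress(). Here is a minimal sketch of that alternative (the helper name decompress_chunk_in_memory is our own, not part of the code above):

    
    import bz2

    def decompress_chunk_in_memory(wiki_filename, start_byte, data_length):
        # Read only the bytes belonging to the chunk, exactly as in decompress_chunk().
        with open(wiki_filename, 'rb') as wiki_file:
            wiki_file.seek(start_byte)
            data = wiki_file.read(data_length)
        # The chunk is a self-contained BZ2 stream, so it can be decompressed directly.
        return bz2.decompress(data).decode('utf-8')
    

The returned string is the same XML that decompress_chunk() writes to chunk.xml.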


