For this project we will use Python to quickly extract Wikipedia articles from a 90 GB XML file.
You will need a compressed XML dump file, for example enwiki-20211020-pages-articles-multistream.xml.bz2, along with the corresponding index file, enwiki-20211020-pages-articles-multistream-index.txt.bz2.
Both can be downloaded with the torrents listed at https://meta.wikimedia.org/wiki/Data_dump_torrents
The Wikipedia XML dump can be downloaded as a compressed multistream BZ2 file. When decompressed, the file contains all text from all Wikipedia articles in XML format. Compression reduces the XML file size from approximately 90 GB to 20 GB. However, compression also makes the data more difficult to work with, because it is encoded by a compression algorithm, and not stored in plain text.
Multistream BZ2 compression addresses this problem by splitting the uncompressed file into chunks, compressing each chunk independently, and concatenating the compressed chunks into a single multistream BZ2 file. Decompressing the whole file still yields the original data, just like any other compression format.
The advantage of multistream is that any chunk of the file can be decompressed independently, without decompressing the entire file. This allows us to decompress chunks on demand, and in parallel.
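As a quick preview, a single chunk can be decompressed in memory with Python's bz2 module. This is only a sketch: the start_byte and data_length values below are the chunk boundaries we will look up in the index file in the next step, and the filename matches the dump used later in this article.
import bz2

# Chunk boundaries; in practice these come from the multistream index file
start_byte = 616
data_length = 631060

with open('enwiki-latest-pages-articles-multistream.xml.bz2', 'rb') as f:
    f.seek(start_byte)            # jump to the start of one compressed chunk
    chunk = f.read(data_length)   # read only that chunk's bytes

xml_text = bz2.decompress(chunk)  # decompress this chunk on its own
print(len(xml_text))              # size of the uncompressed XML for the chunk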
We will search the multistream index file to find the starting byte and number of bytes for a specific chunk of the BZ2 file. Here is some of the data in the index file:
$ head -105 'enwiki-latest-pages-articles-multistream-index.txt' | tail -10
616:589:Ashmore And Cartier Islands
616:590:Austin (disambiguation)
616:593:Animation
616:594:Apollo
616:595:Andre Agassi
631676:596:Artificial languages
631676:597:Austroasiatic languages
631676:598:Afro-asiatic languages
631676:599:Afroasiatic languages
631676:600:Andorra
The index file has three columns, delimited by a colon (:). The first column is the starting byte of the chunk containing the article, the second column is the article ID number, and the third column is the article title.
We will calculate a chunk's size by finding the difference between its starting byte and the starting byte of the next chunk. For example, the Apollo article in the list above is in a chunk that starts at byte 616 and has a size of 631676 - 616 = 631060 bytes.
After decompression, the index file is almost 1 GB, so we will need to search it without loading the entire file into memory. One way to do this is to use csv.reader() to iterate through the index file one line at a time.
The following function searches the index file for the search_term and returns the starting byte and length of the chunk that contains the article.
import csv

def search_index(search_term, index_filename):
    byte_flag = False
    data_length = start_byte = 0
    index_file = open(index_filename, 'r')
    csv_reader = csv.reader(index_file, delimiter=':')
    for line in csv_reader:
        if not byte_flag and search_term == line[2]:
            # Found the article: record the starting byte of its chunk
            start_byte = int(line[0])
            byte_flag = True
        elif byte_flag and int(line[0]) != start_byte:
            # The next chunk's starting byte gives us this chunk's length
            data_length = int(line[0]) - start_byte
            break
    index_file.close()
    return start_byte, data_length
If we call search_index() with the search_term "Apollo", the function will return (616, 631060) as the (start_byte, data_length):
>>> search_term = 'Apollo'
>>> index_filename = 'enwiki-latest-pages-articles-multistream-index.txt'
>>> search_index(search_term, index_filename)
(616, 631060)
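One edge case to keep in mind: if the article we search for is in the very last chunk, the loop in search_index() never sees a line with a different starting byte, so it returns a data_length of 0. A minimal sketch of handling that case, assuming wiki_filename is the compressed .xml.bz2 dump used in the next step, is to fall back to reading to the end of the file:
import os

start_byte, data_length = search_index(search_term, index_filename)
if data_length == 0:
    # The article is in the final chunk; read to the end of the compressed dump
    data_length = os.path.getsize(wiki_filename) - start_byte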
Now that we know the start byte and length of the chunk we want to decompress, we will read those specific 631 KB from the 20 GB compressed file, and write them to a temporary BZ2 file. Then we will decompress the temp file into a 2.4 MB XML file that contains the article we searched for.
import bz2
import shutil

def decompress_chunk(wiki_filename, start_byte, data_length):
    temp_filename = 'chunk.bz2'
    decomp_filename = 'chunk.xml'
    # Read only the bytes for this chunk from the compressed dump
    with open(wiki_filename, 'rb') as wiki_file:
        wiki_file.seek(start_byte)
        data = wiki_file.read(data_length)
    # Write the chunk to a temporary BZ2 file
    with open(temp_filename, 'wb') as temp_file:
        temp_file.write(data)
    # Decompress the temporary file into an XML file
    with bz2.BZ2File(temp_filename) as fr, open(decomp_filename, 'wb') as fw:
        shutil.copyfileobj(fr, fw)
    return decomp_filename
>>> search_term = 'Apollo'
>>> index_filename = 'enwiki-latest-pages-articles-multistream-index.txt'
>>> wiki_filename = 'enwiki-latest-pages-articles-multistream.xml.bz2'
>>> start_byte, data_length = search_index(search_term, index_filename)
>>> decompress_chunk(wiki_filename, start_byte, data_length)
'chunk.xml'
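The decompressed chunk.xml contains not just the Apollo article but every page in that chunk, and for chunks after the first there is no enclosing root element, just a bare sequence of <page> elements. Here is a minimal sketch of pulling one article's wikitext out of the chunk; extract_article is a hypothetical helper, not part of the code above, and it assumes the bare-<page> layout just described.
import xml.etree.ElementTree as ET

def extract_article(chunk_filename, title):
    # Wrap the bare <page> elements in a synthetic root so the XML is well formed
    with open(chunk_filename, 'r', encoding='utf-8') as f:
        xml_text = '<root>' + f.read() + '</root>'
    root = ET.fromstring(xml_text)
    for page in root.iter('page'):
        if page.findtext('title') == title:
            # Return the raw wikitext stored in <revision><text>
            return page.findtext('./revision/text')
    return None
Calling extract_article('chunk.xml', 'Apollo') would then return the wikitext of the Apollo article, or None if the title is not found in the chunk.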