Analysis of the quality of educational materials, or how it didn't work out for us

Good day.

Today I will tell you about our attempt to analyze educational materials, our fight for the quality of these documents, and the disappointment we came away with. "We" are a pair of students from Bauman Moscow State Technical University. If you are interested, read on!

Problem

We set out to assess the quality of educational materials (guidelines, textbooks, and so on) using statistical indicators. There were quite a few of them; here are some: the deviation of the number of chapters from the “ideal” (equal to five), the average number of characters per page, the average number of diagrams per page, and so on down the list. Not so difficult, right? But that was only the beginning: if it worked, building an ontology and semantic analysis were to come next.
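The indicators listed above are easy to compute once the per-page data is available. Here is a minimal sketch, assuming the extraction step already provides the chapter count, the text of every page, and the number of diagrams on every page. The function name `quality_indicators`, the result keys, and the choice of an absolute deviation are illustrative assumptions, not something fixed by the task.

```python
from statistics import mean

IDEAL_CHAPTER_COUNT = 5  # the "ideal" number of chapters named above


def quality_indicators(chapter_count, page_texts, diagrams_per_page):
    """Compute three of the statistical indicators for one document.

    All arguments are hypothetical inputs that an extraction step would
    have to provide: the number of chapters, the text of every page, and
    the number of diagrams counted on every page.
    """
    if not page_texts or len(page_texts) != len(diagrams_per_page):
        raise ValueError("need one text and one diagram count per page")
    return {
        # Assumption: the deviation is taken as an absolute difference.
        "chapter_deviation": abs(chapter_count - IDEAL_CHAPTER_COUNT),
        "avg_chars_per_page": mean(len(text) for text in page_texts),
        "avg_diagrams_per_page": mean(diagrams_per_page),
    }


# Made-up sample data, only to show the call.
indicators = quality_indicators(
    chapter_count=7,
    page_texts=["abcd", "ab", ""],
    diagrams_per_page=[1, 0, 2],
)
print(indicators)
```

The hard part, as it turned out, is not this arithmetic but obtaining trustworthy inputs for it.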
Tools and raw data

The source materials were all sorts of manuals and textbooks in PDF, and that is where the trouble lay. More precisely, the problem was not the materials themselves but the PDF format and the quality of the conversion.

To work with PDF we decided to use Python and some trendy library; we chose pdfminer.six.

History

At first we tried various Python libraries, but they did not get along well with Cyrillic, and our literature was written in Russian. Besides, the simplest libraries could only pull out the text, which was not enough for us. Having settled on pdfminer.six, we began to prototype, experiment, and have fun. Fortunately, there were enough examples to get us started.

We created our own PDF documents with text, images, tables, and more. Everything went well: we could easily pull any element out of our documents.

This is what a document page looks like in our representation:

[Figure: document page]

Here is a small example of interacting with a document: getting its text.
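    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output = StringIO()
    manager = PDFResourceManager()
    # TextConverter appends the text of every processed page to `output`.
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(path, 'rb') as file:  # `path` points to the PDF file
        document = PDFDocument(PDFParser(file))
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()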
As you can see, getting the text out of a document is quite simple. Any interaction with a document follows the scheme below:

[Figure: scheme of interaction with a document]
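Pulling out elements other than text follows the same pattern, except that pages are turned into layout objects. Below is a sketch of counting the elements on each page, assuming a version of pdfminer.six that provides `pdfminer.high_level.extract_pages`. The helper names `count_elements` and `count_elements_in_pdf` are ours, and matching figures by class name is a simplification that keeps the counting function usable without pdfminer installed.

```python
from collections import Counter


def count_elements(page_layout):
    """Count layout objects on one page by class name.

    Figures are containers (for example, for embedded images), so we
    descend into them; text boxes are counted as a whole.
    """
    counts = Counter()
    stack = list(page_layout)
    while stack:
        element = stack.pop()
        name = type(element).__name__
        counts[name] += 1
        if name == "LTFigure":  # simplification: match by class name
            stack.extend(element)
    return counts


def count_elements_in_pdf(path):
    """Return one Counter per page of the PDF at `path`."""
    # Imported here so that count_elements stays usable without pdfminer.
    from pdfminer.high_level import extract_pages

    return [count_elements(page) for page in extract_pages(path)]
```

For example, `count_elements_in_pdf("book.pdf")[0]["LTImage"]` would give the number of images found on the first page, where `book.pdf` is a placeholder file name.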
Why didn't it work out?

All the experiments were successful, and everything was fine on our test PDF files. But breaking everything turned out to be trivial, and the idea shattered against harsh reality.

After the experiments we took a few real textbooks and found that anything could go wrong.

The first thing we noticed: the number of images counted by the program was wrong, and parts of the text were simply lost.

It turned out that some (sometimes even many) parts of the text in a document were not stored as text, and we do not know how that happened. This immediately ruled out analyzing the frequency distribution of characters, words, and phrases, semantic analysis, and indeed any other kind of text analysis.
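One way to notice such documents early is to check how much text each page actually yields. Here is a rough sketch, assuming the text of every page has already been extracted into a list of strings. The function names and the threshold of 200 characters are arbitrary illustrative choices, not values we validated.

```python
def pages_without_text_layer(page_texts, min_chars=200):
    """Return 1-based numbers of pages that yield suspiciously little text.

    A page whose extracted text is shorter than `min_chars` (ignoring
    whitespace) may have its text stored as images or curves. Pages that
    really are almost empty (title pages, full-page figures) are flagged
    too, so the result is only a hint.
    """
    suspicious = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # drop spaces and line breaks
        if len(visible) < min_chars:
            suspicious.append(number)
    return suspicious


def text_coverage(page_texts, min_chars=200):
    """Share of pages (0.0 to 1.0) that yield at least `min_chars` characters."""
    if not page_texts:
        return 0.0
    flagged = pages_without_text_layer(page_texts, min_chars)
    return 1 - len(flagged) / len(page_texts)
```

A document with low coverage could then be excluded from text-based indicators instead of silently distorting them.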
Perhaps something unexpected happened when these documents were converted or created, or perhaps nobody needed them to be formed “correctly”. Unfortunately, such materials were the majority, which left us disappointed in the idea of this kind of analysis.

Literature

The documentation section of the pdfminer.six repository was used as a reference while writing this article.
Author: weber
Publication date: 9-10-2018, 20:36
Category: Development / Programming