Analysis of the quality of educational materials, or how it didn’t work for us

Good day.

Today I will tell you about our attempts to get to grips with analyzing educational materials, our struggle for the quality of these documents, and the disappointment we ended up with. "We" are a pair of students from Bauman Moscow State Technical University (MSTU). If you are interested, read on!

Problem

We set out to assess the quality of educational materials (guidelines, textbooks, and so on) using statistical indicators. There were quite a few of them; here are some: the deviation of the number of chapters from the "ideal" (five), the average number of characters per page, the average number of schemes per page, and so on down the list. Not so difficult, is it? But that was only the beginning: had it worked, the next steps would have been building an ontology and semantic analysis.
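To make the indicators concrete, here is a minimal sketch of how three of them could be computed once per-page counts are available. The `PageStats` class, the `quality_indicators` function, and the sample numbers are hypothetical and only illustrate the idea; taking the deviation as an absolute difference is also an assumption, since the definition is not spelled out above.

```python
from dataclasses import dataclass
from statistics import mean

IDEAL_CHAPTER_COUNT = 5  # the "ideal" number of chapters mentioned above


@dataclass
class PageStats:
    characters: int  # characters of text on the page
    schemes: int     # schemes (diagrams) on the page


def quality_indicators(chapter_count, pages):
    """Return three example indicators for one document."""
    if not pages:
        raise ValueError("the document has no pages")
    return {
        # Assumption: deviation is measured as an absolute difference.
        "chapter_deviation": abs(chapter_count - IDEAL_CHAPTER_COUNT),
        "avg_characters_per_page": mean(p.characters for p in pages),
        "avg_schemes_per_page": mean(p.schemes for p in pages),
    }


# Made-up numbers for a three-page document with seven chapters.
sample_pages = [PageStats(1800, 1), PageStats(2200, 0), PageStats(2000, 2)]
indicators = quality_indicators(chapter_count=7, pages=sample_pages)
print(indicators)
```

The hard part is not this arithmetic but obtaining trustworthy per-page counts from a PDF in the first place.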

Tools and raw data

The problem lay in the source materials: all sorts of manuals and textbooks in PDF. More precisely, the problem was not the materials themselves but the PDF format and the quality of the conversion.
To work with PDF we decided to use Python and some fashionable library, and the choice fell on pdfminer.six.

History

At first we tried various Python libraries, but they did not get along well with Cyrillic, and our literature was written in Russian. Besides, the simplest libraries could only extract text, which was not enough for us. Having settled on pdfminer.six, we began to prototype, experiment, and have fun. Fortunately, there were enough examples to get us started.

We created our own PDF documents with text, images, tables, and more. Everything went well: we could easily extract any element from our documents.

This is what a document page looks like in our representation:

[Figure: a document page in our representation]

Here is a small example of working with a document: extracting its text.
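
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# path is the location of the PDF file to read
output = StringIO()
manager = PDFResourceManager()
converter = TextConverter(manager, output, laparams=LAParams())
interpreter = PDFPageInterpreter(manager, converter)
with open(path, 'rb') as file:
    parser = PDFParser(file)
    document = PDFDocument(parser)
    # The converter writes the text of every processed page into output
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
converter.close()
text = output.getvalue()
output.close()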


As you can see, getting the text out of a document is quite simple. Any interaction follows the scheme below.

[Figure: scheme of interaction with a document]
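As an illustration of pulling out elements other than plain text, the sketch below walks the layout tree of each page and counts objects by class name (for example `LTImage` or `LTTextBoxHorizontal`). It is not the authors' code: `count_elements` and `count_pdf_elements` are hypothetical helpers, and the sketch assumes a pdfminer.six release that provides `pdfminer.high_level.extract_pages`.

```python
from collections import Counter


def count_elements(element, counts=None):
    """Recursively count layout objects by class name."""
    counts = Counter() if counts is None else counts
    counts[type(element).__name__] += 1
    # Text objects expose get_text(); treat them as leaves so that
    # individual lines and characters are not counted separately.
    if hasattr(element, "get_text"):
        return counts
    try:
        children = iter(element)  # containers such as LTPage or LTFigure
    except TypeError:
        return counts  # leaf objects such as LTImage
    for child in children:
        count_elements(child, counts)
    return counts


def count_pdf_elements(path):
    """Count layout objects over all pages of the PDF at path."""
    # Imported here so the helper above stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages

    totals = Counter()
    for page_layout in extract_pages(path):
        count_elements(page_layout, totals)
    return totals
```

Calling `count_pdf_elements("textbook.pdf")` (a hypothetical file name) returns a `Counter`, so `totals["LTImage"]` would be the number of image objects pdfminer.six found. That number reflects how the PDF was produced, which is not necessarily what a reader would count as pictures.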

Why didn't it work out?

All the experiments succeeded, and everything worked fine on our test PDF files. As it turned out, breaking all of this is trivial, and the idea shattered against harsh reality.
After the experiments, we took a few real textbooks and found that anything can go wrong.

The first thing we noticed: the number of images counted by the program was wrong, and parts of the text were simply lost.

It turned out that some (sometimes even many) parts of the text in a document were not stored as text, and we do not know how that came about. This immediately ruled out analyzing the frequency distribution of characters, words, and phrases, semantic analysis, and indeed any other kind of text analysis.
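One way to spot such documents early is to check how much text can actually be extracted from each page. The sketch below is a rough heuristic and not something we used: `pages_with_little_text` and `extracted_chars_per_page` are hypothetical helpers, the threshold of 200 characters is an arbitrary assumption, and the code assumes a pdfminer.six release that provides `extract_pages`.

```python
def pages_with_little_text(chars_per_page, min_chars=200):
    """Return 1-based numbers of pages with fewer than min_chars characters."""
    return [
        number
        for number, chars in enumerate(chars_per_page, start=1)
        if chars < min_chars
    ]


def extracted_chars_per_page(path):
    """Count the non-whitespace characters extracted from each page."""
    # Imported here so the helper above stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    counts = []
    for page_layout in extract_pages(path):
        # Only top-level text boxes are inspected; text nested inside
        # figures is ignored in this sketch.
        text = "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )
        counts.append(len("".join(text.split())))
    return counts
```

For example, `pages_with_little_text(extracted_chars_per_page("textbook.pdf"))` (a hypothetical file name) lists the pages worth checking by eye. A low count is only a hint: title pages and full-page illustrations legitimately contain little text.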

Perhaps something unexpected happened when these documents were converted or created, or perhaps nobody needed them to be formed "correctly". Unfortunately, such materials were the majority, which left us disappointed in the whole idea of this kind of analysis.

Literature

The documentation section of the pdfminer.six repository was used as a reference while writing this article.

Author: weber
Publication date: 9-10-2018, 20:36
Category: Development / Programming