Data Acquisition

Published in Nerd For Tech, Jun 8, 2021

Data acquisition is all about obtaining the artifacts that contain the input
data from a variety of sources, extracting the data from the artifacts, and
converting it into representations suitable for further processing.

The three main sources of data are the Internet (namely, the World Wide Web), databases, and local files (possibly previously downloaded by hand or using additional software).

The data in these sources comes in two broad forms:

Unstructured plain text in a natural language (such as English or Chinese)
Structured data, including tabular data in comma-separated values (CSV) files; tabular data from databases; tagged data in HyperText Markup Language (HTML) or, in general, in eXtensible Markup Language (XML); and tagged data in JavaScript Object Notation (JSON)

Processing HTML Files

BeautifulSoup provides access to HTML tag attributes through a Python dictionary interface. If the object t represents a hyperlink (such as <a href="">), then the destination of the hyperlink is t["href"], which is already a plain string. Note that HTML tags are case-insensitive.

Perhaps the most useful soup methods are soup.find() and soup.find_all(), which return the first instance of a certain tag or a list of all its instances, respectively.
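As a minimal sketch of the dictionary interface and the two search methods, the snippet below parses a small inline HTML fragment. The fragment and the example.com URLs are made up for illustration, and the parser choice ("html.parser", from the standard library) is an assumption rather than something the article prescribes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
# The second link is written in upper case on purpose: tag and
# attribute names are case-insensitive in HTML.
HTML = """
<html><body>
<a href="https://example.com/first">First</a>
<A HREF="https://example.com/second">Second</A>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.find("a")      # first <a> tag, or None if there is none
all_links = soup.find_all("a")   # list of every <a> tag in the document

# Attribute access goes through the dictionary interface.
first_href = first_link["href"]
all_hrefs = [link["href"] for link in all_links]

print(first_href)
print(all_hrefs)
```

Because soup.find() returns None when nothing matches, real code would normally check the result before indexing into it.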

Reading the HTML file

For example, we request a URL and load the response into the Python environment. We then pass the HTML parser name ("html.parser") to BeautifulSoup so that it parses the entire HTML file. Finally, we print the first few lines of the HTML page.
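The steps above could look roughly like the following sketch. The helper names load_soup and first_lines are hypothetical, and the demo parses an inline sample string instead of a live page so that it runs without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def load_soup(url):
    """Download the page at `url` and parse it with the built-in HTML parser."""
    with urlopen(url) as response:
        html = response.read()
    return BeautifulSoup(html, "html.parser")


def first_lines(soup, n=5):
    """Return the first `n` lines of the pretty-printed document."""
    return "\n".join(soup.prettify().splitlines()[:n])


# Offline demo on a hypothetical page; with network access one could
# call load_soup() on a real URL instead.
sample = (
    "<html><head><title>Sample page</title></head>"
    "<body><p>Hello</p></body></html>"
)
soup = BeautifulSoup(sample, "html.parser")
preview = first_lines(soup, 3)
print(preview)
```

With a connection available, the same preview could be produced from a live page by replacing the sample with load_soup("https://example.com"), where the URL is only a placeholder.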

Extracting Tag Value

We can extract the value of the first instance of a tag by finding the tag and reading its string attribute.
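A short sketch of that extraction, assuming a made-up page with one title and two headings:

```python
from bs4 import BeautifulSoup

# Hypothetical page, used only for illustration.
HTML = (
    "<html><head><title>Sample page</title></head>"
    "<body><h1>First heading</h1><h1>Second heading</h1></body></html>"
)

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching tag; .string is the text inside it.
title_text = soup.find("title").string
first_h1 = soup.find("h1").string

print(title_text)
print(first_h1)
```

Note that .string is only set when the tag contains a single piece of text; for tags with nested markup, get_text() is usually the safer choice.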

Handling CSV Files

CSV is a structured text file format used to store and move tabular or nearly tabular data. It dates back to 1972 and is a format of choice for Microsoft Excel, Apache OpenOffice Calc, and other spreadsheet software. Data.gov, a U.S. government website that provides access to publicly available data, alone provides 12,550 data sets in the CSV format.

Keep in mind that sometimes what looks like a delimiter is not a delimiter at all. To allow delimiter-like characters within a field as a part of the variable value (as in …,"Hello, world",…), enclose the fields in quote characters.
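To see the quoting rule in action, here is a small sketch using the standard csv module on in-memory text. The column names and values are invented for the example.

```python
import csv
import io

# One header row and one data row; the greeting contains a comma,
# so it is wrapped in quote characters.
raw = 'id,greeting,lang\n1,"Hello, world",en\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # the quoted field stays a single value

# Writing works the other way round: the writer adds quotes
# only to the fields that need them.
buffer = io.StringIO()
csv.writer(buffer).writerow(["2", "Goodbye, world", "en"])
written = buffer.getvalue()
print(written)
```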

Reading a CSV File Using the csv Module

The csv.reader function returns an iterator that yields each row of the file as a list of its column values, all of them strings. You then pick the column you need by indexing into each row.
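A minimal sketch of that workflow follows. The file people.csv and its columns are hypothetical; the snippet creates the file in a temporary directory first so that it can run on its own.

```python
import csv
import os
import tempfile

# Create a small hypothetical CSV file to read back.
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w", newline="") as f:
    f.write("name,age,city\nAda,36,London\nLinus,28,Helsinki\n")

with open(path, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)            # first row holds the column names
    age_index = header.index("age")  # position of the column we want
    # Every value arrives as a string, so convert explicitly.
    ages = [int(row[age_index]) for row in reader]

print(header)
print(ages)
```

Looking the column up by name in the header row, rather than hard-coding its position, keeps the code working if the column order changes.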

Reading JSON Files

JSON is a lightweight data interchange format. Unlike pickle, JSON is language-independent but more restricted in terms of data representation.

JSON supports the following data types:

Atomic data types: strings, numbers, true, false, null
Arrays: an array corresponds to a Python list; it is enclosed in square brackets []; the items in an array don't have to be of the same data type: [1, 3.14, "a string", true, null]
Objects: an object corresponds to a Python dictionary; it is enclosed in curly braces {}; every item consists of a key and a value, separated by a colon: {"age" : 37, "gender" : "male", "married" : true}
Any recursive combinations of arrays, objects, and atomic data types (arrays of objects, objects with arrays as item values, and so on)
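The mapping between JSON and Python types can be checked with the standard json module. The sketch below reuses the object and array from the list above; the "tags" key and the record.json file name are assumptions added for the example.

```python
import json
import os
import tempfile

# The object and array from the text, combined into one document.
text = (
    '{"age": 37, "gender": "male", "married": true, '
    '"tags": [1, 3.14, "a string", true, null]}'
)

record = json.loads(text)  # JSON text -> Python objects
print(type(record), type(record["tags"]))
print(record["married"], record["tags"][-1])

# Reading from a file works the same way with json.load().
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w") as f:
    json.dump(record, f)
with open(path) as f:
    reloaded = json.load(f)
```

Here the JSON object becomes a dict, the array becomes a list, true becomes True, and null becomes None.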

Database

Reading from a database is where the power of something like SQLite becomes apparent. While we can query the entire table, we can instead query just a single column, or select only the rows that match specific values.
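The three kinds of query mentioned above might look like this with the standard sqlite3 module. The people table and its contents are invented for illustration, and an in-memory database stands in for a real database file.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ada", 36, "London"), ("Linus", 28, "Helsinki"), ("Grace", 45, "London")],
)

# 1. The entire table.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# 2. A single column.
names = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY name")]

# 3. Only the rows that match a specific value (parameterized query).
londoners = conn.execute(
    "SELECT name, age FROM people WHERE city = ? ORDER BY name", ("London",)
).fetchall()

conn.close()

print(len(all_rows), names, londoners)
```

Passing the city as a ? parameter rather than formatting it into the SQL string lets the driver handle quoting and avoids SQL injection.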

Conclusion

I'm sure we already have an idea about what data science is, but it never hurts to restate it. Data science is the discipline of the extraction of knowledge from data. It relies on computer science (for data structures, algorithms, visualization, big data support, and general programming), statistics (for regressions and inference), and domain knowledge (for asking questions and interpreting results).

Regardless of the analysis type, data science is first a science and only then sorcery. As such, it is a process that follows a fairly rigorous basic sequence that starts with data acquisition and ends with a report of the results.

Reference

Dmitry Zinoviev, Data Science Essentials in Python, edited by Katharine Dvorak

Written by Arif Zainurrohman