Data Acquisition

Arif Zainurrohman
Published in Nerd For Tech
4 min read, Jun 8, 2021


Data acquisition is all about obtaining the artifacts that contain the input
data from a variety of sources, extracting the data from the artifacts, and
converting it into representations suitable for further processing.

The three main sources of data are the Internet (namely, the World Wide Web), databases, and local files (possibly previously downloaded by hand or using additional software).

The data in those artifacts typically comes in one of two forms:

  1. Unstructured plain text in a natural language (such as English or Chinese)
  2. Structured data, including:
     • Tabular data in comma-separated values (CSV) files
     • Tabular data from databases
     • Tagged data in HyperText Markup Language (HTML) or, in general, in eXtensible Markup Language (XML)
     • Tagged data in JavaScript Object Notation (JSON)

Processing HTML Files

HTML Tags and Attributes

BeautifulSoup provides access to HTML tag attributes through a Python dictionary interface. If the tag object t represents a hyperlink (such as <a href="">), then the destination of the hyperlink is t["href"], which is already a plain string (it has no .string attribute of its own). Note that HTML tag and attribute names are case-insensitive.

Perhaps the most useful soup functions are soup.find() and soup.find_all(), which find the first instance or all instances of a certain tag.
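
The dictionary-style attribute access and the two search functions described above can be sketched as follows. This is a minimal illustration, assuming the bs4 package is installed; the inline HTML snippet and the example.com addresses are made up for the demonstration and do not come from the original article.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page, inlined so the sketch needs no network access.
HTML = """
<html><body>
  <a href="https://example.com/first" id="one">First</a>
  <A HREF="https://example.com/second">Second</A>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.find("a")       # only the first <a> tag
all_links = soup.find_all("a")    # every <a> tag, in document order

# Attributes behave like dictionary items and are plain strings.
first_href = first_link["href"]
# .get() returns None instead of raising KeyError for a missing attribute.
missing = first_link.get("title")

# html.parser lowercases tag and attribute names, so <A HREF=...> is
# reachable as "a" / "href" as well.
all_hrefs = [tag["href"] for tag in all_links]
print(all_hrefs)
```

Using .get() rather than square brackets is a reasonable choice when an attribute may be absent on some tags.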


Reading the HTML file

For example, we request a URL and load the response into the Python environment. We then pass the response to BeautifulSoup together with the name of an HTML parser (such as "html.parser") so that the entire HTML file is parsed. Finally, we print the first few lines of the parsed page.
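
The steps above might look like the sketch below. It assumes the bs4 package is installed. The helper name load_soup and the sample page are hypothetical; to keep the example runnable offline, a small local file addressed by a file:// URL stands in for a real web page, but any http(s) URL could be passed the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

from bs4 import BeautifulSoup


def load_soup(url):
    """Fetch the document at `url` and parse it with Python's built-in HTML parser."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


# Stand-in for a real web page: a temporary local file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text(
        "<html><head><title>Demo page</title></head>"
        "<body><h1>Hello</h1><p>Some text.</p></body></html>",
        encoding="utf-8",
    )
    soup = load_soup(page.as_uri())

# prettify() re-indents the parsed tree; keep only the first few lines.
first_lines = soup.prettify().splitlines()[:5]
print("\n".join(first_lines))
```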

[Figure: Reading the HTML File]

Extracting Tag Value

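We can extract the tag value from the first instance of a tag: soup.find() returns that first instance, and its text is available through the tag's .string attribute or its get_text() method.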
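
A minimal sketch of extracting a tag value, assuming the bs4 package is installed and using a made-up inline page rather than the article's original example:

```python
from bs4 import BeautifulSoup

# Hypothetical page content used only for this illustration.
HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></body></html>"
)
soup = BeautifulSoup(HTML, "html.parser")

# .string works when the tag contains exactly one piece of text.
title_text = soup.find("title").string

# The first <p> mixes text with a nested <b> tag, so .string is None there;
# get_text() concatenates all of the text inside the tag instead.
first_p = soup.find("p")
first_p_string = first_p.string
first_p_text = first_p.get_text()

print(title_text)
print(first_p_text)
```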

[Figure: Extract using HTML tag]

Handling CSV Files

CSV is a structured text file format used to store and move tabular or nearly tabular data. It dates back to 1972 and is a format of choice for Microsoft Excel, Apache OpenOffice Calc, and other spreadsheet software. Data.gov, a U.S. government website that provides access to publicly available data, alone provides 12,550 data sets in the CSV format.

Keep in mind that sometimes what looks like a delimiter is not a delimiter at all. To allow delimiter-like characters within a field as a part of the variable value (as in …,"Hello, world",…), enclose such fields in quote characters.
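
The effect of quoting, and one way to detect the delimiter of an unfamiliar file, can be sketched with the standard csv module. The sample strings are invented for illustration, and csv.Sniffer is a heuristic, so its guess is worth checking on real data.

```python
import csv
import io

# 1. A quoted field keeps its embedded comma instead of being split in two.
line = 'greeting,"Hello, world",en\n'
fields = next(csv.reader(io.StringIO(line)))
print(fields)

# 2. Guess the delimiter from a text sample (hypothetical semicolon-separated data).
sample = 'id;text;lang\n1;"Hello; world";en\n2;"Bonjour";fr\n'
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print(repr(dialect.delimiter))

# Parse the sample with the detected dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[1])
```

Restricting the candidate delimiters, as done here with the delimiters argument, tends to make the guess more predictable.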

[Figure: Detect Delimiter]

Reading CSV file using csv module

The csv.reader() function returns an iterator that yields each row of the file as a list of column values (all of them strings). You then pick the column that holds the variable you need, typically by its index.
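
A minimal sketch of that row-by-row reading follows. The column names and values are hypothetical, and an in-memory io.StringIO object stands in for a real file, which would normally be opened with open("data.csv", newline="").

```python
import csv
import io

# Hypothetical CSV content; a real script would open a file instead.
CSV_TEXT = "name,city,age\nAda,London,36\nLinus,Helsinki,28\n"

with io.StringIO(CSV_TEXT) as f:
    reader = csv.reader(f)
    header = next(reader)              # first row holds the column names
    city_idx = header.index("city")    # position of the column we want
    cities = [row[city_idx] for row in reader]

print(cities)
```

Looking up the index from the header row avoids hard-coding a column position that may change between files.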

[Figure: Reading CSV file]

Reading JSON Files

JSON is a lightweight data interchange format. Unlike pickle, JSON is language-independent but more restricted in terms of data representation.

JSON supports the following data types:

  • Atomic data types: strings, numbers, true, false, null
  • Arrays: an array corresponds to a Python list; it is enclosed in square brackets []; the items in an array don't have to be of the same data type: [1, 3.14, "a string", true, null]
  • Objects: an object corresponds to a Python dictionary; it is enclosed in curly braces {}; every item consists of a key and a value, separated by a colon: {"age" : 37, "gender" : "male", "married" : true}
  • Any recursive combinations of arrays, objects, and atomic data types (arrays of objects, objects with arrays as item values, and so on)
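
The type correspondence listed above can be checked with the standard json module. This sketch reuses the sample values from the list; the "scores" key and the temporary file name are assumptions added for the demonstration.

```python
import json
import tempfile
from pathlib import Path

text = (
    '{"age": 37, "gender": "male", "married": true, '
    '"scores": [1, 3.14, "a string", true, null]}'
)

# json.loads() parses a string: object -> dict, array -> list,
# true/false -> True/False, null -> None.
record = json.loads(text)
print(type(record).__name__, type(record["scores"]).__name__)

# json.dump() / json.load() do the same for files.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "record.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == record)
```
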

[Figure: Reading JSON file]

Database

Reading from a database is where the power of something like SQLite shows. Instead of reading the entire table, we can query just a single column, or select only the rows that match specific values.
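
Both kinds of query can be sketched with the standard sqlite3 module. The people table and its rows are hypothetical, and an in-memory database is used so that the example leaves no file behind.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ada", "London", 36), ("Linus", "Helsinki", 28), ("Grace", "London", 45)],
)

# Query a single column rather than the whole table.
names = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY name")]

# Query only the rows that match a specific value; the ? placeholder
# lets sqlite3 handle quoting of the parameter.
londoners = conn.execute(
    "SELECT name, age FROM people WHERE city = ? ORDER BY age", ("London",)
).fetchall()

conn.close()
print(names)
print(londoners)
```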

[Figure: Reading from database]

Conclusion

I'm sure we already have an idea of what data science is, but a reminder never hurts. Data science is the discipline of extracting knowledge from data. It relies on computer science (for data structures, algorithms, visualization, big data support, and general programming), statistics (for regressions and inference), and domain knowledge (for asking questions and interpreting results).

Regardless of the analysis type, data science is first a science and only then sorcery. As such, it is a process that follows a fairly rigorous basic sequence that starts with data acquisition and ends with a report of the results.

Reference

Dmitry Zinoviev, Data Science Essentials in Python, edited by Katharine Dvorak
