Web Scraping

import requests
from bs4 import BeautifulSoup
url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
response = requests.get(url)
html = BeautifulSoup(response.text, "html.parser")
print(html)

Here's what it does:

  1. import requests: Imports the requests library, which is used for making HTTP requests (like fetching web pages).
  2. from bs4 import BeautifulSoup: Imports the BeautifulSoup class from the bs4 library, which is used for parsing HTML and XML documents. 
  3. url = "...": Defines the URL of the web page you want to scrape.
  4. response = requests.get(url): Sends an HTTP GET request to the specified URL and stores the server's response in the response object.
  5. html = BeautifulSoup(response.text, "html.parser"):
    • response.text gets the HTML content of the response as a string.
    • BeautifulSoup(...) parses this HTML string using the "html.parser" (a built-in Python HTML parser) and creates a BeautifulSoup object. This object allows you to navigate and search the HTML.
  6. print(html): Prints the entire parsed HTML content to your console.
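
  Printing the whole parsed document is mostly useful as a first sanity check. As a follow-up, here is a minimal sketch (same URL; exactly what to inspect is just an illustrative choice) that checks the response status and looks at a few parsed elements instead of dumping everything:

  • import requests
    from bs4 import BeautifulSoup
    url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
    response = requests.get(url)
    # Confirm the request succeeded before parsing (200 means OK).
    print(response.status_code)
    html = BeautifulSoup(response.text, "html.parser")
    # Inspect a few pieces of the parsed page instead of printing it all.
    print(html.title)                 # the <title> tag, if present
    for a in html.find_all("a")[:5]:  # the first five links on the page
        print(a.get("href"))
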
  • import requests
    from bs4 import BeautifulSoup
    url = "https://www.ucl.ac.uk/lbs/inventories/"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    soup = soup.find("table")
    for row in soup.find_all("tr"):
        dat = row.find("div", attrs={"class": "six columns small"})
        if dat:
            name = dat.a.get_text()
            link = dat.a.get("href")
            print(name + " https://www.ucl.ac.uk" + link)


  • import requests

    • This line imports the requests library. This library is essential for making HTTP requests (like sending a GET request to a website to fetch its content).
  • from bs4 import BeautifulSoup

    • This line imports the BeautifulSoup class from the bs4 library. BeautifulSoup is a powerful library for parsing HTML and XML documents, making it easy to extract data from them.
  • url = "https://www.ucl.ac.uk/lbs/inventories/"

    • This line defines a string variable url which holds the URL of the webpage we intend to scrape. In this case, it's the "Inventories" page on the UCL Legacies of British Slave-ownership (LBS) website.
  • response = requests.get(url)

    • This line sends an HTTP GET request to the url. The server's response (which includes the HTML content of the page, status codes, headers, etc.) is stored in the response object.
  • response.raise_for_status()

    • This is a crucial line for error handling. If the HTTP request was unsuccessful (e.g., a 404 Not Found error, a 500 Internal Server Error, etc.), this method will raise an HTTPError exception. This prevents the script from proceeding with potentially empty or malformed data and makes debugging easier.
  • soup = BeautifulSoup(response.text, "html.parser")

    • response.text retrieves the entire HTML content of the webpage as a string.
    • BeautifulSoup(response.text, "html.parser") then takes this HTML string and parses it using Python's built-in "html.parser". This creates a BeautifulSoup object (conventionally named soup), which represents the HTML document as a nested data structure, allowing you to navigate and search it easily.
  • soup = soup.find("table")

    • This line modifies the soup object. Instead of representing the entire HTML document, soup is now redefined to represent only the first <table> tag found within the original HTML document. This is an efficient way to narrow down the search to a specific part of the page if you know your target data is within a table.
  • for row in soup.find_all("tr"):

    • Now, the code iterates through each row (<tr> tag) found within the soup object (which currently represents the <table>).
    • soup.find_all("tr") returns a list of all <tr> (table row) tags within the table. The for loop then processes each <tr> element one by one.
  • dat = row.find("div", attrs={"class": "six columns small"})

    • Inside each <tr> (table row), this line attempts to find a <div> tag that has specific class attributes: "six columns small".
    • attrs={"class": "six columns small"} is used to specify that we're looking for a div tag whose class attribute exactly matches this string. This is a common way to target specific elements based on their CSS classes.
  • if dat:

    • This is a conditional check. If dat is not None (meaning a <div> with the specified classes was found within the current <tr>), then the code inside the if block will execute. If dat is None (no such div was found in that particular row), the code skips to the next row.
  • name = dat.a.get_text()

    • If dat was found, this line goes one level deeper. It looks for an <a> (anchor/link) tag inside the dat (the <div> element).
    • .get_text() extracts the visible text content from within that <a> tag. This text is assumed to be the "name" you want to extract.
  • link = dat.a.get("href")

    • Again, targeting the <a> tag inside dat.
    • .get("href") is used to extract the value of the href attribute from the <a> tag. The href attribute contains the URL that the link points to. This URL is stored in the link variable.
  • print(name + " https://www.ucl.ac.uk" + link)

    • This line prints the extracted name and the link to the console.
    • Notice " https://www.ucl.ac.uk" + link. This is because often, links on websites are relative paths (e.g., /lbs/person/view/123). To make them full, clickable URLs, the base URL of the website (https://www.ucl.ac.uk) is concatenated with the extracted relative link. This ensures that the printed link is a complete and usable URL.
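
  Concatenating the base URL works here because the hrefs are simple relative paths. As an alternative, the standard library's urllib.parse.urljoin resolves relative links against a base URL more generally (it also copes with hrefs that are already absolute). A minimal sketch of the same loop using it:

  • import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    url = "https://www.ucl.ac.uk/lbs/inventories/"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table")
    for row in table.find_all("tr"):
        dat = row.find("div", attrs={"class": "six columns small"})
        if dat and dat.a:  # skip rows without the expected div or link
            name = dat.a.get_text()
            # urljoin resolves the relative href against the site's base URL
            print(name, urljoin("https://www.ucl.ac.uk", dat.a.get("href")))
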

  • import requests
    from bs4 import BeautifulSoup
    # define the URL as a variable
    url_base = "https://www.ucl.ac.uk/lbs/inventories/"
    # fetch the page and store the response
    response = requests.get(url_base)
    # use BeautifulSoup to take the full text of the response (response.text) and parse the page's HTML ("html.parser")
    soup = BeautifulSoup(response.text, "html.parser")
    soup = soup.find("table")
    # find the element in the HTML that lets us jump from record to record
    for row in soup.find_all("tr"):  # inside the table, find each tr tag
        dat = row.find_all("div", class_="six columns small")
        print(dat)


  • import requests: This line imports the requests library, which is used to make HTTP requests (like fetching web pages) from the internet.

  • from bs4 import BeautifulSoup: This line imports the BeautifulSoup object from the bs4 library. BeautifulSoup is excellent for parsing HTML and XML documents, making it easy to navigate and search their content.

  • url_base = "https://www.ucl.ac.uk/lbs/inventories/": This line defines a string variable url_base which holds the URL of the webpage the script will target.

  • response = requests.get(url_base): This line uses the requests.get() function to send an HTTP GET request to the url_base. The server's response (which includes the HTML content of the page) is stored in the response variable.

  • soup = BeautifulSoup(response.text,"html.parser"): Here, BeautifulSoup is used to parse the HTML content received from the response.

    • response.text gives us the HTML as a string.
    • "html.parser" tells BeautifulSoup to use Python's built-in HTML parser. The result is a BeautifulSoup object named soup that allows for easy searching and navigation of the HTML.
  • soup = soup.find("table"): This line is crucial. It modifies the soup object to only contain the first <table> tag found on the page and all its descendants. This means all subsequent searches (find_all) will be restricted to within that specific table, which is good for targeting data within a structured table.

  • for row in soup.find_all("tr"):: This loop iterates through each <tr> (table row) HTML tag found within the soup object (which is now just the table). For each row, the code will perform the operations inside the loop.

  • dat = row.find_all("div",class_="six columns small"): Inside the loop, for each <tr> (row), this line searches for all <div> (division) tags that have the exact class attribute class="six columns small". These div elements are likely where the specific data you're interested in is located. The results are stored in the dat variable as a list of BeautifulSoup tag objects.

  • print(dat): Finally, this line prints the dat variable to the console for each row. Since dat is a list of BeautifulSoup tag objects, you'll see the full HTML structure of the found div elements in your output.
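
  Because find_all() returns a list of tag objects, print(dat) shows raw HTML. Here is a small sketch, continuing with the same soup variable, of how you might print just the readable text of each matched div instead:

  • for row in soup.find_all("tr"):
        dat = row.find_all("div", class_="six columns small")
        for div in dat:
            # get_text(strip=True) reduces the tag's contents to clean text
            print(div.get_text(strip=True))
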

  • Find links:

  • import requests  # see https://requests.readthedocs.io/en/latest/
    from bs4 import BeautifulSoup  # see https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
    response = requests.get(url)
    html = BeautifulSoup(response.text, 'html.parser')
    links = html.find_all('a')
    url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/"
    for link in links:
        if link.get_text() == "Ver Mais":
            link_a_abrir = url_base + link.attrs['href']
            print(link_a_abrir)

  • 1. Importing Libraries:

    • import requests: This line imports the requests library, which is used to make HTTP requests (like fetching web pages) from the internet.
    • from bs4 import BeautifulSoup: This line imports the BeautifulSoup object from the bs4 library. BeautifulSoup is excellent for parsing HTML and XML documents, making it easy to navigate and search their content.

    2. Defining the URL:

    • url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name": This line defines a string variable url which holds the address of the webpage the script will target.

    3. Fetching the Webpage Content:

    • response = requests.get(url): This line uses the requests.get() function to send an HTTP GET request to the specified url. The server's response (which includes the HTML content of the page) is stored in the response variable.
    • html = BeautifulSoup(response.text, 'html.parser'): Here, BeautifulSoup is used to parse the HTML content received from the response. response.text gives us the HTML as a string, and 'html.parser' tells BeautifulSoup to use Python's built-in HTML parser. The result is a BeautifulSoup object (html) that we can easily search and navigate.

    4. Finding All Links:

    • links = html.find_all('a'): This line uses the find_all() method of the BeautifulSoup object to find all occurrences of the <a> (anchor) tag in the HTML. Anchor tags are used to define hyperlinks. All found <a> tags are stored in the links variable as a list.

    5. Defining the Base URL:

    • url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/": This line defines a base URL. Since the links on the page might be relative (e.g., just the path without the domain), this url_base will be prepended to them to create complete, absolute URLs.

    6. Iterating and Extracting Specific Links:

    • for link in links:: This loop iterates through each <a> tag found in the links list.
    • if link.get_text() == "Ver Mais":: Inside the loop, this if condition checks if the visible text of the current link (link.get_text()) is exactly "Ver Mais" (Portuguese for "See More"). This is how the code identifies the specific links it's interested in.
    • link_a_abrir = url_base + link.attrs['href']: If the link's text is "Ver Mais", this line constructs the full URL.
      • link.attrs['href']: This accesses the href attribute of the <a> tag, which contains the URL the link points to.
      • url_base + ...: The url_base is concatenated with the value of the href attribute to form a complete URL. This complete URL is stored in the link_a_abrir variable.
    • print(link_a_abrir): Finally, this line prints the constructed, full URL of the "Ver Mais" link to the console.
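
  Comparing link.get_text() with "Ver Mais" inside the loop works fine. As an alternative, BeautifulSoup's find_all() can filter on the link text directly through its string argument (this needs an exact text match, with no stray whitespace around "Ver Mais"). A sketch of the same idea:

  • import requests
    from bs4 import BeautifulSoup
    url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
    url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/"
    html = BeautifulSoup(requests.get(url).text, "html.parser")
    # find_all can match on the tag's visible text, so no manual comparison is needed
    for link in html.find_all("a", string="Ver Mais"):
        print(url_base + link["href"])
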

    Open all links on the page:

  • import requests  # see https://requests.readthedocs.io/en/latest/
    from bs4 import BeautifulSoup  # see https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    import time  # see https://docs.python.org/3/library/time.html#time.sleep
    url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
    response = requests.get(url)
    html = BeautifulSoup(response.text, 'html.parser')
    links = html.find_all('a')
    url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/"
    for link in links:
        if link.get_text() == "Ver Mais":
            link_a_abrir = url_base + link.attrs['href']
            abrir_link = requests.get(link_a_abrir)
            html_cada_link = BeautifulSoup(abrir_link.text, 'html.parser')
            print(link_a_abrir)
            print(html_cada_link)
            time.sleep(1)

  • 1. Importing Libraries:

    • import requests: Used to send HTTP requests to web servers (e.g., to fetch web pages).
    • from bs4 import BeautifulSoup: Essential for parsing HTML content, allowing you to navigate and extract data from web pages easily.
    • import time: This new import brings in the time module, which provides functions for time-related operations.

    2. Defining the Target URL:

    • url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name": This line stores the URL of the initial webpage we want to scrape.

    3. Fetching the Initial Page Content:

    • response = requests.get(url): Sends a GET request to the url and stores the server's response.
    • html = BeautifulSoup(response.text, 'html.parser'): Parses the HTML content from the response using BeautifulSoup, making it easy to search for elements.

    4. Finding All Links on the Initial Page:

    • links = html.find_all('a'): This line finds all <a> (anchor) tags on the initial page. These tags typically represent hyperlinks.

    5. Defining the Base URL:

    • url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/": This is a base URL used to construct complete URLs for any relative links found on the page.

    6. Iterating Through Links and Visiting "Ver Mais" Pages:

    • for link in links:: This loop iterates through every link (<a> tag) found on the initial page.
    • if link.get_text() == "Ver Mais":: Inside the loop, this condition checks if the visible text of the current link is "Ver Mais". This identifies the specific links of interest.
    • link_a_abrir = url_base + link.attrs['href']: If a "Ver Mais" link is found, this line constructs its full, absolute URL by combining the url_base with the link's href attribute value.
    • abrir_link = requests.get(link_a_abrir): This is a key new step. The script now sends another HTTP GET request, this time to the link_a_abrir (the full URL of the "Ver Mais" page). This effectively "opens" that link.
    • html_cada_link = BeautifulSoup(abrir_link.text, 'html.parser'): The HTML content of the newly opened "Ver Mais" page is then parsed by BeautifulSoup, allowing you to access its elements.
    • print(link_a_abrir): Prints the full URL of the "Ver Mais" page that was just opened.
    • print(html_cada_link): Prints the entire parsed HTML content of that specific "Ver Mais" page to the console. This lets you see the content of each linked page.
    • time.sleep(1): This is another crucial addition. It pauses the script's execution for 1 second after each request, so the server is not hit with a rapid burst of requests and the scraper is less likely to be blocked.
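
  The one-second pause is the simplest politeness measure. The sketch below adds two other common courtesies, a request timeout and an identifying User-Agent header; the header value is only an illustrative placeholder, not something the site requires:

  • import time
    import requests
    from bs4 import BeautifulSoup
    url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
    url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/"
    headers = {"User-Agent": "course-scraper-example (student project)"}  # placeholder identification
    html = BeautifulSoup(requests.get(url, headers=headers, timeout=30).text, "html.parser")
    for link in html.find_all("a"):
        if link.get_text() == "Ver Mais":
            link_a_abrir = url_base + link.attrs['href']
            # timeout stops the script from hanging forever on a slow response
            abrir_link = requests.get(link_a_abrir, headers=headers, timeout=30)
            print(link_a_abrir)
            time.sleep(1)  # pause between requests so the server is not flooded
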

  • Open links and collect data:

  • import requests  # see https://requests.readthedocs.io/en/latest/
    from bs4 import BeautifulSoup  # see https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    import time  # see https://docs.python.org/3/library/time.html#time.sleep
    url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
    response = requests.get(url)
    html = BeautifulSoup(response.text, 'html.parser')
    links = html.find_all('a')
    url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/"
    for link in links:
        if link.get_text() == "Ver Mais":
            link_a_abrir = url_base + link.attrs['href']
            abrir_link = requests.get(link_a_abrir)
            html_cada_link = BeautifulSoup(abrir_link.text, 'html.parser')
            titulos = html_cada_link.find_all("td",)
            for titulo in titulos:
                nome_jornal = titulo.getText()
            linhas = html_cada_link.find_all("td")
            for linha in linhas:
                texto_da_linha = linha.getText()
                if texto_da_linha.find("Local de Edição: ") > -1:
                    localidade = texto_da_linha.replace("Local de Edição: ", "")
                if texto_da_linha.find("Data de Início: ") > -1:
                    data = texto_da_linha.replace("Data de Início: ", "")
            print(nome_jornal + "; " + localidade + "; " + str(data))
            time.sleep(1)

  • 1. Importing Libraries:

    • import requests: Used to send HTTP requests to web servers (e.g., to fetch web pages).
    • from bs4 import BeautifulSoup: Essential for parsing HTML content, allowing you to navigate and extract data from web pages easily.
    • import time: Provides functions for time-related operations, specifically time.sleep() to introduce delays.

    2. Defining the Target URL:

    • url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name": This variable holds the URL of the initial webpage where the script starts its scraping process.

    3. Fetching and Parsing the Initial Page:

    • response = requests.get(url): Sends an HTTP GET request to the url and gets the response from the server.
    • html = BeautifulSoup(response.text, 'html.parser'): Parses the HTML content of the initial page into a BeautifulSoup object, making it searchable.

    4. Finding All Links and Defining Base URL:

    • links = html.find_all('a'): Finds all anchor (<a>) tags on the initial page, which represent hyperlinks.
    • url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/": This is the base URL used to construct complete URLs from relative paths found in the links.

    5. Iterating Through "Ver Mais" Links and Extracting Data:

    • for link in links:: This loop goes through each link found on the initial page.
    • if link.get_text() == "Ver Mais":: This condition checks if the text content of the current link is exactly "Ver Mais". This filters for the specific detail pages the script wants to visit.
    • link_a_abrir = url_base + link.attrs['href']: If it's a "Ver Mais" link, this line constructs the full, absolute URL to that detail page by combining the url_base with the link's href attribute.
    • abrir_link = requests.get(link_a_abrir): The script sends another GET request, this time to the constructed detail page URL, effectively "opening" that specific page.
    • html_cada_link = BeautifulSoup(abrir_link.text, 'html.parser'): The HTML content of this individual detail page is then parsed into a new BeautifulSoup object.

    6. Extracting Data from Each Detail Page:

    • titulos = html_cada_link.find_all("td",): This line attempts to find all <td> (table data) tags on the detail page. In many web tables, the very first <td> or a <td> with specific attributes might contain the main title or name. The comma after "td" is unnecessary and won't cause an error, but it's not part of standard usage.
    • for titulo in titulos:: This loop iterates through the found <td> elements.
    • nome_jornal = titulo.getText(): For each <td> element, its text content is extracted and assigned to nome_jornal. Potential Issue: This assumes that the first <td> or any <td> iterated over will consistently be the journal name. Without knowing the exact HTML structure of the detail pages, this might extract incorrect information if the journal name isn't reliably the first <td> or if there are many generic <td> tags before it. A more robust approach would be to look for specific IDs, classes, or patterns in the HTML.
    • linhas = html_cada_link.find_all("td"): This line again finds all <td> tags on the detail page.
    • for linha in linhas:: This inner loop iterates through all <td> elements on the current detail page.
    • texto_da_linha = linha.getText(): Extracts the text content of the current <td> element.
    • if texto_da_linha.find("Local de Edição: ") > -1:: Checks if the string "Local de Edição: " exists within the texto_da_linha. find() returns the starting index if found, or -1 if not.
    • localidade = texto_da_linha.replace("Local de Edição: ", ""): If "Local de Edição: " is found, it's removed from the string, leaving only the actual location, which is then stored in localidade.
    • if texto_da_linha.find("Data de Início: ") > -1:: Similarly, checks for "Data de Início: " in the line.
    • data = texto_da_linha.replace("Data de Início: ", ""): If "Data de Início: " is found, it's removed, and the remaining text (the date) is stored in data.
    • print(nome_jornal + "; " + localidade + "; " + str(data)): Finally, it prints the extracted nome_jornal, localidade, and data, separated by semicolons. Potential Issue: nome_jornal, localidade, and data are overwritten inside the loops above, so the print statement uses whatever values were assigned last for that detail page. Because nome_jornal ends up as the text of the last <td> iterated, while localidade and data come from whichever <td> elements happen to contain their marker strings, the printed fields can be mismatched if the page layout is not perfectly consistent.
    • time.sleep(1): Pauses the script for 1 second after processing each detail page. This is good practice for polite scraping and to avoid overloading the server or getting blocked.

    In essence, this script automates the process of digging deeper into a website, fetching data from linked pages, and attempting to parse specific fields from those pages. However, the data extraction part could be made more robust by using more specific HTML selectors (like id attributes or specific class names) if available, rather than relying solely on the general <td> tag and text content.
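
  One way to reduce the mismatched-field risk described above, without knowing the detail pages' exact markup, is to reset the three fields for every detail page and only print when all of them were found. A hedged sketch of that pattern for one already-parsed detail page (it reuses html_cada_link from the code above; the assumption that the first non-empty <td> holds the title is mine):

  • nome_jornal = None   # reset for each detail page so old values never leak in
    localidade = None
    data = None
    for linha in html_cada_link.find_all("td"):
        texto = linha.get_text(strip=True)
        if nome_jornal is None and texto:
            nome_jornal = texto  # assumed: the first non-empty cell holds the title
        if "Local de Edição:" in texto:
            localidade = texto.replace("Local de Edição:", "").strip()
        if "Data de Início:" in texto:
            data = texto.replace("Data de Início:", "").strip()
    if nome_jornal and localidade and data:  # print only when every field was found
        print(nome_jornal + "; " + localidade + "; " + data)
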


  • Collect data from the links on all results pages:

  • import requests  # see https://requests.readthedocs.io/en/latest/
    from bs4 import BeautifulSoup  # see https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    import time  # see https://docs.python.org/3/library/time.html#time.sleep
    import re  # see https://www.w3schools.com/python/python_regex.asp
    url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name"
    response = requests.get(url)
    html = BeautifulSoup(response.text, 'html.parser')
    text = html.find('h4').getText()
    numero = re.findall(r"\d+", text)
    numero_de_paginas = int(numero[0]) / 20 + 1
    pagina = 0
    while pagina < numero_de_paginas:
        url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=" + str(pagina) + "&ordenationCriteria=name"
        url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/"
        response = requests.get(url)
        html = BeautifulSoup(response.text, 'html.parser')
        links = html.find_all('a')
        for link in links:
            if link.get_text() == "Ver Mais":
                link_a_abrir = url_base + link.attrs['href']
                abrir_link = requests.get(link_a_abrir)
                html = BeautifulSoup(abrir_link.text, 'html.parser')
                titulos = html.find_all("td",)
                for titulo in titulos:
                    nome_jornal = titulo.getText()
                linhas = html.find_all("td")
                for linha in linhas:
                    texto_da_linha = linha.getText()
                    if texto_da_linha.find("Local de Edição: ") > -1:
                        localidade = texto_da_linha.replace("Local de Edição: ", "")
                    if texto_da_linha.find("Data de Início: ") > -1:
                        data = texto_da_linha.replace("Data de Início: ", "")
                print(nome_jornal + "; " + localidade + "; " + str(data))
                time.sleep(1)
        pagina += 1


  • 1. Initial Setup and Page Count Calculation

    • import requests, from bs4 import BeautifulSoup, import time, import re: These lines import the necessary libraries.

      • requests handles sending HTTP requests.
      • BeautifulSoup parses HTML.
      • time allows pausing the script.
      • re (regular expressions) is new and used for pattern matching in strings.
    • url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=0&ordenationCriteria=name": This is the URL of the initial search results page (page 0) that the script starts with.

    • response = requests.get(url): Fetches the HTML content of the initial page.

    • html = BeautifulSoup(response.text, 'html.parser'): Parses the fetched HTML using BeautifulSoup.

    • text = html.find('h4').getText(): This is a crucial new step for pagination. It finds the first <h4> HTML tag on the page and extracts its text content. This <h4> tag is presumed to contain information about the total number of search results (e.g., "Results 1 to 20 of 200").

    • numero = re.findall(r"\d+", text): The re module is used here. re.findall(r"\d+", text) searches the text extracted from the <h4> tag for all occurrences of one or more digits (\d+). It returns a list of all numbers found. For "Results 1 to 20 of 200", numero would likely be ['1', '20', '200'].

    • numero_de_paginas = int(numero[0]) / 20 + 1: This line estimates how many results pages need to be scraped.

      • It assumes that numero[0] (the first number found in the <h4> text) is the total number of records. Note: this is a potential logical error. If the heading reads something like "Results 1 to 20 of 200", the total is the last number, so numero[2] (here '200') would be the right value and numero[0] would just be '1'. The calculation only works if the first number in the heading really is the total count.
      • / 20: Divides the total number of records by 20, assuming 20 results per page.
      • + 1: Adds 1 so that a partial last page is still visited. This gives a reasonable upper bound on the number of pages to iterate through (see the page-count sketch after this walkthrough).

    2. Looping Through Pagination

    • pagina = 0: Initializes a variable pagina (page) to 0, representing the first page.

    • while pagina < numero_de_paginas:: This while loop controls the pagination. It continues as long as pagina is less than the calculated numero_de_paginas. This ensures the script visits every results page.

    • url = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/DisplayAdvancedSearchResults.php?viewPage=" + str(pagina) + "&ordenationCriteria=name": Inside the loop, the url is dynamically constructed for each page. str(pagina) inserts the current page number into the URL query parameter viewPage.

    • url_base = "https://portal.cehr.ft.lisboa.ucp.pt/BeliefAndCitizenship/BDImprensa/": The base URL for relative links is redefined in each iteration (though it could be defined once outside the loop).

    • response = requests.get(url): Fetches the HTML for the current results page.

    • html = BeautifulSoup(response.text, 'html.parser'): Parses the HTML of the current results page.

    3. Extracting Data from Individual Detail Pages

    This part of the code is largely similar to your previous version but now nested within the pagination loop:

    • links = html.find_all('a'): Finds all links on the current results page.

    • for link in links:: Iterates through each link on the current results page.

      • if link.get_text() == "Ver Mais":: Checks if the link's text is "Ver Mais".
      • link_a_abrir = url_base + link.attrs['href']: Constructs the full URL for the "Ver Mais" detail page.
      • abrir_link = requests.get(link_a_abrir): Fetches the HTML content of the detail page.
      • html = BeautifulSoup(abrir_link.text, 'html.parser'): Parses the HTML of the detail page. Important Note: This line reassigns the html variable, overwriting the BeautifulSoup object for the results page. This is generally okay as the code then proceeds to extract data from the html of the detail page, but it's something to be aware of in case you needed the results page html again later in this inner loop.
    • titulos = html.find_all("td",): As in the previous version, this finds every <td> tag on the detail page (the trailing comma is harmless), so the journal name is not targeted precisely. If the detail pages mark the title cell with a specific id, class, or style attribute, passing that attribute to find_all would make the extraction of nome_jornal much more robust.

      • for titulo in titulos:: Iterates through the found <td> elements.
      • nome_jornal = titulo.getText(): Extracts the text content of each <td>; after the loop, nome_jornal holds the text of the last one.
    • linhas = html.find_all("td"): Finds all <td> tags on the detail page (used for Local de Edição and Data de Início).

      • for linha in linhas:: Iterates through these <td> elements.
        • texto_da_linha = linha.getText(): Gets the text content of the current <td>.
        • if texto_da_linha.find("Local de Edição: ") > -1:: Checks for the "Local de Edição: " string.
        • localidade = texto_da_linha.replace("Local de Edição: ", ""): Extracts the location.
        • if texto_da_linha.find("Data de Início: ") > -1:: Checks for the "Data de Início: " string.
        • data = texto_da_linha.replace("Data de Início: ", ""): Extracts the date.
    • print(nome_jornal + "; " + localidade + "; " + str(data)): Prints the extracted journal name, location, and date, separated by semicolons. Potential Issue: As before, localidade and data are extracted in a loop that iterates through all <td> tags. If a detail page has multiple <td> elements containing the "Local de Edição: " or "Data de Início: " strings (which is unlikely but possible), or if their order is inconsistent, you might end up printing values that aren't perfectly aligned. However, for well-structured detail pages, this often works.

    • time.sleep(1): Pauses for 1 second after processing each detail page to avoid overwhelming the server.

    • pagina += 1: Increments the pagina counter, moving to the next results page for the while loop.
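
  As noted in step 1, the page-count arithmetic only works if the right number is pulled out of the <h4> text. Here is a small sketch of the same calculation done with math.ceil, using a made-up heading string in place of the real html.find('h4').getText() value:

  • import math
    import re
    text = "433 registos encontrados"   # hypothetical heading text, for illustration only
    numeros = re.findall(r"\d+", text)
    total_registos = int(numeros[0])    # assumes the first number is the total record count
    por_pagina = 20                     # results shown per page
    # ceil(433 / 20) = 22: 21 full pages of 20 results plus one page with the remaining 13
    numero_de_paginas = math.ceil(total_registos / por_pagina)
    print(numero_de_paginas)
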
